🤖 Robotics arXiv Digest

Monday, May 4, 2026

📄 30 papers 📂 7 research areas ✨ Generated by Claude

🔭 Research Landscape

Today's batch shows a robotics community focused on making Vision-Language-Action (VLA) models practical for real deployment. MolmoAct2 sets a new bar as a fully open VLA with a flow-matching action expert, adaptive-depth reasoning, and the largest open bimanual dataset (720 hours), while Latent Bridge tackles the inference bottleneck by predicting VLM output deltas, cutting backbone calls by 50–75% while retaining 95–100% of performance. Meanwhile, Seeing Realism from Simulation addresses data hunger by converting simulated VLA videos into realistic training data via conditional video transfer, improving RDT-1B by 8% on Robotwin 2.0 and π₀ by 5.1% on LIBERO-Plus. Together, these three papers outline a complete pipeline: generate cheap sim data (Seeing Realism), train powerful open models (MolmoAct2), and deploy them efficiently (Latent Bridge).

A second major thread is the convergence of classical optimization with learning-based methods. OT-MPC replaces the information-theoretic foundations of MPPI/CEM with entropy-regularized optimal transport to avoid mode-averaging in complex cost landscapes, while the NANO filter reframes Gaussian filtering through information geometry, using natural-gradient updates for state estimation. On the manipulation side, PIEGraph fuses an analytical spring-mass model with an equivariant GNN for data-efficient rigid and deformable object dynamics, and ShapeGrasp couples visuo-haptic shape completion with physics-based grasp planning, refining the object shape after each grasp attempt. Both report gains over their baselines, which suggests that hybrid physics-plus-learning pipelines are a productive direction.

Navigation research today emphasizes robustness across environmental conditions and sensor modalities. LTR² introduces the first cross-modal LiDAR-teach/radar-repeat system, validated over 40+ km and 6 months, while DynoSLAM embeds stochastic GNN-based pedestrian prediction directly into the SLAM factor graph. The procedural map generator study (Beyond Specialization) provides evidence that training diversity is the primary driver of navigation policy generalization: a policy trained on the combined generator set reaches 91.5% mean success across all environments, whereas a sparse-only specialist drops to 3.3% on mazes.

VLA & Foundation Models

Open VLAs, efficient inference, sim-to-real video transfer, and VLM-integrated navigation

#6 MolmoAct2
#7 Seeing Realism from Simulation
#11 Latent Bridge
#23 Semantic Autonomy Framework

Manipulation & Grasping

Physics-augmented dynamics, visuo-haptic shape completion, desk organization, and mobile grasping

#1 PIEGraph
#15 Robotic Desk Organization
#16 ShapeGrasp
#20 Visibility-Aware Mobile Grasping

Navigation & SLAM

UAV planning, cross-modal teach-and-repeat, dynamic SLAM, and RL navigation generalization

Control & State Estimation

Optimal transport MPC, adaptive aerial manipulation, geometry-aware filtering, and SE(3) derivatives

#2 OT-MPC
#3 Robust Adaptive Aerial MPC
#13 NANO Filter
#24 SE(3) Higher-Order Derivatives

Perception & Scene Understanding

Monocular depth grounding, open-set segmentation, temporally consistent pose, and indoor scene synthesis

#4 AnchorD Depth Grounding
#5 Hyp2Former
#17 Temporally Consistent 6D Pose
#18 ZoneMaestro Scene Generation

Human-Robot Interaction & Assistive Robotics

Affective touch, shared autonomy with impedance guidance, tensegrity crutches, and exoskeleton gait

#8 Robotic Affection
#9 Tensegrity Crutches
#12 Exoskeleton Adaptive Gait
#21 IAGF Shared Autonomy

Robot Learning & Multi-Robot Systems

RL generalizability analysis, sim-to-real for aquatic robots, multi-robot AoI optimization, and parallel manipulator kinematics

#10 AoI-Aware Multi-Robot
#19 ASV Waste Capture Sim-to-Real
#25 SHAP Analysis for RL
#30 Parallel Manipulator Configurations

VLA & Foundation Models

6 h=27

MolmoAct2: Action Reasoning Models for Real-world Deployment

2026-05-04 cs.RO Jaemin Cho · h=27

Haoquan Fang, Jiafei Duan, Donovan Clay, Sam Wang, Shuo Liu

Core Contributions

First fully open VLA (weights, code, data) to outperform Pi-05 across 7 simulation and real-world benchmarks, while its MolmoER backbone surpasses GPT-5 and Gemini Robotics ER-1.5 on 13 embodied-reasoning benchmarks
Introduces flow-matching continuous-action expert grafted onto a discrete-token VLM via per-layer KV-cache conditioning — bridging the gap between language model architectures and continuous robot control
Releases MolmoAct2-BimanualYAM: 720 hours of teleoperated bimanual data, the largest open bimanual manipulation dataset to date
MolmoThink adaptive-depth reasoning re-predicts depth tokens only for changed scene regions, retaining geometric grounding while drastically cutting latency versus full re-computation
OpenFAST action tokenizer trained on millions of trajectories across 5 embodiments provides a standardized action representation for cross-platform VLA deployment

Abstract

Vision-Language-Action (VLA) models aim to provide a single generalist controller for robots, but today's systems fall short on the criteria that matter for real-world deployment. Frontier models are closed, open-weight alternatives are tied to expensive hardware, reasoning-augmented policies pay prohibitive latency for their grounding, and fine-tuned success rates remain below the threshold for dependable use. We present MolmoAct2, a fully open action reasoning model built for practical deployment, advancing its predecessor along five axes. We introduce MolmoER, a VLM backbone specialized for spatial and embodied reasoning, trained on a 3.3M-sample corpus with a specialize-then-rehearse recipe. We release three new datasets spanning low-to-medium cost platforms, including MolmoAct2-BimanualYAM, 720 hours of teleoperated bimanual trajectories that constitute the largest open bimanual dataset to date, together with quality-filtered Franka (DROID) and SO100/101 subsets. We provide OpenFAST, an open-weight, open-data action tokenizer trained on millions of trajectories across five embodiments. We redesign the architecture to graft a flow-matching continuous-action expert onto a discrete-token VLM via per-layer KV-cache conditioning. Finally, we propose MolmoThink, an adaptive-depth reasoning variant that re-predicts depth tokens only for scene regions that change between timesteps, retaining geometric grounding at a fraction of prior latency. In the most extensive empirical study of any open VLA to date, spanning 7 simulation and real-world benchmarks, MolmoAct2 outperforms strong baselines including Pi-05, while MolmoER surpasses GPT-5 and Gemini Robotics ER-1.5 across 13 embodied-reasoning benchmarks. We release model weights, training code, and complete training data.
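The adaptive-depth idea behind MolmoThink (re-predicting depth tokens only for regions that changed between timesteps) can be illustrated with a toy cache. This is a minimal sketch under stated assumptions, not MolmoAct2's implementation: the per-region patches, the mean-absolute-difference change test, the `threshold` value, and the `toy_predictor` stub are all hypothetical.

```python
from typing import Callable, List, Tuple

Patch = List[float]  # flattened pixel intensities of one image region


def mean_abs_diff(a: Patch, b: Patch) -> float:
    """Average absolute per-pixel change between two versions of a region."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def update_depth_tokens(
    prev_patches: List[Patch],
    curr_patches: List[Patch],
    cached_tokens: List[int],
    predict_depth_token: Callable[[Patch], int],
    threshold: float = 0.1,  # assumed change threshold, not from the paper
) -> Tuple[List[int], int]:
    """Reuse cached tokens for unchanged regions; re-predict only changed ones."""
    tokens = list(cached_tokens)
    recomputed = 0
    for i, (prev, curr) in enumerate(zip(prev_patches, curr_patches)):
        if mean_abs_diff(prev, curr) > threshold:
            tokens[i] = predict_depth_token(curr)
            recomputed += 1
    return tokens, recomputed


def toy_predictor(patch: Patch) -> int:
    """Hypothetical stand-in for the model's depth-token predictor."""
    return round(10 * sum(patch) / len(patch))


prev_frame = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
curr_frame = [[0.0, 0.0], [0.9, 0.9], [1.0, 1.0]]  # only the middle region changed
cached = [toy_predictor(p) for p in prev_frame]
tokens, n_recomputed = update_depth_tokens(prev_frame, curr_frame, cached, toy_predictor)
print(tokens, n_recomputed)
```

In this toy case only one of three regions is re-predicted, which is the source of the latency saving the abstract describes.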

7 h=19

Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation

2026-05-04 cs.CV cs.RO Shan You · h=19

Chenyu Hui, Xiaodi Huang, Siyu Xu, Yunke Wang, Shan You

Core Contributions

Converts simulated VLA videos into photorealistic training data while preserving exact action trajectories and task semantics — unlike prior sim-to-real transfer that focuses on static images or loses action alignment
Diffusion feature-reuse mechanism shares video tokens across adjacent timesteps, making generation practical at the scale needed for VLA training rather than prohibitively expensive
Coreset sampling strategy identifies a compact, non-redundant subset of simulation data for augmentation, maximizing diversity under fixed compute budgets
Improves RDT-1B by 8% on Robotwin 2.0 and π₀ by 5.1% on the challenging LIBERO-Plus benchmark — demonstrating consistent gains across different VLA architectures

Abstract

Vision-language-action (VLA) models typically rely on large-scale real-world videos, whereas simulated data, despite being inexpensive and highly parallelizable to collect, often suffers from a substantial visual domain gap and limited environmental diversity, resulting in weak real-world generalization. We present an efficient video augmentation framework that converts simulated VLA videos into realistic training videos while preserving task semantics and action trajectories. Our pipeline extracts structured conditions from simulation via video semantic segmentation and video captioning, rewrites captions to diversify environments, and uses a conditional video transfer model to synthesize realistic videos. To make augmentation practical at scale, we introduce a diffusion feature-reuse mechanism that reuses video tokens across adjacent timesteps to accelerate generation, and a coreset sampling strategy that identifies a compact, non-redundant subset for augmentation under limited computation. Extensive experiments on Robotwin 2.0, LIBERO, LIBERO-Plus, and a real robotic platform demonstrate consistent improvements.
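The abstract does not specify how the coreset is chosen. One common way to pick a compact, non-redundant subset is greedy k-center (farthest-point) selection over clip features, sketched below. Treat it as an illustration of the general idea rather than the authors' sampling strategy; the 2D feature vectors and the function name are hypothetical.

```python
import math
from typing import List, Sequence


def k_center_greedy(features: Sequence[Sequence[float]], k: int, start: int = 0) -> List[int]:
    """Greedily pick k indices, each time taking the point farthest from all picks so far."""
    if not 0 < k <= len(features):
        raise ValueError("k must be between 1 and the number of items")
    selected = [start]
    # Distance from every item to its nearest already-selected item.
    min_dist = [math.dist(f, features[start]) for f in features]
    while len(selected) < k:
        nxt = max(range(len(features)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, f in enumerate(features):
            min_dist[i] = min(min_dist[i], math.dist(f, features[nxt]))
    return selected


# Hypothetical per-clip embeddings: two near-duplicate pairs and one outlier.
clips = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 5.0)]
picked = k_center_greedy(clips, k=3)
print(picked)
```

With a budget of three clips, the selection keeps one clip from each cluster and skips the near-duplicates, which is the behavior a fixed augmentation budget calls for.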

11 h=13

Latent Bridge: Feature Delta Prediction for Efficient Dual-System Vision-Language-Action Model Inference

2026-05-04 cs.RO Taotao Jing · h=13

Yudong Liu, Yuan Li, Zijia Tang, Yuxi Zheng, Yueqian Lin

Core Contributions

Identifies that VLM backbone features are temporally redundant in dual-system VLAs, and exploits this by predicting feature deltas rather than recomputing full VLM outputs at every control step
Demonstrates generality across two architecturally distinct VLAs — GR00T-N1.6 (feature-space bridge) and π0.5 (KV-cache bridge) — showing the approach is not architecture-specific
Achieves 95–100% performance retention while reducing expensive VLM calls by 50–75%, yielding 1.65–1.73× net per-episode speedup across LIBERO, RoboCasa, and ALOHA benchmarks
Task-agnostic DAgger training pipeline transfers across benchmarks without modification, avoiding the need for task-specific fine-tuning of the bridge module

Abstract

Dual-system Vision-Language-Action (VLA) models achieve state-of-the-art robotic manipulation but are bottlenecked by the VLM backbone, which must execute at every control step while producing temporally redundant features. We propose Latent Bridge, a lightweight model that predicts VLM output deltas between timesteps, enabling the action head to operate on predicted outputs while the expensive VLM backbone is called only periodically. We instantiate Latent Bridge on two architecturally distinct VLAs: GR00T-N1.6 (feature-space bridge) and π0.5 (KV-cache bridge), demonstrating that the approach generalizes across VLA designs. Our task-agnostic DAgger training pipeline transfers across benchmarks without modification. Across four LIBERO suites, 24 RoboCasa kitchen tasks, and the ALOHA sim transfer-cube task, Latent Bridge achieves 95-100% performance retention while reducing VLM calls by 50-75%, yielding 1.65-1.73x net per-episode speedup.
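The control flow described above (call the backbone only periodically, and advance its output with predicted deltas in between) can be sketched as a simple loop. This is a toy illustration, not the paper's code: the fixed `refresh_every` schedule, the list-of-floats features, and the `toy_vlm`, `toy_bridge`, and `toy_action_head` stubs are assumptions.

```python
from typing import Callable, List, Sequence, Tuple

Features = List[float]


def run_episode(
    observations: Sequence[float],
    vlm: Callable[[float], Features],
    bridge: Callable[[Features, float], Features],
    action_head: Callable[[Features], float],
    refresh_every: int,
) -> Tuple[List[float], int]:
    """Run the action head every step, but the expensive VLM only every `refresh_every` steps."""
    features: Features = []
    vlm_calls = 0
    actions: List[float] = []
    for t, obs in enumerate(observations):
        if t % refresh_every == 0:
            features = vlm(obs)  # expensive backbone call
            vlm_calls += 1
        else:
            delta = bridge(features, obs)  # cheap predicted change in VLM output
            features = [f + d for f, d in zip(features, delta)]
        actions.append(action_head(features))
    return actions, vlm_calls


# Hypothetical stand-ins so the sketch runs end to end.
def toy_vlm(obs: float) -> Features:
    return [obs, 2.0 * obs]


def toy_bridge(features: Features, obs: float) -> Features:
    return [1.0, 2.0]  # exact delta for observations that grow by 1 per step


def toy_action_head(features: Features) -> float:
    return sum(features)


obs_seq = [float(t) for t in range(8)]
actions, vlm_calls = run_episode(obs_seq, toy_vlm, toy_bridge, toy_action_head, refresh_every=4)
reduction = 1.0 - vlm_calls / len(obs_seq)
print(vlm_calls, reduction)
```

Under a fixed schedule, refreshing every 2nd or 4th step would correspond to a 50% or 75% cut in backbone calls. Whether the paper uses a fixed interval is not stated in the abstract.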

23 h=5

A Semantic Autonomy Framework for VLM-Integrated Indoor Mobile Robots: Hybrid Deterministic Reasoning and Cross-Robot Adaptive Memory

2026-05-04 cs.RO cs.AI B. Abaza · h=5

Bogdan Felician Abaza, Andrei-Alexandru Staicu, Cristian Vasile Doicin

Core Contributions

Handles 88% of natural language navigation instructions in under 0.1 ms via a seven-step parametric resolver and escalates only genuinely ambiguous instructions to the VLM, making VLM-integrated navigation feasible on a Raspberry Pi 5 without a GPU
Introduces cross-robot semantic memory transfer: preferences learned through VLM interactions on one robot are compiled into a shared digest and transferred to a second robot, achieving a measured 103,000-fold latency reduction
Validates 100% semantic transfer accuracy (33/33 decisions, 95% CI [0.894, 1.000]) across two custom differential-drive robots over three sessions with zero training data required
Unlike prior VLM-for-robotics work that treats the language model as always-on, this system explicitly manages the compute/accuracy tradeoff by categorizing instructions by ambiguity level

Abstract

Autonomous indoor mobile robots can navigate reliably to metric coordinates using established frameworks such as ROS 2 Navigation 2, yet they lack the ability to interpret natural language instructions that express intent rather than positions. Vision-Language Models offer the semantic reasoning required to bridge this gap, but their inference latency (2-9 seconds per decision on consumer hardware) and session-by-session amnesia limit practical deployment. This paper presents the Semantic Autonomy Stack, a six-layer reference framework for semantically autonomous indoor navigation, and validates a complete instance featuring hybrid deterministic-VLM reasoning and cross-robot adaptive memory on physical robots with off-the-shelf edge hardware.
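The confidence interval reported for 33/33 correct decisions can be checked by hand. It is consistent with the exact (Clopper-Pearson) binomial interval, whose lower bound has a closed form when every trial succeeds: solve p^n = α/2 for p. The summary does not say which method the authors used, so reading it as Clopper-Pearson is an assumption.

```python
from typing import Tuple


def exact_ci_all_successes(n: int, confidence: float = 0.95) -> Tuple[float, float]:
    """Two-sided Clopper-Pearson interval for the special case of n successes in n trials."""
    alpha = 1.0 - confidence
    lower = (alpha / 2.0) ** (1.0 / n)  # solves lower**n == alpha / 2
    return lower, 1.0


lower, upper = exact_ci_all_successes(33)
print(f"[{lower:.3f}, {upper:.3f}]")
```

The lower bound of roughly 0.894 is a reminder that a perfect score on 33 trials still leaves room for a true accuracy near 90%.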

Manipulation & Grasping

1 h=49

Learning Equivariant Neural-Augmented Object Dynamics From Few Interactions

2026-05-04 cs.RO cs.AI cs.CV cs.LG G. Konidaris · h=49

Sergio Orozco, Tushar Kusnur, Brandon May, George Konidaris, Laura Herlant

Core Contributions

Combines a physically informed spring-mass analytical model with an equivariant GNN to achieve accurate dynamics prediction for both rigid and deformable objects from limited real-world interactions, whereas purely data-driven particle models may require large amounts of interaction data
The equivariant GNN exploits symmetries in particle interactions via a novel action representation, guiding the analytical model rather than replacing it, which enforces physically feasible motion over long horizons
Validated on robot hardware for reorientation and repositioning of ropes, cloth, stuffed animals, and rigid objects — demonstrating breadth across object categories that most prior methods handle individually
Enables reliable downstream manipulation planning, outperforming state-of-the-art baselines in both prediction accuracy and task success on physical robots

Abstract

Learning data-efficient object dynamics models for robotic manipulation remains challenging, especially for deformable objects. A popular approach is to model objects as sets of 3D particles and learn their motion using graph neural networks. In practice, this is not enough to maintain physical feasibility over long horizons and may require large amounts of interaction data to learn. We introduce PIEGraph, a novel approach to combining analytical physics and data-driven models to capture object dynamics for both rigid and deformable bodies using limited real-world interaction data. PIEGraph consists of two components: (1) a Physically Informed particle-based analytical model (implemented as a spring-mass system) to enforce physically feasible motion, and (2) an Equivariant Graph Neural Network with a novel action representation that exploits symmetries in particle interactions to guide the analytical model. We evaluate PIEGraph in simulation and on robot hardware for reorientation and repositioning tasks with ropes, cloth, stuffed animals and rigid objects. We show that our method enables accurate dynamics prediction and reliable downstream robotic manipulation planning, which outperforms state of the art baselines.
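To make the two-component structure concrete, here is a minimal sketch of a damped spring-mass step with a hook for a learned term. How PIEGraph's GNN actually guides the analytical model is not described in the abstract; modelling it as an additive per-particle force (`correction`) is an assumption, as are the stiffness, damping, and time-step values.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vec = Tuple[float, float]
Spring = Tuple[int, int, float]  # (particle i, particle j, rest length)


def spring_forces(pos: Sequence[Vec], springs: Sequence[Spring], stiffness: float) -> List[Vec]:
    """Hooke's-law forces on each particle from all springs attached to it."""
    fx = [0.0] * len(pos)
    fy = [0.0] * len(pos)
    for i, j, rest in springs:
        dx, dy = pos[j][0] - pos[i][0], pos[j][1] - pos[i][1]
        length = math.hypot(dx, dy)
        if length == 0.0:
            continue  # direction undefined for coincident particles
        mag = stiffness * (length - rest)
        fx[i] += mag * dx / length
        fy[i] += mag * dy / length
        fx[j] -= mag * dx / length
        fy[j] -= mag * dy / length
    return list(zip(fx, fy))


def step(
    pos: Sequence[Vec],
    vel: Sequence[Vec],
    springs: Sequence[Spring],
    correction: Callable[[Sequence[Vec], Sequence[Vec]], Sequence[Vec]],
    stiffness: float = 50.0,
    mass: float = 1.0,
    damping: float = 0.5,
    dt: float = 0.01,
) -> Tuple[List[Vec], List[Vec]]:
    """One semi-implicit Euler step: analytical forces plus a learned correction."""
    forces = spring_forces(pos, springs, stiffness)
    learned = correction(pos, vel)  # stand-in for the GNN output (assumed to be a force)
    new_pos: List[Vec] = []
    new_vel: List[Vec] = []
    for p, v, f, c in zip(pos, vel, forces, learned):
        vx = v[0] + dt * (f[0] + c[0] - damping * v[0]) / mass
        vy = v[1] + dt * (f[1] + c[1] - damping * v[1]) / mass
        new_vel.append((vx, vy))
        new_pos.append((p[0] + dt * vx, p[1] + dt * vy))
    return new_pos, new_vel


def no_correction(pos: Sequence[Vec], vel: Sequence[Vec]) -> List[Vec]:
    """Hypothetical placeholder: a trained model would return nonzero terms here."""
    return [(0.0, 0.0) for _ in pos]


stretched = [(0.0, 0.0), (2.0, 0.0)]
at_rest = [(0.0, 0.0), (0.0, 0.0)]
new_pos, new_vel = step(stretched, at_rest, [(0, 1, 1.0)], no_correction)
print(new_pos)
```

Keeping the spring model in the loop means every predicted state comes from integrating forces, which is one way to read the abstract's point about physical feasibility over long horizons.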

15 h=9

Robotic Desk Organization: A Multi-Primitive Approach to Manipulating Heterogeneous Objects via Environmental Constraints

2026-05-04 cs.RO Jinjun Duan · h=9

Yi Dong, Yangjun Liu, Jinjun Duan, Yang Li, Zhendong Dai

Core Contributions

Introduces environment-assisted manipulation primitives — contact-based grasping, edge-based push-grasping, and levering-based grasping — that exploit table edges and inter-object constraints rather than relying solely on gripper dexterity
Handles both rigid and deformable planar objects with a unified task planner, unlike prior work that typically addresses one object type
Perception pipeline augments existing datasets with uncommon desktop items and performs geometry-based pose and keypoint estimation alongside environmental constraint detection
Real-world experiments demonstrate robust multi-object organization including collection and stacking tasks across heterogeneous object sets

Abstract

Desktop organization remains challenging for service robots because of heterogeneous objects and diverse manipulation objectives, such as collection and stacking. In this article, a task-oriented framework is presented for organizing planar rigid and deformable objects on desks. A perception pipeline was developed that augments existing datasets with uncommon desktop items and makes geometry-based pose and keypoint estimation possible, along with the detection of environmental constraints, such as table edges. To handle diverse manipulation requirements, environment-assisted primitives are used, including contact-based grasping for small objects, edge-based push-grasping for planar rigid objects, and levering-based grasping for planar deformable objects. These primitives leverage environmental and interobject constraints to improve robustness. A task planner was designed to integrate these primitives into multiobject organization.

16 h=7

ShapeGrasp: Simultaneous Visuo-Haptic Shape Completion and Grasping for Improved Robot Manipulation

2026-05-04 cs.RO Lukas Rustler · h=7

Lukas Rustler, Matej Hoffmann

Core Contributions

First approach to update object shape representations after real-world grasp attempts — each grasp yields tactile contacts and gripper-occupied space that refine the implicit surface model for subsequent attempts
Couples implicit surface visuo-haptic shape completion with physics-based grasp planning in an iterative loop: infer shape → plan grasp → execute → update shape from feedback → regrasp if needed
Achieves 84% grasp success with a three-finger gripper and 91% with a two-finger gripper on real robots, outperforming baselines while simultaneously improving 3D reconstruction quality across all metrics
Works from a single RGB-D view without object-specific training, making it applicable to novel objects encountered in unstructured environments

Abstract

Humans grasp unfamiliar objects by combining an initial visual estimate with tactile and proprioceptive feedback during interaction. We present ShapeGrasp, a robotic implementation of this approach. The proposed method is an iterative grasp-and-complete pipeline that couples implicit surface visuo-haptic shape completion with physics-based grasp planning. From a single RGB-D view, ShapeGrasp infers a complete shape, generates candidate grasps via rigid-body simulation, and executes the best feasible grasp. Each grasp attempt yields additional geometric constraints — tactile surface contacts and space occupied by the gripper body — which are fused to update the object shape. Failures trigger pose re-estimation and regrasping using the refined shape.

20 h=6

Visibility-Aware Mobile Grasping in Dynamic Environments

2026-05-04 cs.RO Anxing Xiao · h=6

Tianrun Hu, Anxing Xiao, David Hsu, Hanbo Zhang

Core Contributions

Addresses the fundamental see-vs-move tradeoff in mobile manipulation: the robot must balance gathering visual information about unobserved regions with making task progress, all under a limited field of view
Combines a whole-body planner with velocity-aware active perception for safe navigation in dynamic environments, and a behavior-tree-based high-level planner for adaptive subgoal generation and runtime failure recovery
Achieves 68.8% and 58.0% success in unknown static and dynamic environments respectively, improving over the baseline by 22.8% and 18.0% — validated on a Fetch mobile manipulator in real-world deployment
Unlike prior approaches that assume known or static environments and decouple seeing from acting, this system jointly optimizes visibility and motion in a unified framework

Abstract

This paper addresses the problem of mobile grasping in dynamic, unknown environments where a robot must operate under a limited field-of-view. The fundamental challenge is the inherent trade-off between "seeing" around to reduce environmental uncertainty and "moving" the body to achieve task progress in a high-dimensional configuration space, subject to visibility constraints. Previous approaches often assume known or static environments and decouple these objectives, failing to guarantee safety when unobserved dynamic obstacles intersect the robot's path during manipulation. In this paper, we propose a unified mobile grasping system comprising two core components: (1) an iterative low-level whole-body planner coupled with velocity-aware active perception to navigate dynamic environments safely; and (2) a hierarchical high-level planner based on behavior trees that adaptively generates subgoals to guide the robot through exploration and runtime failures.

Navigation & SLAM

14 h=9

SAGA: A Robust Self-Attention and Goal-Aware Anchor-based Planner for Safe UAV Autonomous Navigation

2026-05-04 cs.RO Sio-Kei Im · h=9

Junhao Wei, Yanxiao Li, Dexing Yao, Yifu Zhao, Haochen Li

Core Contributions

Achieves 100% navigation success across all tested speed settings (2.0–4.0 m/s) in cluttered environments, while YOPO drops from 90.91% to 62.50% and Ego-planner from 71.43% to 52.63% as speed increases
Formulates UAV local planning as a one-stage joint regression-and-ranking problem over motion anchors — a single forward pass predicts refined terminal states and planning scores for all candidates simultaneously
Introduces polar positional encoding derived from anchor yaw and pitch to preserve directional structure in the self-attention token space, enabling cross-anchor global reasoning about obstacle geometry
At 4.0 m/s, improves minimum safety clearance from 0.44 m (YOPO) to 0.76 m while reducing total flight time from 40.5 s to 27.5 s, so navigation is both safer and faster

Abstract

Agile unmanned aerial vehicle (UAV) navigation in cluttered environments demands a planning architecture that is both computationally efficient and structurally expressive enough to reason over multiple feasible motions. This paper presents SAGA, a robust self-attention and goal-aware anchor-based planner for safe UAV autonomous navigation. SAGA formulates local planning as a one-stage joint regression-and-ranking problem over a fixed lattice of motion anchors. Given a depth image and a body-frame motion state, the planner predicts refined terminal states and planning scores for all anchors in a single forward pass, after which the best candidate is decoded into a dynamically feasible trajectory.

22 h=5

Beyond Specialization: Robust Reinforcement Learning Navigation via Procedural Map Generators

2026-05-04 cs.RO cs.LG Christian Jestel · h=5

Christian Jestel, Nicolas Bach, Marvin Wiedemann, Jan Finke, Peter Detzner

Core Contributions

Provides the first systematic comparison of how procedural map generator types (sparse, maze, graph, Wave Function Collapse) affect RL navigation policy generalization — revealing strongly asymmetric cross-generator transfer (sparse specialist drops to 3.3% on mazes)
A policy trained on the combined generator set achieves 91.5±1.1% mean success across all environments, demonstrating that training diversity is the primary driver of generalization
Shows A* path-planner subgoal inputs are the dominant robustness factor (raising success to 98.9±0.4%), outperforming GRU recurrence which only helps reactive baselines — challenging the assumption that memory architectures are the key to navigation generalization
Learned DRL policies outperform a classical Carrot+A* controller, which matches success only at low speeds (1.0 m/s) but collapses to 24.9% at 2.0 m/s, highlighting learned speed adaptation as the decisive advantage

Abstract

Deep reinforcement learning (DRL) navigation policies often overfit to the structure of their training environments, as environmental diversity is typically constrained by the manual effort required to design diverse scenarios. While procedural map generation offers scalable diversity, no prior work systematically compares how different generator types affect policy generalization. We integrate four generators (sparse, maze, graph, and Wave Function Collapse) with guaranteed navigability into MuRoSim, a 2D simulator focusing on training efficiency for LiDAR-based navigation.

26 h=3

Semantic Risk-Aware Heuristic Planning for Robotic Navigation in Dynamic Environments: An LLM-Inspired Approach

2026-05-04 cs.RO Hamza Durrani · h=3

Hamza Ahmed Durrani, Rafay Suleman Durrani

Core Contributions

Encodes LLM-inspired semantic cost functions that penalize geometrically cluttered or high-risk zones into an A* search framework with closed-loop replanning — a lightweight alternative to running a full LLM at planning time
Achieves 62.0% task success versus 56.5% for BFS with replanning and 4.0% for Greedy without replanning across 200 randomized trials in a 15×15 grid with dynamic obstacles
Obstacle-density ablation shows semantic cost shaping consistently improves navigation across varying difficulty levels, suggesting the benefit is not specific to one environment regime

Abstract

The integration of Large Language Model (LLM) reasoning principles into classical robot path planning represents a rapidly emerging research direction. In this paper, we propose a Semantic Risk-Aware Heuristic (SRAH) planner that encodes LLM-inspired cost functions penalising geometrically cluttered or high-risk zones into an A* search framework, augmented with closed-loop replanning upon dynamic obstacle detection.
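The core mechanism, A* search with an extra cost for cluttered cells, can be sketched in a few lines. This is an illustrative stand-in, not the SRAH planner: the clutter measure (number of obstacles in a cell's 8-neighborhood), the `risk_weight` value, and the 5×5 grid are assumptions, and the closed-loop replanning step is omitted.

```python
import heapq
from typing import Dict, List, Optional, Set, Tuple

Cell = Tuple[int, int]


def clutter(cell: Cell, obstacles: Set[Cell]) -> int:
    """Assumed risk proxy: count of obstacles among the 8 neighboring cells."""
    r, c = cell
    return sum(
        (r + dr, c + dc) in obstacles
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )


def manhattan(a: Cell, b: Cell) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def plan(start: Cell, goal: Cell, obstacles: Set[Cell], size: int,
         risk_weight: float = 1.0) -> Optional[List[Cell]]:
    """A* on a 4-connected grid; each step costs 1 plus a penalty for entering cluttered cells."""
    frontier = [(float(manhattan(start, goal)), 0.0, start)]
    best_cost: Dict[Cell, float] = {start: 0.0}
    parent: Dict[Cell, Cell] = {}
    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cell == goal:
            path = [cell]
            while cell in parent:
                cell = parent[cell]
                path.append(cell)
            return path[::-1]
        if cost > best_cost[cell]:
            continue  # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cell[0] + dr, cell[1] + dc)
            if not (0 <= nxt[0] < size and 0 <= nxt[1] < size) or nxt in obstacles:
                continue
            new_cost = cost + 1.0 + risk_weight * clutter(nxt, obstacles)
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                parent[nxt] = cell
                heapq.heappush(frontier, (new_cost + manhattan(nxt, goal), new_cost, nxt))
    return None


obstacles = {(2, 2)}
shortest = plan((2, 0), (2, 4), obstacles, size=5, risk_weight=0.0)
cautious = plan((2, 0), (2, 4), obstacles, size=5, risk_weight=5.0)
print(len(shortest), len(cautious))
```

With the penalty switched off the planner hugs the obstacle; with it on, the path is longer but never enters a cell next to the obstacle. The Manhattan heuristic stays admissible because the penalty only ever adds cost.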

27 h=3

LiDAR Teach, Radar Repeat: Robust Cross-Modal Navigation in Degenerate and Varying Environments

2026-05-04 cs.RO Yushuai Chen · h=3

Renxiang Xiao, Yichen Chen, Yuanfan Zhang, Qianyi Shao, Yushuai Chen

Core Contributions

First cross-modal, cross-platform LiDAR-Teach-and-Radar-Repeat navigation system: teaches with precise LiDAR under good conditions, repeats with robust 4D radar under degraded conditions (nighttime, smoke, adverse weather)
Cross-Modal Registration network jointly exploits Doppler-based motion priors and physical laws governing LiDAR intensity and radar power density to align sparse, noisy radar with dense LiDAR maps
Adaptive fine-tuning incrementally updates the registration network based on localization errors without ground-truth labels, enabling long-term adaptability to static environmental changes
Validated across 3 robot platforms over 40+ km across 6 months — achieving centimeter-level accuracy and significantly outperforming existing cross-modal approaches in the most extensive deployment reported for radar-based teach-and-repeat

Abstract

Long-term autonomy requires robust navigation in environments subject to dynamic and static changes, as well as adverse weather conditions. Teach-and-Repeat (T&R) navigation offers a reliable and cost-effective solution by avoiding the need for consistent global mapping; however, existing T&R systems lack a systematic solution to tackle various environmental variations such as weather degradation, ephemeral dynamics, and structural changes. This work proposes LTR², the first cross-modal, cross-platform LiDAR-Teach-and-Radar-Repeat system that systematically addresses these challenges.

28 h=3

DynoSLAM: Dynamic SLAM with Generative Graph Neural Networks for Real-World Social Navigation

2026-05-04 cs.RO cs.CV Gonzalo Ferrer · h=3

Danil Tokhchukov, Veronika Morozova, Gonzalo Ferrer

Core Contributions

Integrates socially-aware GNN-based pedestrian prediction directly into the SLAM factor graph via a dynamic Mahalanobis distance factor — unlike conventional approaches that treat SLAM and motion prediction as separate pipelines
Uses Monte Carlo rollouts from a stochastic World Model formulation to capture multimodal epistemic uncertainty of human interactions, avoiding the "argmax problem" that causes deterministic prediction approaches to fail
Extracts empirical mean and covariance matrices of future pedestrian states to provide a mathematically rigorous probabilistic safety envelope for downstream local planners in crowded environments
Demonstrates through extensive simulations that the stochastic formulation prevents optimization failures while maintaining highly accurate retrospective tracking of dynamic agents

Abstract

Traditional Simultaneous Localization and Mapping (SLAM) algorithms rely heavily on the static environment assumption, which severely limits their applicability in real-world spaces populated by moving entities, such as pedestrians. In this work, we propose DynoSLAM, a tightly-coupled Dynamic GraphSLAM architecture that integrates socially-aware Graph Neural Networks (GNNs) directly into the factor graph optimization. Unlike conventional approaches that use rigid constant-velocity heuristics or deterministic single-agent neural priors, our framework formulates pedestrian motion forecasting as a stochastic World Model.
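The statistics behind such a factor are simple to illustrate: sample many future pedestrian positions, fit an empirical mean and covariance, and score a candidate position by its squared Mahalanobis distance. The sketch below is a 2D toy under stated assumptions; the `toy_rollout` sampler stands in for the paper's GNN world model, and the actual factor formulation is not reproduced here.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Cov = Tuple[Tuple[float, float], Tuple[float, float]]


def empirical_gaussian(samples: Sequence[Point]) -> Tuple[Point, Cov]:
    """Sample mean and unbiased sample covariance of 2D points."""
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    sxx = sum((x - mx) ** 2 for x, _ in samples) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in samples) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in samples) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def mahalanobis_sq(point: Point, mean: Point, cov: Cov) -> float:
    """Squared Mahalanobis distance using the closed-form 2x2 inverse."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    return (dx * (d * dx - b * dy) + dy * (a * dy - c * dx)) / det


def toy_rollout(rng: random.Random) -> Point:
    """Hypothetical stochastic predictor: pedestrian ends up near (1, 2)."""
    return (1.0 + rng.gauss(0.0, 0.5), 2.0 + rng.gauss(0.0, 0.1))


rng = random.Random(0)
rollouts: List[Point] = [toy_rollout(rng) for _ in range(500)]
mean, cov = empirical_gaussian(rollouts)
# A robot position inside the predicted spread scores low; one far away scores high.
near = mahalanobis_sq((1.0, 2.0), mean, cov)
far = mahalanobis_sq((1.0, 3.0), mean, cov)
print(near < far)
```

Because the covariance comes from the rollouts themselves, directions in which the pedestrian's motion is uncertain are penalized less than directions in which the prediction is tight.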

29 h=2

Parking Assistance for Trailer-Truck Transport Vehicles Using Sensor Fusion and Motion Planning

2026-05-04 cs.RO J. Fletcher · h=2

George Alenchery, Thomas Jeske, Tova Quinones, Lentz Fortune, Tristan Lindo-Slones

Core Contributions

Proposes a unified framework for autonomous trailer-truck parking integrating sensor fusion, Hybrid A* path planning, NMPC control, and infrastructure awareness for the particularly challenging articulated vehicle domain
Adapts an open-source A* path planning simulation to incorporate a tractor-trailer kinematic model, demonstrating feasibility of articulated vehicle planning in a command-line simulation environment
Identifies jackknife prevention as a critical remaining challenge for autonomous trailer-truck parking, providing a roadmap for future system-level coordination work

Abstract

Autonomous driving technology has rapidly evolved over the past decade, offering significant improvements in transportation efficiency, safety, and cost reduction. While much of the progress has focused on highway driving and obstacle avoidance, low-speed maneuvers such as parking remain among the most difficult challenges for autonomous systems. This challenge is especially pronounced in trailer-truck transport vehicles due to their articulated motion and environmental constraints. This paper presents a proposed framework for autonomous truck parking that integrates perception, motion planning, control systems, and infrastructure awareness.

Control & State Estimation

2 h=47

Sampling-Based Control via Entropy-Regularized Optimal Transport

2026-05-04 cs.RO math.OC Evangelos A. Theodorou · h=47

Vincent Pacelli, Akash Ratheesh, Evangelos A. Theodorou

Core Contributions

Replaces the information-theoretic (KL-divergence) foundation of MPPI and CEM with an entropy-regularized optimal transport formulation, directly addressing the mode-averaging pathology that plagues these methods on multimodal cost landscapes
Computes an optimal coupling between candidate control sequences and low-cost proposals, refining each candidate toward its nearest promising sample while maintaining ensemble coverage of the full solution space
Derives closed-form, gradient-free updates via the Sinkhorn algorithm, preserving the real-time performance advantage of sampling-based MPC while gaining geometric awareness
Demonstrates improved success rates over MPPI and CEM on navigation, manipulation, and locomotion tasks — the first application of OT-based sampling to real-time nonlinear control

Abstract

Sampling-based model predictive control methods like MPPI and CEM are essential for real-time control of nonlinear robotic systems, particularly where discontinuous dynamics preclude gradient-based optimization. However, these methods derive from information-theoretic objectives that are agnostic to the geometry of the control problem, leading to pathological behaviors such as mode-averaging when the cost landscape is complex. We present OT-MPC, a sampling-based algorithm that overcomes these limitations through an entropy-regularized optimal transport formulation. By computing an optimal coupling between candidate control sequences and low-cost proposals, OT-MPC refines candidates toward nearby promising samples while coordinating updates across the ensemble to maintain coverage of the solution space.
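The coupling step can be illustrated with a small Sinkhorn solver on scalar controls. This is a sketch of entropy-regularized optimal transport in general, not OT-MPC's update rule: moving each candidate to the barycenter of its coupled proposals, the squared-distance cost, and the `epsilon` value are all assumptions.

```python
import math
from typing import List, Sequence


def sinkhorn(cost: Sequence[Sequence[float]], epsilon: float = 0.5, iters: int = 200) -> List[List[float]]:
    """Entropy-regularized OT plan between uniform marginals via Sinkhorn scaling."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # mass on each candidate
    b = [1.0 / m] * m  # mass on each proposal
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def transport_update(candidates: Sequence[float], proposals: Sequence[float],
                     epsilon: float = 0.5) -> List[float]:
    """Move each candidate to the plan-weighted average of the proposals it is coupled to."""
    cost = [[(c - p) ** 2 for p in proposals] for c in candidates]
    plan = sinkhorn(cost, epsilon)
    return [
        sum(w * p for w, p in zip(row, proposals)) / sum(row)
        for row in plan
    ]


# Two equally good modes at -2 and +2 (e.g. pass an obstacle on the left or the right).
candidates = [-1.0, 1.0]
proposals = [-2.0, 2.0]
plan = sinkhorn([[(c - p) ** 2 for p in proposals] for c in candidates])
updated = transport_update(candidates, proposals)
naive_average = sum(proposals) / len(proposals)
print(updated, naive_average)
```

In this toy case each candidate moves to its own nearby mode, while a plain average of the two equally good proposals lands at 0, between the modes. That contrast is the mode-averaging problem the abstract refers to.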

3 h=43

Robust Adaptive Predictive Control for Hook-Based Aerial Transportation Between Moving Platforms

2026-05-04 cs.RO eess.SY M. Zeilinger · h=43

Péter Antal, Andrea Carron, Melanie Zeilinger, Roland Tóth, Tamás Péni

Core Contributions

First MPC approach for autonomous pick-and-place between moving platforms using a hook-equipped aerial manipulator — a significantly more complex task than static-platform aerial manipulation
Uses a MuJoCo digital twin as the predictive model, enabling rapid and accurate modeling of the complex quadcopter-hook-payload dynamics without manual equation derivation
Integrates zero-order robust optimization (zoRO) for uncertainty propagation with an EKF for online parameter estimation, ensuring robust constraint satisfaction under aerodynamic uncertainty and unknown payloads
Validated in both complex simulated scenarios and real-world flight experiments, demonstrating computational efficiency suitable for onboard deployment

Abstract

This paper presents a novel model predictive control (MPC) approach for autonomous pick-and-place between moving platforms with a hook-equipped aerial manipulator. First, for accurate and rapid modeling of the complex dynamics, a digital twin model of the quadcopter equipped with a hook-based gripper, implemented in MuJoCo, is constructed and used as the predictive model for the MPC. To handle uncertainties of the predictive model (e.g. due to aerodynamics and uncertain payloads), a robust adaptive MPC approach is proposed. By systematic integration of zero-order robust optimization (zoRO) based uncertainty propagation and an extended Kalman filter (EKF) for parameter estimation, the MPC algorithm ensures robust constraint satisfaction, high performance, and computational efficiency.
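To make the online parameter-estimation idea concrete, here is a toy sketch of a scalar EKF that estimates an unknown total mass from vertical acceleration. The measurement model `accel = thrust / mass - g`, the random-walk parameter model, the noise values, and the function name `ekf_mass_update` are all assumptions for illustration. The paper's estimator works with a MuJoCo digital twin, and the zoRO uncertainty propagation is not shown here.

```python
def ekf_mass_update(m_est, P, thrust, accel_meas, Q=1e-4, R=0.05, g=9.81):
    """One EKF step for a scalar mass parameter (hypothetical toy model)."""
    # Prediction: the parameter follows a random walk, so only P grows.
    P = P + Q
    # Nonlinear measurement model and its derivative with respect to mass.
    predicted_accel = thrust / m_est - g
    H = -thrust / (m_est ** 2)
    # Standard EKF measurement update in scalar form.
    S = H * P * H + R
    K = P * H / S
    m_new = m_est + K * (accel_meas - predicted_accel)
    P_new = (1.0 - K * H) * P
    return m_new, P_new

true_mass = 1.5          # vehicle plus payload, unknown to the filter
m_est, P = 1.0, 1.0      # initial guess and its variance
for _ in range(100):
    thrust = 15.0
    accel = thrust / true_mass - 9.81   # noise-free measurement for clarity
    m_est, P = ekf_mass_update(m_est, P, thrust, accel)
```

In a robust adaptive MPC loop, an estimate like `m_est` would be fed back into the predictive model at each control step.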

13 h=13

Natural Gradient Bayesian Filtering: Geometry-Aware Filter for Dynamical Systems

2026-05-04 cs.RO eess.SY Ting Yuan · h=13

Chang Liu, Wenhan Cao, Zeju Sun, Tianyi Zhang, Jiayu Yuan

Core Contributions

Reframes Gaussian filtering from an information-geometric perspective, using natural gradient descent on the statistical manifold of Gaussian distributions to iteratively refine posterior mean and covariance
Proves that a single natural-gradient step exactly recovers the classical Kalman measurement update in the linear-Gaussian case, establishing a direct theoretical connection between information geometry and Kalman filtering
The NANO filter preserves positive definiteness of the covariance matrix by construction (following the manifold geometry), avoiding the numerical issues that plague ad-hoc covariance corrections in standard EKF/UKF
Demonstrated on satellite attitude estimation, SLAM, quadruped and humanoid robot state estimation — showing practical benefits across a diverse range of nonlinear estimation problems

Abstract

Bayesian filtering is a cornerstone of state estimation in complex systems such as aerospace systems, yet exact solutions are available only for linear Gaussian models. In practice, nonlinear systems are handled through tractable approximations, with Gaussian filters such as the extended and unscented Kalman filters being among the most widely used methods. This tutorial revisits Gaussian filtering from an information-geometric perspective, viewing the prediction and measurement update steps as inference procedures over state distributions. Within this framework, we introduce a geometry-aware Gaussian filtering approach that leverages natural gradient descent on the statistical manifold of Gaussian distributions.
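The linear-Gaussian equivalence can be checked numerically in the scalar case. The sketch below assumes the standard variational view, in which a natural-gradient step on the Gaussian's natural parameters (precision times mean, and precision) blends the current iterate with the prior-plus-likelihood parameters. The function names are hypothetical and this is not the paper's NANO implementation.

```python
def kalman_update(m, P, y, h, R):
    """Classical scalar Kalman measurement update for y = h * x + noise."""
    K = P * h / (h * P * h + R)
    return m + K * (y - h * m), (1.0 - K * h) * P

def natural_gradient_step(m, P, y, h, R, step=1.0):
    """One natural-gradient step in natural-parameter space (sketch)."""
    # Natural parameters of N(m, P): (precision * mean, precision).
    prior = (m / P, 1.0 / P)
    # A linear-Gaussian likelihood contributes additively in this space.
    likelihood = (h * y / R, h * h / R)
    target = (prior[0] + likelihood[0], prior[1] + likelihood[1])
    current = prior  # the iterate starts at the prior
    new = tuple((1.0 - step) * c + step * t for c, t in zip(current, target))
    # Convert back to mean and variance.
    return new[0] / new[1], 1.0 / new[1]

m_kf, P_kf = kalman_update(1.0, 2.0, y=3.0, h=0.5, R=0.4)
m_ng, P_ng = natural_gradient_step(1.0, 2.0, y=3.0, h=0.5, R=0.4)
```

With a unit step size the two updates agree up to floating-point error. For nonlinear measurement models the likelihood term is no longer exactly quadratic, which is where iterating the step becomes useful.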

24 h=5

Exact Higher-Order Derivatives for SE(3) via Analytical/AD Methods

2026-05-04 cs.RO cs.MS F. Kuehnel · h=5

Frank O. Kuehnel

Core Contributions

Provides a practical hybrid analytical/AD recipe for computing exact Hessians and higher-order derivative tensors of SE(3) objectives — placing the analytical/AD seam at the point-action interface y=Tx to maximize efficiency
The seeded-Hessian path is approximately 5× faster than finite-differencing the AD gradient while matching a nested-AD oracle to machine precision, adding only ~70 lines of analytical-Jacobian code over an AD-only baseline
Identifies and fixes a removable singularity in the standard SO(3)/SE(3) scalar basis that produces NaNs at the origin under seeded AD — a previously undocumented pitfall for practitioners
Enables exact Newton steps, observed-information covariance estimates, and covariance correction for SE(3) optimization problems without finite-difference tuning — critical for robotics SLAM and state estimation

Abstract

Fast prototyping of new SE(3) estimation objectives remains awkward in practice. Modern Lie-group frameworks — GTSAM, manif, Sophus, SymForce, Ceres — target first-order workloads through different code-generation and automatic-differentiation strategies. The remaining gap is a compact, AD-safe path from these first-order primitives to exact Hessians, observed-information matrices, and higher-order derivative tensors. This paper presents a hybrid analytical/AD recipe for SE(3) negative log-likelihoods.
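The removable singularity at the origin can be illustrated with the two scalar coefficients that appear in the Rodrigues formula, sin(θ)/θ and (1 − cos θ)/θ². The sketch below uses the common Taylor-series fallback near zero. Which coefficients the paper means by "scalar basis" and how it repairs them are not stated in this digest, so treat the function `so3_coefficients` and the threshold `tol` as assumptions.

```python
import math

def so3_coefficients(theta, tol=1e-4):
    """Return (sin(t)/t, (1 - cos(t))/t**2) without dividing by zero."""
    if abs(theta) < tol:
        # Taylor expansions about theta = 0; both limits are finite.
        t2 = theta * theta
        a = 1.0 - t2 / 6.0 + t2 * t2 / 120.0
        b = 0.5 - t2 / 24.0 + t2 * t2 / 720.0
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / (theta * theta)
    return a, b

at_origin = so3_coefficients(0.0)
away_from_origin = so3_coefficients(0.5)
```

The closed-form branch would raise a division error at θ = 0 in plain Python. In array or AD frameworks, the same expressions typically surface as NaNs in the value or its derivatives.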

Perception & Scene Understanding

4 h=35

AnchorD: Metric Grounding of Monocular Depth Using Factor Graphs

2026-05-04 cs.RO cs.CV Abhinav Valada · h=35

Simon Dorer, Martin Büchner, Nick Heppert, Abhinav Valada

Core Contributions

Proposes a training-free depth grounding framework that anchors monocular depth foundation model predictions in raw sensor depth through patch-wise affine alignment via factor graph optimization — preserving fine-grained geometric structure while correcting metric scale
Introduces a benchmark dataset with dense scene-wide ground truth depth for non-Lambertian objects (transparent, specular surfaces) obtained via matte reflection spray and multi-camera fusion — overcoming the reliance on CAD-based annotations
Works across diverse sensors and domains without any retraining, making it immediately applicable to existing depth sensor setups that struggle with reflective materials
Addresses a practical robotics pain point: depth sensors fail on transparent and specular surfaces, but monocular foundation models provide good structure — this method combines the best of both worlds

Abstract

Dense and accurate depth estimation is essential for robotic manipulation, grasping, and navigation, yet currently available depth sensors are prone to errors on transparent, specular, and general non-Lambertian surfaces. To mitigate these errors, large-scale monocular depth estimation approaches provide strong structural priors, but their predictions can be potentially skewed or mis-scaled in metric units, limiting their direct use in robotics. Thus, in this work, we propose a training-free depth grounding framework that anchors monocular depth estimation priors from a depth foundation model in raw sensor depth through factor graph optimization.
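A single patch of the affine alignment can be sketched as a closed-form least-squares fit of scale and shift. This simplification solves each patch independently, whereas the paper couples patches through a factor graph. The function name `fit_scale_shift`, the use of `None` for missing sensor pixels, and the sample values are assumptions.

```python
def fit_scale_shift(mono, sensor):
    """Least-squares (scale, shift) so that scale * mono + shift ~ sensor."""
    # Use only pixels where the depth sensor returned a value.
    pairs = [(d, z) for d, z in zip(mono, sensor) if z is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two valid sensor pixels")
    mean_d = sum(d for d, _ in pairs) / n
    mean_z = sum(z for _, z in pairs) / n
    var_d = sum((d - mean_d) ** 2 for d, _ in pairs)
    if var_d == 0.0:
        raise ValueError("monocular depth is constant over the patch")
    cov = sum((d - mean_d) * (z - mean_z) for d, z in pairs)
    scale = cov / var_d
    shift = mean_z - scale * mean_d
    return scale, shift

# Hypothetical patch: relative monocular depth and metric sensor depth.
mono_patch = [0.2, 0.4, 0.6, 0.8]
sensor_patch = [1.4, None, 2.2, 2.6]   # None marks a failed sensor reading
scale, shift = fit_scale_shift(mono_patch, sensor_patch)
grounded = [scale * d + shift for d in mono_patch]
```

The grounded patch keeps the monocular structure and fills the pixel where the sensor failed with a metric estimate.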

5 h=35

Hyp2Former: Hierarchy-Aware Hyperbolic Embeddings for Open-Set Panoptic Segmentation

2026-05-04 cs.CV cs.AI cs.RO Abhinav Valada · h=35

Yao Lu, Rohit Mohan, Florian Drews, Yakov Miron, Abhinav Valada

Core Contributions

Exploits the natural hierarchy of semantic categories (e.g., "dog" → "animal" → "object") by learning embeddings in hyperbolic space, enabling unknown objects to remain close to higher-level concepts even when their fine-grained category was never seen during training
Does not require explicit modeling of unknowns during training — unlike prior open-set methods that need proxy unknown classes or out-of-distribution sampling
Achieves the best balance between unknown object discovery and in-distribution robustness across MS COCO, Cityscapes, and Lost&Found benchmarks
The hierarchical embedding structure provides interpretable failure modes: an unknown animal will cluster near "animal" rather than "electronics," giving downstream reasoning systems a meaningful semantic neighborhood for the detected object

Abstract

Recognizing unknown objects is crucial for safety-critical applications such as autonomous driving and robotics. Open-Set Panoptic Segmentation (OPS) aims to segment known thing and stuff classes while identifying valid unknown objects as separate instances. Prior OPS approaches largely treat known categories as a flat label set, ignoring the semantic hierarchy that provides valuable structural priors for distinguishing unknown objects from in-distribution classes. In this work, we propose Hyp2Former, an end-to-end framework for OPS that does not require explicit modeling of unknowns during training, and instead learns hierarchical semantic similarities continuously in hyperbolic space.
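The intuition that an unknown object stays near its parent concept can be shown with the Poincaré-ball distance, one common model of hyperbolic space. Whether the paper uses this model is not stated in this digest, and the 2-D coordinates below are invented purely for illustration.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball."""
    def sq_norm(x):
        return sum(c * c for c in x)
    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)

# Made-up embeddings: general concepts sit nearer the origin,
# fine-grained ones nearer the boundary of the ball.
animal = (0.3, 0.0)
electronics = (-0.3, 0.0)
unknown_object = (0.7, -0.15)   # e.g. an animal class never seen in training

d_animal = poincare_distance(unknown_object, animal)
d_electronics = poincare_distance(unknown_object, electronics)
```

In this toy layout the unknown object is closer to "animal" than to "electronics", which is the kind of semantic neighborhood the bullet above describes.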

17 h=6

Temporally Consistent Object 6D Pose Estimation for Robot Control

2026-05-04 cs.RO cs.CV Médéric Fourmy · h=6

Kateryna Zorina, Vojtech Priban, Mederic Fourmy, Josef Sivic, Vladimir Petrik

Core Contributions

Addresses the critical gap between single-frame pose estimation accuracy and the temporal consistency required for stable robot feedback control — off-the-shelf pose estimators produce frame-to-frame jitter that destabilizes controllers
Develops a factor graph approach incorporating object motion models, explicit pose measurement uncertainty estimation, and online optimization-based smoothing with outlier rejection
Significantly improves results on standardized pose estimation benchmarks while enabling stable visual servoing on a torque-controlled manipulator — bridging the perception-to-control gap
The explicit uncertainty estimation allows the controller to adapt its response based on pose estimate confidence, rather than treating all estimates as equally reliable

Abstract

Single-view RGB object pose estimators have reached a level of precision and efficiency that makes them good candidates for vision-based robot control. However, off-the-shelf methods lack temporal consistency and robustness that are mandatory for a stable feedback control. In this work, we develop a factor graph approach to enforce temporal consistency of the object pose estimates. In particular, the proposed approach: (i) incorporates object motion models, (ii) explicitly estimates the object pose measurement uncertainty, and (iii) integrates the above two components in an online optimization-based estimator.

18 h=6

Orchestrating Spatial Semantics via a Zone-Graph Paradigm for Intricate Indoor Scene Generation

2026-05-04 cs.RO cs.AI Shizhao Sun · h=6

Meisheng Zhang, Shizhao Sun, Yang Zhao, Ziyuan Liu, Zhijun Gao

Core Contributions

Shifts indoor scene synthesis from object-centric to zone-graph orchestration — translating high-level semantic intent into functional zones and topological constraints that handle non-convex rooms where prior methods fail
Constructs Zone-Scene-10K, a large-scale dataset with explicit zone-graph annotations, and releases SCALE, a stress-test benchmark specifically for irregular indoor scenarios with complex spatial relations
Alternating Alignment Strategy cycles between reasoning internalization and Zone-Aware Group Relative Policy Optimization, reconciling semantic richness with geometric validity without external physics engines
Resolves the density-safety dichotomy: places furniture densely enough to be functional while avoiding physically invalid configurations — a tradeoff that object-level generators consistently fail at in irregular rooms

Abstract

Autonomous 3D indoor scene synthesis breaks down in non-convex rooms with tightly coupled spatial constraints. Data-driven generators lack topological priors for long-horizon planning, while iterative agents fragment semantics and become geometrically brittle. We present ZoneMaestro, a unified framework that shifts the paradigm from object-centric synthesis to Zone-Graph Orchestration.

Human-Robot Interaction & Assistive Robotics

8 h=19

Robotic Affection — Opportunities of AI-based Haptic Interactions to Improve Social Robotic Touch

2026-05-04 cs.HC cs.RO J. Gerken · h=19

Ali Askari, Jens Gerken

Core Contributions

Proposes a neurobiology-inspired multi-model architecture that decomposes affective touch into distinct specialized subtask models, treating it as a distributed closed-loop perceptual task rather than a monolithic motoric movement
Introduces a peer-to-peer, state-sharing framework designed to overcome the "haptic uncanny valley" — the phenomenon where near-human but imperfect touch feels more disturbing than clearly robotic touch
Outlines a Sim-to-Real pipeline for affective touch that enables haptics, AI, and robotics researchers to contribute independently yet coherently to the same system
Position paper that provides a structured roadmap for a largely underexplored area — while robotic grasping and dexterity have advanced rapidly, social touch capabilities remain primitive

Abstract

Despite the advancement in robotic grasping and dexterity through haptic information, affective social touch, such as handshaking or reassuring stroking, remains a major challenge in Human-Robot-Interaction. This position paper examines current progress and limitations across artificial intelligence, haptics and robotics research, and proposes a novel multi-model architecture to address these gaps.

9 h=18

Tensegrity Crutches with Compliance from a Pre-stressed Self-tensile Module

2026-05-04 cs.RO nlin.AO D. Dotov · h=18

Jingxian Gu, Joanna Spyra, Andrew Walski, Lyla Elsaesser, Samuel Bierner

Core Contributions

Designs a biologically inspired tensegrity crutch using a pre-stressed self-tensile two-cell structure that provides compliance without compromising stability — unlike existing spring-loaded designs that reduce perceived stability
Human trials (N=18) show the tensegrity design improves effort, comfort, pain, and usability versus rigid crutches, while spring-loaded crutches reduce perceived stability and walking speed
Achieves favorable nonlinear stiffness, ground-following, and force feedback through the tensegrity structure — properties that emerge from the pre-stressed geometry rather than requiring active control
Addresses a significant accessibility problem: about six million people in the US use crutches, and rigid designs can cause secondary injuries to the upper joints; this is the first tensegrity-based solution validated with human participants

Abstract

Purpose: Six million people use crutches as mobile aids in the US. Rigid designs with no axial mobility limit sensory feedback and lead to secondary injury on the upper joints. Spring-loaded designs offer compliance but may compromise stability. We designed a biologically inspired tensegrity crutch with a compliant module aiming to achieve favorable mechanical properties.

12 h=13

Adaptive Gait Generation for Multi-Terrain Exoskeletons via Constrained Kernelized Movement Primitives

2026-05-04 cs.RO S. Tortora · h=13

Edoardo Trombin, Miroljub Mihailovic, Matheus Henrique Ferreira Moura, Luca Tonin, Emanuele Menegatti

Core Contributions

Proposes a Kernelized Movement Primitives framework for adaptive gait generation across multiple indoor terrains (flat, slopes, stairs, obstacles) — current exoskeletons are limited to flat, even surfaces
Learns a probabilistic gait representation in both joint and task spaces from a limited number of human demonstrations, adapted in real-time using via-points from onboard RGB-D environmental sensing
Formulates adaptive gait generation as a linearly constrained optimization problem, ensuring kinematic feasibility while adapting to terrain detected by the onboard camera
Validated on a commercial lower limb exoskeleton in real-world scenarios including stair climbing and obstacle crossing — demonstrating feasibility of environment-aware gait planning for assistive robotics

Abstract

Lower limb exoskeletons (LLEs) have the potential to help motor-impaired individuals walk again. Their application in real-world environments is still limited by the lack of effective adaptive gait planning. Indeed, current exoskeletons are meant to walk only on flat and even terrain. Generating environment-aware, physiologically consistent gait trajectories in real time is an open challenge. To overcome this, we propose a novel Kernelized Movement Primitives (KMP)-based framework for adaptive gait generation (AGG) across multiple indoor terrains.

21 h=6

Shared Autonomy Assisted by Impedance-Driven Anisotropic Guidance Field

2026-05-04 cs.RO cs.HC Yupu Lu · h=6

Sihan Chen, Hang Xu, Yupu Lu, Chen Wang, Benfang Duan

Core Contributions

Addresses the mutual understanding gap in shared autonomy: while most research focuses on robots inferring human intent, this work enables humans to understand the robot's intent through impedance-based physical communication
Adaptively modulates the robot's dynamic response to human input via an anisotropic guidance field, providing continuous, physically grounded communication without requiring additional visual or auditory interfaces
User studies across three scenarios and two teleoperation interfaces show improvements in task performance, human-robot agreement, and subjective experience
Inspired by impedance control principles, the approach is more intuitive than prior explicit intent communication methods (screen overlays, haptic displays) because the communication channel is embedded in the task interaction itself

Abstract

Shared autonomy (SA) enables robots to infer human intent and assist in its achievement. While most research focuses on improving intent inference, it overlooks whether humans can understand the robot's intent in return. Without such mutual understanding, collaboration becomes less effective, degrading user experience and task performance. Inspired by impedance control, we propose Impedance-Driven Anisotropic Guidance Field Enhanced Shared Autonomy (IAGF-SA), a novel paradigm that extends SA with an embodied, physically-grounded communication channel.

Robot Learning & Multi-Robot Systems

10 h=14

AoI-Aware Multi-Robot Sensing and Transport on Connected Graphs

2026-05-04 cs.RO John Tadrous · h=14

John Tadrous

Core Contributions

Derives per-node and network-wide Age of Information (AoI) lower bounds that cleanly decompose into a sensing term (mean group sensing times) and a propagation term (shortest-path distances) — providing theoretical foundations for multi-robot monitoring system design
Shows the sensing component minimization yields a separable discretely convex resource allocation problem, solvable optimally by a greedy water-filling algorithm — avoiding combinatorial explosion
Constructs a shortest-path-tree conveyor architecture with Euler-walk deployment that provably attains the AoI lower bound in the full-conveyor regime
Captures both stochastic parallel sensing delays and hop-based propagation in a unified framework — measuring AoI from sensing start rather than just transport, providing a more realistic model for monitoring applications

Abstract

A team of mobile robots monitors spatially distributed processes and delivers measurements to a base, where AoI is measured from sensing start, capturing both stochastic parallel sensing delays and hop-based propagation. At each non-base node, multiple robots may collaborate, yielding node-dependent geometric group sensing times, while other robots act as mobile conveyors that transport samples along unit-time edges.
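The greedy water-filling idea for the sensing term can be sketched as follows. The sketch assumes each robot at a node succeeds independently with probability `p` per time slot, so a group of `k` robots has a geometric sensing time with mean 1 / (1 − (1 − p)^k). That success model, the unweighted sum objective, and the function names are assumptions, and the paper's exact formulation may differ.

```python
import heapq

def mean_sensing_time(p, k):
    """Mean geometric sensing time for k robots, each succeeding w.p. p."""
    return 1.0 / (1.0 - (1.0 - p) ** k)

def greedy_allocate(probs, total_robots):
    """Assign robots one at a time to the node with the largest marginal gain."""
    n = len(probs)
    if total_robots < n:
        raise ValueError("need at least one robot per node")
    alloc = [1] * n                      # every node gets one sensing robot
    heap = []                            # max-heap via negated gains
    for i, p in enumerate(probs):
        gain = mean_sensing_time(p, 1) - mean_sensing_time(p, 2)
        heapq.heappush(heap, (-gain, i))
    for _ in range(total_robots - n):
        _, i = heapq.heappop(heap)
        alloc[i] += 1
        k = alloc[i]
        gain = mean_sensing_time(probs[i], k) - mean_sensing_time(probs[i], k + 1)
        heapq.heappush(heap, (-gain, i))
    return alloc

node_probs = [0.2, 0.5, 0.9]             # hypothetical per-robot success rates
allocation = greedy_allocate(node_probs, total_robots=6)
total_time = sum(mean_sensing_time(p, k) for p, k in zip(node_probs, allocation))
```

Because each node's mean sensing time is convex and decreasing in the number of robots, picking the largest marginal reduction at every step is optimal for this separable objective, so no search over all allocations is needed.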

19 h=6

Sim-to-Real Transfer and Robustness Evaluation of RL Control on an ASV for Floating Waste Capture

2026-05-04 cs.RO S. Aravecchia · h=6

Luis F. W. Batista, Stéphanie Aravecchia, Cédric Pradalier

Core Contributions

Presents a complete field-validated system combining polarimetric camera perception with DRL-based control for autonomous floating-waste detection and capture — deployed on a retrofitted ASV platform
Introduces a systematic sim-to-real testing methodology with a two-stage simulation protocol and perception abstraction module that mimics real camera behavior, enabling reproducible field trials
Evaluates robustness across 14 disturbance regimes in matched simulation and field experiments, identifying actuation-model fidelity as the primary source of degradation — not perception or control policy
Demonstrates search-and-capture over areas up to 450 m² with centimeter-level terminal accuracy, distilling practical lessons including the importance of latency and timestamp management across modules

Abstract

Autonomous surface vessels for floating-waste removal operate under varying hydrodynamics, external disturbances, and challenging water-surface perception. We present a field-validated system that combines camera-based polarimetric perception with a lightweight DRL-based controller for floating-waste detection and capture.

25 h=4

Enhancing RL Generalizability in Robotics through SHAP Analysis of Algorithms and Hyperparameters

2026-05-04 cs.LG cs.AI cs.RO O. Beyan · h=4

Lingxiao Kong, Cong Yang, Oya Deniz Beyan, Zeyd Boukhers

Core Contributions

First systematic quantitative decomposition of how specific RL algorithm and hyperparameter configurations contribute to the generalization gap — using Shapley values to move beyond aggregate performance metrics
Establishes a theoretical foundation connecting Shapley values to RL generalizability, then uses SHAP-guided configuration selection to improve cross-environment transfer
Reveals consistent configuration impact patterns across diverse robotic tasks and environments, suggesting that SHAP insights transfer and can serve as practical guidance for RL practitioners
Provides actionable recommendations: rather than extensive hyperparameter sweeps per environment, practitioners can use SHAP rankings from reference environments to select configurations likely to generalize

Abstract

Despite significant advances in Reinforcement Learning (RL), model performance remains highly sensitive to algorithm and hyperparameter configurations, while generalization gaps across environments complicate real-world deployment. Although prior work has studied RL generalization, the relative contribution of specific configurations to the generalization gap has not been quantitatively decomposed and systematically leveraged for configuration selection.
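For readers unfamiliar with Shapley values, the decomposition can be sketched exactly for a tiny configuration space. The "players" and the toy value function below are invented for illustration and are not results from the paper. Practical SHAP tooling approximates this computation rather than enumerating all orderings.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values by averaging marginal gains over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}

def toy_gap_reduction(coalition):
    """Made-up generalization-gap reduction for a set of tuned choices."""
    gain = 0.0
    if "algorithm" in coalition:
        gain += 0.10
    if "learning_rate" in coalition:
        gain += 0.04
    if {"algorithm", "learning_rate"} <= coalition:
        gain += 0.06                     # interaction between the two choices
    return gain

players = ["algorithm", "learning_rate", "batch_size"]
phi = shapley_values(players, toy_gap_reduction)
```

The interaction term is split evenly between the two interacting choices, the choice with no effect receives zero, and the attributions sum to the value of the full configuration.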

30 h=2

Higher-Order Flexible Configurations of Planar Parallel Manipulators Constructed by Averaging

2026-05-04 cs.RO math.AG Georg Nawratil · h=2

Yudi Zhao, Georg Nawratil

Core Contributions

Investigates singular configurations of planar 3-RPR parallel manipulators arising from averaging solution pairs of the direct kinematic problem — a mathematical approach to understanding manipulator flexibility
Parametrizes input pairs and determines their relative orientation to increase the flexion order of averaged configurations without computing zeros of the degree-6 forward kinematics polynomial
The methodology extends to spherical and spatial analogues of planar 3-RPR manipulators, providing a unified algebraic-geometric framework for studying parallel mechanism singularities

Abstract

This paper investigates singular configurations of planar 3-RPR parallel manipulators, which result from applying the averaging technique to solution pairs of their direct kinematic problem. Without computing the zeros of the corresponding degree 6 polynomial we parametrize the input pairs and determine their relative orientation in a way that the flexion order of the averaged configurations increases. Moreover, the obtained results are visualized for concrete examples.