🤖 Robotics arXiv Digest

Monday, May 4, 2026

📄 30 papers 📂 7 research areas ✨ Generated by Claude

🔭 Research Landscape

Today's batch shows a robotics community focused on making Vision-Language-Action (VLA) models practical for real deployment. MolmoAct2 sets a new bar as a fully open VLA with a flow-matching action expert, adaptive-depth reasoning, and the largest open bimanual dataset (720 hours), while Latent Bridge tackles the inference bottleneck by predicting VLM output deltas, cutting backbone calls by 50–75% while retaining 95–100% of task performance. Meanwhile, Seeing Realism from Simulation addresses data hunger by converting simulated VLA videos into realistic training data via conditional video transfer, improving RDT-1B by 8% on Robotwin 2.0 and π₀ by 5.1% on LIBERO-Plus. Together, these three papers outline a complete pipeline: generate cheap sim data (Seeing Realism), train powerful open models (MolmoAct2), and deploy them efficiently (Latent Bridge).

A second major thread is the convergence of classical optimization with learning-based methods. OT-MPC replaces the information-theoretic foundations of MPPI/CEM with entropy-regularized optimal transport to avoid mode-averaging in complex cost landscapes, while the NANO filter reframes Gaussian filtering through information geometry, with a natural-gradient update that exactly recovers the Kalman measurement update in the linear-Gaussian case. On the manipulation side, PIEGraph fuses analytical spring-mass physics with equivariant GNNs for data-efficient deformable object dynamics, and ShapeGrasp iteratively refines object shape representations through visuo-haptic feedback during grasping. Both report gains over their baselines from pairing a physics-based component with learned or sensed corrections.

Navigation research today emphasizes robustness across environmental conditions and sensor modalities. LTR² introduces the first cross-modal LiDAR-teach/radar-repeat system, validated over more than 40 km across 6 months, while DynoSLAM embeds stochastic GNN-based pedestrian prediction directly into the SLAM factor graph. The procedural map generator study (Beyond Specialization) provides evidence that training diversity, rather than memory architecture, is the primary driver of navigation policy generalization: a policy trained on the combined generator set achieves 91.5% mean success across all environments, whereas a sparse-only specialist drops to 3.3% when tested on mazes.

VLA & Foundation Models

Open VLAs, efficient inference, sim-to-real video transfer, and VLM-integrated navigation

  • #6 MolmoAct2
  • #7 Seeing Realism from Simulation
  • #11 Latent Bridge
  • #23 Semantic Autonomy Framework

Manipulation & Grasping

Physics-augmented dynamics, visuo-haptic shape completion, desk organization, and mobile grasping

  • #1 PIEGraph
  • #15 Robotic Desk Organization
  • #16 ShapeGrasp
  • #20 Visibility-Aware Mobile Grasping

Navigation & SLAM

UAV planning, cross-modal teach-and-repeat, dynamic SLAM, and RL navigation generalization

  • #14 SAGA UAV Planner
  • #22 Procedural Map RL Navigation
  • #26 Semantic Risk-Aware Planning
  • #27 LiDAR Teach, Radar Repeat
  • #28 DynoSLAM
  • #29 Trailer-Truck Parking

Control & State Estimation

Optimal transport MPC, adaptive aerial manipulation, geometry-aware filtering, and SE(3) derivatives

  • #2 OT-MPC
  • #3 Robust Adaptive Aerial MPC
  • #13 NANO Filter
  • #24 SE(3) Higher-Order Derivatives

Perception & Scene Understanding

Monocular depth grounding, open-set segmentation, temporally consistent pose, and indoor scene synthesis

  • #4 AnchorD Depth Grounding
  • #5 Hyp2Former
  • #17 Temporally Consistent 6D Pose
  • #18 ZoneMaestro Scene Generation

Human-Robot Interaction & Assistive Robotics

Affective touch, shared autonomy with impedance guidance, tensegrity crutches, and exoskeleton gait

  • #8 Robotic Affection
  • #9 Tensegrity Crutches
  • #12 Exoskeleton Adaptive Gait
  • #21 IAGF Shared Autonomy

Robot Learning & Multi-Robot Systems

RL generalizability analysis, sim-to-real for aquatic robots, multi-robot AoI optimization, and parallel manipulator kinematics

  • #10 AoI-Aware Multi-Robot
  • #19 ASV Waste Capture Sim-to-Real
  • #25 SHAP Analysis for RL
  • #30 Parallel Manipulator Configurations
VLA & Foundation Models
#6 MolmoAct2 · h=27
2026-05-04 cs.RO Jaemin Cho · h=27
Haoquan Fang, Jiafei Duan, Donovan Clay, Sam Wang, Shuo Liu
Core Contributions
  • First fully open VLA (weights, code, data) to outperform Pi-05 across 7 simulation and real-world benchmarks, while its MolmoER backbone surpasses GPT-5 and Gemini Robotics ER-1.5 on 13 embodied-reasoning benchmarks
  • Introduces flow-matching continuous-action expert grafted onto a discrete-token VLM via per-layer KV-cache conditioning — bridging the gap between language model architectures and continuous robot control
  • Releases MolmoAct2-BimanualYAM: 720 hours of teleoperated bimanual data, the largest open bimanual manipulation dataset to date
  • MolmoThink adaptive-depth reasoning re-predicts depth tokens only for changed scene regions, retaining geometric grounding while drastically cutting latency versus full re-computation
  • OpenFAST action tokenizer trained on millions of trajectories across 5 embodiments provides a standardized action representation for cross-platform VLA deployment
Abstract
Vision-Language-Action (VLA) models aim to provide a single generalist controller for robots, but today's systems fall short on the criteria that matter for real-world deployment. Frontier models are closed, open-weight alternatives are tied to expensive hardware, reasoning-augmented policies pay prohibitive latency for their grounding, and fine-tuned success rates remain below the threshold for dependable use. We present MolmoAct2, a fully open action reasoning model built for practical deployment, advancing its predecessor along five axes. We introduce MolmoER, a VLM backbone specialized for spatial and embodied reasoning, trained on a 3.3M-sample corpus with a specialize-then-rehearse recipe. We release three new datasets spanning low-to-medium cost platforms, including MolmoAct2-BimanualYAM, 720 hours of teleoperated bimanual trajectories that constitute the largest open bimanual dataset to date, together with quality-filtered Franka (DROID) and SO100/101 subsets. We provide OpenFAST, an open-weight, open-data action tokenizer trained on millions of trajectories across five embodiments. We redesign the architecture to graft a flow-matching continuous-action expert onto a discrete-token VLM via per-layer KV-cache conditioning. Finally, we propose MolmoThink, an adaptive-depth reasoning variant that re-predicts depth tokens only for scene regions that change between timesteps, retaining geometric grounding at a fraction of prior latency. In the most extensive empirical study of any open VLA to date, spanning 7 simulation and real-world benchmarks, MolmoAct2 outperforms strong baselines including Pi-05, while MolmoER surpasses GPT-5 and Gemini Robotics ER-1.5 across 13 embodied-reasoning benchmarks. We release model weights, training code, and complete training data.
#7 Seeing Realism from Simulation · h=19
2026-05-04 cs.CV cs.RO Shan You · h=19
Chenyu Hui, Xiaodi Huang, Siyu Xu, Yunke Wang, Shan You
Core Contributions
  • Converts simulated VLA videos into photorealistic training data while preserving exact action trajectories and task semantics — unlike prior sim-to-real transfer that focuses on static images or loses action alignment
  • Diffusion feature-reuse mechanism shares video tokens across adjacent timesteps, making generation practical at the scale needed for VLA training rather than prohibitively expensive
  • Coreset sampling strategy identifies a compact, non-redundant subset of simulation data for augmentation, maximizing diversity under fixed compute budgets
  • Improves RDT-1B by 8% on Robotwin 2.0 and π₀ by 5.1% on the challenging LIBERO-Plus benchmark — demonstrating consistent gains across different VLA architectures
Abstract
Vision-language-action (VLA) models typically rely on large-scale real-world videos, whereas simulated data, despite being inexpensive and highly parallelizable to collect, often suffers from a substantial visual domain gap and limited environmental diversity, resulting in weak real-world generalization. We present an efficient video augmentation framework that converts simulated VLA videos into realistic training videos while preserving task semantics and action trajectories. Our pipeline extracts structured conditions from simulation via video semantic segmentation and video captioning, rewrites captions to diversify environments, and uses a conditional video transfer model to synthesize realistic videos. To make augmentation practical at scale, we introduce a diffusion feature-reuse mechanism that reuses video tokens across adjacent timesteps to accelerate generation, and a coreset sampling strategy that identifies a compact, non-redundant subset for augmentation under limited computation. Extensive experiments on Robotwin 2.0, LIBERO, LIBERO-Plus, and a real robotic platform demonstrate consistent improvements.
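The abstract does not say which coreset algorithm is used. As a hedged illustration of picking a compact, non-redundant subset, the sketch below uses greedy k-center (farthest-point) selection, a common coreset heuristic swapped in here for illustration. The feature vectors, the `budget` parameter, and the function name are assumptions, not the authors' implementation.

```python
import numpy as np

def k_center_greedy(features, budget, first=0):
    """Greedy farthest-point selection (a stand-in coreset heuristic).

    features: (n, d) array, one embedding per simulated clip (hypothetical).
    budget:   number of clips we can afford to augment.
    Returns the indices of the selected clips, in selection order.
    """
    selected = [first]
    # Distance from every clip to its nearest already-selected clip.
    dist = np.linalg.norm(features - features[first], axis=1)
    while len(selected) < budget:
        nxt = int(np.argmax(dist))  # the clip least covered so far
        selected.append(nxt)
        new_dist = np.linalg.norm(features - features[nxt], axis=1)
        dist = np.minimum(dist, new_dist)
    return selected

# Toy embeddings: two near-duplicate pairs plus one outlier.
clips = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [10.1, 0.0], [5.0, 5.0]])
print(k_center_greedy(clips, budget=3))
```

On the toy data the selection takes one clip from each near-duplicate pair plus the outlier, which is the kind of redundancy removal the abstract describes.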
#11 Latent Bridge · h=13
2026-05-04 cs.RO Taotao Jing · h=13
Yudong Liu, Yuan Li, Zijia Tang, Yuxi Zheng, Yueqian Lin
Core Contributions
  • Identifies that VLM backbone features are temporally redundant in dual-system VLAs, and exploits this by predicting feature deltas rather than recomputing full VLM outputs at every control step
  • Demonstrates generality across two architecturally distinct VLAs — GR00T-N1.6 (feature-space bridge) and π0.5 (KV-cache bridge) — showing the approach is not architecture-specific
  • Achieves 95–100% performance retention while reducing expensive VLM calls by 50–75%, yielding 1.65–1.73× net per-episode speedup across LIBERO, RoboCasa, and ALOHA benchmarks
  • Task-agnostic DAgger training pipeline transfers across benchmarks without modification, avoiding the need for task-specific fine-tuning of the bridge module
Abstract
Dual-system Vision-Language-Action (VLA) models achieve state-of-the-art robotic manipulation but are bottlenecked by the VLM backbone, which must execute at every control step while producing temporally redundant features. We propose Latent Bridge, a lightweight model that predicts VLM output deltas between timesteps, enabling the action head to operate on predicted outputs while the expensive VLM backbone is called only periodically. We instantiate Latent Bridge on two architecturally distinct VLAs: GR00T-N1.6 (feature-space bridge) and π0.5 (KV-cache bridge), demonstrating that the approach generalizes across VLA designs. Our task-agnostic DAgger training pipeline transfers across benchmarks without modification. Across four LIBERO suites, 24 RoboCasa kitchen tasks, and the ALOHA sim transfer-cube task, Latent Bridge achieves 95-100% performance retention while reducing VLM calls by 50-75%, yielding 1.65-1.73x net per-episode speedup.
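The control flow described above can be sketched as a loop that calls the expensive backbone only on some steps and applies a cheap predicted delta on the others. This is a minimal sketch under stated assumptions: the fixed `refresh_every` schedule, the toy `backbone`, `bridge`, `action_head`, and `observe` functions, and the zero-delta bridge are all hypothetical placeholders, not the paper's models or schedule.

```python
import numpy as np

def run_episode(num_steps, refresh_every, backbone, bridge, action_head, observe):
    """Call the backbone every `refresh_every` steps; bridge the steps between."""
    features = None
    backbone_calls = 0
    actions = []
    for t in range(num_steps):
        obs = observe(t)
        if t % refresh_every == 0:
            features = backbone(obs)  # expensive: recompute features exactly
            backbone_calls += 1
        else:
            # cheap: add a predicted change to the previous features
            features = features + bridge(features, obs)
        actions.append(action_head(features))
    return actions, backbone_calls

# Toy stand-ins (hypothetical), just enough to make the loop runnable.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))

def observe(t):
    return np.full(4, 0.01 * t)

def backbone(obs):
    return np.tanh(obs @ W)

def bridge(features, obs):
    return np.zeros_like(features)  # placeholder: predicts "no change"

def action_head(features):
    return features[:2]

for k in (2, 4):
    _, calls = run_episode(100, k, backbone, bridge, action_head, observe)
    print(f"refresh every {k} steps: {1 - calls / 100:.0%} fewer backbone calls")
```

Under this fixed-interval assumption, refreshing every 2 to 4 steps corresponds to the reported 50–75% reduction in VLM calls. The reported 1.65–1.73× net speedup is lower than the call reduction alone would suggest, presumably because the bridge and action head still run at every step.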
#23 Semantic Autonomy Framework · h=5
2026-05-04 cs.RO cs.AI B. Abaza · h=5
Bogdan Felician Abaza, Andrei-Alexandru Staicu, Cristian Vasile Doicin
Core Contributions
  • Handles 88% of natural language navigation instructions in under 0.1 ms via a seven-step parametric resolver, escalating only genuinely ambiguous instructions to the VLM, which makes VLM-integrated navigation feasible on a Raspberry Pi 5 without a GPU
  • Introduces cross-robot semantic memory transfer: preferences learned through VLM interactions on one robot are compiled into a shared digest and transferred to a second robot, achieving a measured 103,000-fold latency reduction
  • Validates 100% semantic transfer accuracy (33/33 decisions, 95% CI [0.894, 1.000]) across two custom differential-drive robots over three sessions with zero training data required
  • Unlike prior VLM-for-robotics work that treats the language model as always-on, this system explicitly manages the compute/accuracy tradeoff by categorizing instructions by ambiguity level
Abstract
Autonomous indoor mobile robots can navigate reliably to metric coordinates using established frameworks such as ROS 2 Navigation 2, yet they lack the ability to interpret natural language instructions that express intent rather than positions. Vision-Language Models offer the semantic reasoning required to bridge this gap, but their inference latency (2-9 seconds per decision on consumer hardware) and session-by-session amnesia limit practical deployment. This paper presents the Semantic Autonomy Stack, a six-layer reference framework for semantically autonomous indoor navigation, and validates a complete instance featuring hybrid deterministic-VLM reasoning and cross-robot adaptive memory on physical robots with off-the-shelf edge hardware.
Manipulation & Grasping
#1 PIEGraph · h=49
2026-05-04 cs.RO cs.AI cs.CV cs.LG G. Konidaris · h=49
Sergio Orozco, Tushar Kusnur, Brandon May, George Konidaris, Laura Herlant
Core Contributions
  • Combines a physically informed spring-mass analytical model with an equivariant GNN to achieve accurate dynamics prediction for both rigid and deformable objects from limited real-world interactions — unlike pure data-driven approaches that need orders of magnitude more data
  • The equivariant GNN exploits symmetries in particle interactions via a novel action representation, guiding the analytical model rather than replacing it, which enforces physically feasible motion over long horizons
  • Validated on robot hardware for reorientation and repositioning of ropes, cloth, stuffed animals, and rigid objects — demonstrating breadth across object categories that most prior methods handle individually
  • Enables reliable downstream manipulation planning, outperforming state-of-the-art baselines in both prediction accuracy and task success on physical robots
Abstract
Learning data-efficient object dynamics models for robotic manipulation remains challenging, especially for deformable objects. A popular approach is to model objects as sets of 3D particles and learn their motion using graph neural networks. In practice, this is not enough to maintain physical feasibility over long horizons and may require large amounts of interaction data to learn. We introduce PIEGraph, a novel approach to combining analytical physics and data-driven models to capture object dynamics for both rigid and deformable bodies using limited real-world interaction data. PIEGraph consists of two components: (1) a Physically Informed particle-based analytical model (implemented as a spring-mass system) to enforce physically feasible motion, and (2) an Equivariant Graph Neural Network with a novel action representation that exploits symmetries in particle interactions to guide the analytical model. We evaluate PIEGraph in simulation and on robot hardware for reorientation and repositioning tasks with ropes, cloth, stuffed animals and rigid objects. We show that our method enables accurate dynamics prediction and reliable downstream robotic manipulation planning, which outperforms state of the art baselines.
#15 Robotic Desk Organization · h=9
2026-05-04 cs.RO Jinjun Duan · h=9
Yi Dong, Yangjun Liu, Jinjun Duan, Yang Li, Zhendong Dai
Core Contributions
  • Introduces environment-assisted manipulation primitives — contact-based grasping, edge-based push-grasping, and levering-based grasping — that exploit table edges and inter-object constraints rather than relying solely on gripper dexterity
  • Handles both rigid and deformable planar objects with a unified task planner, unlike prior work that typically addresses one object type
  • Perception pipeline augments existing datasets with uncommon desktop items and performs geometry-based pose and keypoint estimation alongside environmental constraint detection
  • Real-world experiments demonstrate robust multi-object organization including collection and stacking tasks across heterogeneous object sets
Abstract
Desktop organization remains challenging for service robots because of heterogeneous objects and diverse manipulation objectives, such as collection and stacking. In this article, a task-oriented framework is presented for organizing planar rigid and deformable objects on desks. A perception pipeline was developed that augments existing datasets with uncommon desktop items and makes geometry-based pose and keypoint estimation possible, along with the detection of environmental constraints, such as table edges. To handle diverse manipulation requirements, environment-assisted primitives are used, including contact-based grasping for small objects, edge-based push-grasping for planar rigid objects, and levering-based grasping for planar deformable objects. These primitives leverage environmental and interobject constraints to improve robustness. A task planner was designed to integrate these primitives into multiobject organization.
#16 ShapeGrasp · h=7
2026-05-04 cs.RO Lukas Rustler · h=7
Lukas Rustler, Matej Hoffmann
Core Contributions
  • First approach to update object shape representations after real-world grasp attempts — each grasp yields tactile contacts and gripper-occupied space that refine the implicit surface model for subsequent attempts
  • Couples implicit surface visuo-haptic shape completion with physics-based grasp planning in an iterative loop: infer shape → plan grasp → execute → update shape from feedback → regrasp if needed
  • Achieves 84% grasp success with a three-finger gripper and 91% with a two-finger gripper on real robots, outperforming baselines while simultaneously improving 3D reconstruction quality across all metrics
  • Works from a single RGB-D view without object-specific training, making it applicable to novel objects encountered in unstructured environments
Abstract
Humans grasp unfamiliar objects by combining an initial visual estimate with tactile and proprioceptive feedback during interaction. We present ShapeGrasp, a robotic implementation of this approach. The proposed method is an iterative grasp-and-complete pipeline that couples implicit surface visuo-haptic shape completion with physics-based grasp planning. From a single RGB-D view, ShapeGrasp infers a complete shape, generates candidate grasps via rigid-body simulation, and executes the best feasible grasp. Each grasp attempt yields additional geometric constraints — tactile surface contacts and space occupied by the gripper body — which are fused to update the object shape. Failures trigger pose re-estimation and regrasping using the refined shape.
#20 Visibility-Aware Mobile Grasping · h=6
2026-05-04 cs.RO Anxing Xiao · h=6
Tianrun Hu, Anxing Xiao, David Hsu, Hanbo Zhang
Core Contributions
  • Addresses the fundamental see-vs-move tradeoff in mobile manipulation: the robot must balance gathering visual information about unobserved regions with making task progress, all under a limited field of view
  • Combines a whole-body planner with velocity-aware active perception for safe navigation in dynamic environments, and a behavior-tree-based high-level planner for adaptive subgoal generation and runtime failure recovery
  • Achieves 68.8% and 58.0% success in unknown static and dynamic environments respectively, improving over the baseline by 22.8% and 18.0% — validated on a Fetch mobile manipulator in real-world deployment
  • Unlike prior approaches that assume known or static environments and decouple seeing from acting, this system jointly optimizes visibility and motion in a unified framework
Abstract
This paper addresses the problem of mobile grasping in dynamic, unknown environments where a robot must operate under a limited field-of-view. The fundamental challenge is the inherent trade-off between "seeing" around to reduce environmental uncertainty and "moving" the body to achieve task progress in a high-dimensional configuration space, subject to visibility constraints. Previous approaches often assume known or static environments and decouple these objectives, failing to guarantee safety when unobserved dynamic obstacles intersect the robot's path during manipulation. In this paper, we propose a unified mobile grasping system comprising two core components: (1) an iterative low-level whole-body planner coupled with velocity-aware active perception to navigate dynamic environments safely; and (2) a hierarchical high-level planner based on behavior trees that adaptively generates subgoals to guide the robot through exploration and runtime failures.
Navigation & SLAM
#14 SAGA UAV Planner · h=9
2026-05-04 cs.RO Sio-Kei Im · h=9
Junhao Wei, Yanxiao Li, Dexing Yao, Yifu Zhao, Haochen Li
Core Contributions
  • Achieves 100% navigation success across all tested speed settings (2.0–4.0 m/s) in cluttered environments, while YOPO drops from 90.91% to 62.50% and Ego-planner from 71.43% to 52.63% as speed increases
  • Formulates UAV local planning as a one-stage joint regression-and-ranking problem over motion anchors — a single forward pass predicts refined terminal states and planning scores for all candidates simultaneously
  • Introduces polar positional encoding derived from anchor yaw and pitch to preserve directional structure in the self-attention token space, enabling cross-anchor global reasoning about obstacle geometry
  • At 4.0 m/s, improves minimum safety clearance from 0.44 m (YOPO) to 0.76 m while reducing total flight time from 40.5 s to 27.5 s, making navigation both safer and faster
Abstract
Agile unmanned aerial vehicle (UAV) navigation in cluttered environments demands a planning architecture that is both computationally efficient and structurally expressive enough to reason over multiple feasible motions. This paper presents SAGA, a robust self-attention and goal-aware anchor-based planner for safe UAV autonomous navigation. SAGA formulates local planning as a one-stage joint regression-and-ranking problem over a fixed lattice of motion anchors. Given a depth image and a body-frame motion state, the planner predicts refined terminal states and planning scores for all anchors in a single forward pass, after which the best candidate is decoded into a dynamically feasible trajectory.
#22 Procedural Map RL Navigation · h=5
2026-05-04 cs.RO cs.LG Christian Jestel · h=5
Christian Jestel, Nicolas Bach, Marvin Wiedemann, Jan Finke, Peter Detzner
Core Contributions
  • Provides the first systematic comparison of how procedural map generator types (sparse, maze, graph, Wave Function Collapse) affect RL navigation policy generalization — revealing strongly asymmetric cross-generator transfer (sparse specialist drops to 3.3% on mazes)
  • A policy trained on the combined generator set achieves 91.5±1.1% mean success across all environments, demonstrating that training diversity is the primary driver of generalization
  • Shows A* path-planner subgoal inputs are the dominant robustness factor (raising success to 98.9±0.4%), outperforming GRU recurrence which only helps reactive baselines — challenging the assumption that memory architectures are the key to navigation generalization
  • Learned DRL policies outperform a classical Carrot+A* controller, which matches success only at low speeds (1.0 m/s) but collapses to 24.9% at 2.0 m/s, highlighting learned speed adaptation as the decisive advantage
Abstract
Deep reinforcement learning (DRL) navigation policies often overfit to the structure of their training environments, as environmental diversity is typically constrained by the manual effort required to design diverse scenarios. While procedural map generation offers scalable diversity, no prior work systematically compares how different generator types affect policy generalization. We integrate four generators (sparse, maze, graph, and Wave Function Collapse) with guaranteed navigability into MuRoSim, a 2D simulator focusing on training efficiency for LiDAR-based navigation.
#26 Semantic Risk-Aware Planning · h=3
2026-05-04 cs.RO Hamza Durrani · h=3
Hamza Ahmed Durrani, Rafay Suleman Durrani
Core Contributions
  • Encodes LLM-inspired semantic cost functions that penalize geometrically cluttered or high-risk zones into an A* search framework with closed-loop replanning — a lightweight alternative to running a full LLM at planning time
  • Achieves 62.0% task success versus 56.5% for BFS with replanning and 4.0% for Greedy without replanning across 200 randomized trials in a 15×15 grid with dynamic obstacles
  • Obstacle-density ablation shows semantic cost shaping consistently improves navigation across varying difficulty levels, suggesting the benefit is not specific to one environment regime
Abstract
The integration of Large Language Model (LLM) reasoning principles into classical robot path planning represents a rapidly emerging research direction. In this paper, we propose a Semantic Risk-Aware Heuristic (SRAH) planner that encodes LLM-inspired cost functions penalising geometrically cluttered or high-risk zones into an A* search framework, augmented with closed-loop replanning upon dynamic obstacle detection.
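To make the idea of cost shaping inside A* concrete, here is a minimal sketch on a 4-connected occupancy grid. The specific penalty (the number of obstacle cells in the 8-neighbourhood, scaled by `risk_weight`) is an assumption chosen for illustration; the SRAH cost function itself is not given in the abstract. Closed-loop replanning would amount to calling `astar` again from the current cell whenever a new obstacle is detected.

```python
import heapq

def clutter(grid, r, c):
    """Count obstacle cells (value 1) in the 8-neighbourhood of (r, c)."""
    rows, cols = len(grid), len(grid[0])
    return sum(
        grid[rr][cc]
        for rr in range(max(0, r - 1), min(rows, r + 2))
        for cc in range(max(0, c - 1), min(cols, c + 2))
        if (rr, cc) != (r, c)
    )

def astar(grid, start, goal, risk_weight=0.5):
    """A* where entering a cell costs 1 + risk_weight * clutter(cell)."""
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        # Manhattan distance stays admissible because each step costs >= 1.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(h(start), 0.0, start)]
    best = {start: 0.0}
    parent = {start: None}
    while frontier:
        _, g, cell = heapq.heappop(frontier)
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        if g > best[cell]:
            continue  # stale queue entry
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1.0 + risk_weight * clutter(grid, nr, nc)
                if ng < best.get(nxt, float("inf")):
                    best[nxt] = ng
                    parent[nxt] = cell
                    heapq.heappush(frontier, (ng + h(nxt), ng, nxt))
    return None  # goal unreachable

# 5x5 grid with a single obstacle in the centre.
grid = [[0] * 5 for _ in range(5)]
grid[2][2] = 1
path = astar(grid, (0, 0), (4, 4))
print(path)
```

With the penalty active, the planner prefers a route along the grid border over equally short routes that brush past the obstacle.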
#27 LiDAR Teach, Radar Repeat · h=3
2026-05-04 cs.RO Yushuai Chen · h=3
Renxiang Xiao, Yichen Chen, Yuanfan Zhang, Qianyi Shao, Yushuai Chen
Core Contributions
  • First cross-modal, cross-platform LiDAR-Teach-and-Radar-Repeat navigation system: teaches with precise LiDAR under good conditions, repeats with robust 4D radar under degraded conditions (nighttime smoke, weather)
  • Cross-Modal Registration network jointly exploits Doppler-based motion priors and physical laws governing LiDAR intensity and radar power density to align sparse, noisy radar with dense LiDAR maps
  • Adaptive fine-tuning incrementally updates the registration network based on localization errors without ground-truth labels, enabling long-term adaptability to static environmental changes
  • Validated on 3 robot platforms over more than 40 km and 6 months, achieving centimeter-level accuracy and significantly outperforming existing cross-modal approaches in the most extensive deployment reported for radar-based teach-and-repeat
Abstract
Long-term autonomy requires robust navigation in environments subject to dynamic and static changes, as well as adverse weather conditions. Teach-and-Repeat (T&R) navigation offers a reliable and cost-effective solution by avoiding the need for consistent global mapping; however, existing T&R systems lack a systematic solution to tackle various environmental variations such as weather degradation, ephemeral dynamics, and structural changes. This work proposes LTR², the first cross-modal, cross-platform LiDAR-Teach-and-Radar-Repeat system that systematically addresses these challenges.
#28 DynoSLAM · h=3
2026-05-04 cs.RO cs.CV Gonzalo Ferrer · h=3
Danil Tokhchukov, Veronika Morozova, Gonzalo Ferrer
Core Contributions
  • Integrates socially-aware GNN-based pedestrian prediction directly into the SLAM factor graph via a dynamic Mahalanobis distance factor — unlike conventional approaches that treat SLAM and motion prediction as separate pipelines
  • Uses Monte Carlo rollouts from a stochastic World Model formulation to capture multimodal epistemic uncertainty of human interactions, avoiding the "argmax problem" that causes deterministic prediction approaches to fail
  • Extracts empirical mean and covariance matrices of future pedestrian states to provide a mathematically rigorous probabilistic safety envelope for downstream local planners in crowded environments
  • Demonstrates through extensive simulations that the stochastic formulation prevents optimization failures while maintaining highly accurate retrospective tracking of dynamic agents
Abstract
Traditional Simultaneous Localization and Mapping (SLAM) algorithms rely heavily on the static environment assumption, which severely limits their applicability in real-world spaces populated by moving entities, such as pedestrians. In this work, we propose DynoSLAM, a tightly-coupled Dynamic GraphSLAM architecture that integrates socially-aware Graph Neural Networks (GNNs) directly into the factor graph optimization. Unlike conventional approaches that use rigid constant-velocity heuristics or deterministic single-agent neural priors, our framework formulates pedestrian motion forecasting as a stochastic World Model.
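The step from Monte Carlo rollouts to a Mahalanobis-distance term can be sketched in a few lines. The rollouts below come from a noisy constant-velocity toy model standing in for the stochastic GNN world model, so the sampler, the horizon, and the query point are all assumptions. Only the mean and covariance extraction and the distance itself mirror what the summary describes.

```python
import numpy as np

def empirical_gaussian(samples):
    """Empirical mean and covariance of (n, d) sampled future states."""
    mean = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)  # unbiased estimate (divides by n - 1)
    return mean, cov

def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance of x from the predicted distribution."""
    diff = x - mean
    return float(diff @ np.linalg.solve(cov, diff))

# Toy rollouts: where a pedestrian at the origin might be in 2 s (hypothetical).
rng = np.random.default_rng(0)
velocity = np.array([1.0, 0.5])
horizon = 2.0
rollouts = horizon * (velocity + rng.normal(scale=0.2, size=(500, 2)))

mean, cov = empirical_gaussian(rollouts)
d2 = mahalanobis_sq(np.array([2.0, 1.0]), mean, cov)
print(mean, d2)
```

A small squared distance means the query position lies well inside the predicted pedestrian distribution. In a factor graph the same quantity would serve as a residual weighted by the inverse covariance, and a planner could threshold it to form a safety envelope.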
#29 Trailer-Truck Parking · h=2
2026-05-04 cs.RO J. Fletcher · h=2
George Alenchery, Thomas Jeske, Tova Quinones, Lentz Fortune, Tristan Lindo-Slones
Core Contributions
  • Proposes a unified framework for autonomous trailer-truck parking integrating sensor fusion, Hybrid A* path planning, NMPC control, and infrastructure awareness for the particularly challenging articulated vehicle domain
  • Adapts an open-source A* path planning simulation to incorporate a tractor-trailer kinematic model, demonstrating feasibility of articulated vehicle planning in a command-line simulation environment
  • Identifies jackknife prevention as a critical remaining challenge for autonomous trailer-truck parking, providing a roadmap for future system-level coordination work
Abstract
Autonomous driving technology has rapidly evolved over the past decade, offering significant improvements in transportation efficiency, safety, and cost reduction. While much of the progress has focused on highway driving and obstacle avoidance, low-speed maneuvers such as parking remain among the most difficult challenges for autonomous systems. This challenge is especially pronounced in trailer-truck transport vehicles due to their articulated motion and environmental constraints. This paper presents a proposed framework for autonomous truck parking that integrates perception, motion planning, control systems, and infrastructure awareness.
Control & State Estimation
#2 OT-MPC · h=47
2026-05-04 cs.RO math.OC Evangelos A. Theodorou · h=47
Vincent Pacelli, Akash Ratheesh, Evangelos A. Theodorou
Core Contributions
  • Replaces the information-theoretic (KL-divergence) foundation of MPPI and CEM with an entropy-regularized optimal transport formulation, directly addressing the mode-averaging pathology that plagues these methods on multimodal cost landscapes
  • Computes an optimal coupling between candidate control sequences and low-cost proposals, refining each candidate toward its nearest promising sample while maintaining ensemble coverage of the full solution space
  • Derives closed-form, gradient-free updates via the Sinkhorn algorithm, preserving the real-time performance advantage of sampling-based MPC while gaining geometric awareness
  • Demonstrates improved success rates over MPPI and CEM on navigation, manipulation, and locomotion tasks — the first application of OT-based sampling to real-time nonlinear control
Abstract
Sampling-based model predictive control methods like MPPI and CEM are essential for real-time control of nonlinear robotic systems, particularly where discontinuous dynamics preclude gradient-based optimization. However, these methods derive from information-theoretic objectives that are agnostic to the geometry of the control problem, leading to pathological behaviors such as mode-averaging when the cost landscape is complex. We present OT-MPC, a sampling-based algorithm that overcomes these limitations through an entropy-regularized optimal transport formulation. By computing an optimal coupling between candidate control sequences and low-cost proposals, OT-MPC refines candidates toward nearby promising samples while coordinating updates across the ensemble to maintain coverage of the solution space.
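The coupling idea can be illustrated with a plain Sinkhorn iteration followed by a barycentric projection. This is a generic entropy-regularized OT sketch, not the OT-MPC update rule. The squared-distance cost, the uniform marginals, the value of `eps`, and the one-dimensional toy "control sequences" are all assumptions.

```python
import numpy as np

def sinkhorn(cost, a, b, eps, iters=200):
    """Entropy-regularized OT plan between marginals a and b (plain Sinkhorn)."""
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def ot_refine(candidates, proposals, eps=0.1):
    """Move each candidate to the plan-weighted average of the proposals."""
    diff = candidates[:, None, :] - proposals[None, :, :]
    cost = (diff ** 2).sum(axis=-1)
    a = np.full(len(candidates), 1.0 / len(candidates))
    b = np.full(len(proposals), 1.0 / len(proposals))
    plan = sinkhorn(cost, a, b, eps)
    return plan @ proposals / plan.sum(axis=1, keepdims=True)

# Two equally good modes at -2 and +2, with candidates starting at -1 and +1.
candidates = np.array([[-1.0], [1.0]])
proposals = np.array([[-2.0], [2.0]])

refined = ot_refine(candidates, proposals)
averaged = proposals.mean(axis=0)  # one weighted mean over all proposals
print(refined.ravel(), averaged)
```

In this toy case each candidate is pulled to its nearby mode, while a single average over both proposals lands at 0, between the modes, which is the mode-averaging failure the abstract attributes to MPPI and CEM. For very small `eps` the kernel `exp(-cost / eps)` underflows, and a log-domain Sinkhorn is the usual remedy.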
#3 Robust Adaptive Aerial MPC · h=43
2026-05-04 cs.RO eess.SY M. Zeilinger · h=43
Péter Antal, Andrea Carron, Melanie Zeilinger, Roland Tóth, Tamás Péni
Core Contributions
  • First MPC approach for autonomous pick-and-place between moving platforms using a hook-equipped aerial manipulator — a significantly more complex task than static-platform aerial manipulation
  • Uses a MuJoCo digital twin as the predictive model, enabling rapid and accurate modeling of the complex quadcopter-hook-payload dynamics without manual equation derivation
  • Integrates zero-order robust optimization (zoRO) for uncertainty propagation with an EKF for online parameter estimation, ensuring robust constraint satisfaction under aerodynamic uncertainty and unknown payloads
  • Validated in both complex simulated scenarios and real-world flight experiments, demonstrating computational efficiency suitable for onboard deployment
Abstract
This paper presents a novel model predictive control (MPC) approach for autonomous pick-and-place between moving platforms with a hook-equipped aerial manipulator. First, for accurate and rapid modeling of the complex dynamics, a digital twin model of the quadcopter equipped with a hook-based gripper, implemented in MuJoCo, is constructed and used as the predictive model for the MPC. To handle uncertainties of the predictive model (e.g. due to aerodynamics and uncertain payloads), a robust adaptive MPC approach is proposed. By systematic integration of zero-order robust optimization (zoRO) based uncertainty propagation and an extended Kalman filter (EKF) for parameter estimation, the MPC algorithm ensures robust constraint satisfaction, high performance, and computational efficiency.
13 h=13
2026-05-04 cs.RO eess.SY Ting Yuan · h=13
Chang Liu, Wenhan Cao, Zeju Sun, Tianyi Zhang, Jiayu Yuan
Core Contributions
  • Reframes Gaussian filtering from an information-geometric perspective, using natural gradient descent on the statistical manifold of Gaussian distributions to iteratively refine posterior mean and covariance
  • Proves that a single natural-gradient step exactly recovers the classical Kalman measurement update in the linear-Gaussian case, establishing a direct theoretical connection between information geometry and Kalman filtering
  • The NANO filter preserves positive definiteness of the covariance matrix by construction (following the manifold geometry), avoiding the numerical issues that plague ad-hoc covariance corrections in standard EKF/UKF
  • Demonstrated on satellite attitude estimation, SLAM, and state estimation for quadruped and humanoid robots, showing practical benefits across a diverse range of nonlinear estimation problems
Abstract
Bayesian filtering is a cornerstone of state estimation in complex systems such as aerospace systems, yet exact solutions are available only for linear Gaussian models. In practice, nonlinear systems are handled through tractable approximations, with Gaussian filters such as the extended and unscented Kalman filters being among the most widely used methods. This tutorial revisits Gaussian filtering from an information-geometric perspective, viewing the prediction and measurement update steps as inference procedures over state distributions. Within this framework, we introduce a geometry-aware Gaussian filtering approach that leverages natural gradient descent on the statistical manifold of Gaussian distributions.
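For intuition about the linear-Gaussian claim above, the scalar sketch below compares the classical gain-form Kalman measurement update with the same update written in natural (information) parameters, where the measurement simply adds to the prior precision. In conjugate linear-Gaussian models a unit-length natural-gradient step is generally understood to land on this additive update, which is consistent with the one-step equivalence stated in the contributions. This is a consistency check under assumed scalar values, not the NANO filter itself; the function names and numbers are hypothetical.

```python
def kalman_update(mean, var, z, h, r):
    """Classical scalar Kalman measurement update (gain form)."""
    gain = var * h / (h * var * h + r)
    new_mean = mean + gain * (z - h * mean)
    new_var = (1.0 - gain * h) * var
    return new_mean, new_var

def information_update(mean, var, z, h, r):
    """Same update in natural parameters: precision and precision-weighted mean."""
    precision = 1.0 / var + h * h / r      # prior precision plus measurement information
    info_mean = mean / var + h * z / r     # precision-weighted means add as well
    return info_mean / precision, 1.0 / precision

# Hypothetical prior N(1, 4), measurement z = 3 with model z = 2 x + noise, noise variance 1.
gain_form = kalman_update(1.0, 4.0, 3.0, 2.0, 1.0)
info_form = information_update(1.0, 4.0, 3.0, 2.0, 1.0)
```

Because the posterior precision is a sum of positive terms, the information form also makes it easy to see why the updated variance stays positive in this linear case.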
24 h=5
2026-05-04 cs.RO cs.MS F. Kuehnel · h=5
Frank O. Kuehnel
Core Contributions
  • Provides a practical hybrid analytical/AD recipe for computing exact Hessians and higher-order derivative tensors of SE(3) objectives — placing the analytical/AD seam at the point-action interface y=Tx to maximize efficiency
  • The seeded-Hessian path is approximately 5× faster than finite-differencing the AD gradient while matching a nested-AD oracle to machine precision, adding only ~70 lines of analytical-Jacobian code over an AD-only baseline
  • Identifies and fixes a removable singularity in the standard SO(3)/SE(3) scalar basis that produces NaNs at the origin under seeded AD — a previously undocumented pitfall for practitioners
  • Enables exact Newton steps, observed-information covariance estimates, and covariance correction for SE(3) optimization problems without finite-difference tuning — critical for robotics SLAM and state estimation
Abstract
Fast prototyping of new SE(3) estimation objectives remains awkward in practice. Modern Lie-group frameworks — GTSAM, manif, Sophus, SymForce, Ceres — target first-order workloads through different code-generation and automatic-differentiation strategies. The remaining gap is a compact, AD-safe path from these first-order primitives to exact Hessians, observed-information matrices, and higher-order derivative tensors. This paper presents a hybrid analytical/AD recipe for SE(3) negative log-likelihoods.
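The removable-singularity pitfall can be illustrated with two scalar coefficients that appear in the SO(3) exponential map, sin(θ)/θ and (1 - cos θ)/θ². A common guard, sketched below, evaluates them as functions of θ² and switches to a Taylor series near the origin, so that neither a square root nor a 0/0 is evaluated there. Whether this matches the paper's fix is an assumption; the function name and the threshold are hypothetical.

```python
import math

def so3_coeffs(theta_sq, threshold=1e-8):
    """Return (sin(t)/t, (1 - cos(t))/t^2) for t = sqrt(theta_sq), safe at theta_sq = 0."""
    if theta_sq < threshold:
        # Taylor series in theta^2: polynomial, so smooth at the origin.
        a = 1.0 - theta_sq / 6.0 + theta_sq ** 2 / 120.0
        b = 0.5 - theta_sq / 24.0 + theta_sq ** 2 / 720.0
        return a, b
    t = math.sqrt(theta_sq)
    return math.sin(t) / t, (1.0 - math.cos(t)) / theta_sq

# Evaluating exactly at the identity rotation no longer divides by zero.
a0, b0 = so3_coeffs(0.0)
```

Writing the small-angle branch as a polynomial in θ² is what keeps derivatives finite at the origin, since the derivative of a square root is unbounded at zero.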
Perception & Scene Understanding
4 h=35
2026-05-04 cs.RO cs.CV Abhinav Valada · h=35
Simon Dorer, Martin Büchner, Nick Heppert, Abhinav Valada
Core Contributions
  • Proposes a training-free depth grounding framework that anchors monocular depth foundation model predictions in raw sensor depth through patch-wise affine alignment via factor graph optimization — preserving fine-grained geometric structure while correcting metric scale
  • Introduces a benchmark dataset with dense scene-wide ground truth depth for non-Lambertian objects (transparent, specular surfaces) obtained via matte reflection spray and multi-camera fusion — overcoming the reliance on CAD-based annotations
  • Works across diverse sensors and domains without any retraining, making it immediately applicable to existing depth sensor setups that struggle with reflective materials
  • Addresses a practical robotics pain point: depth sensors fail on transparent and specular surfaces, while monocular foundation models recover good structure but unreliable metric scale; this method keeps the model's structure and anchors it to the sensor's metric measurements
Abstract
Dense and accurate depth estimation is essential for robotic manipulation, grasping, and navigation, yet currently available depth sensors are prone to errors on transparent, specular, and general non-Lambertian surfaces. To mitigate these errors, large-scale monocular depth estimation approaches provide strong structural priors, but their predictions can be potentially skewed or mis-scaled in metric units, limiting their direct use in robotics. Thus, in this work, we propose a training-free depth grounding framework that anchors monocular depth estimation priors from a depth foundation model in raw sensor depth through factor graph optimization.
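The idea of affine alignment can be sketched for a single patch as an ordinary least-squares fit of a scale and a shift that map monocular depth onto the valid sensor readings. This is a simplified stand-in: the paper couples patches through factor graph optimization, which is not reproduced here, and the function name, the use of `None` for invalid readings, and the sample values are assumptions.

```python
def fit_affine(mono_depth, sensor_depth):
    """Least-squares scale and shift mapping mono depth onto valid sensor depth."""
    # Keep only pixels where the sensor returned a reading.
    pairs = [(m, s) for m, s in zip(mono_depth, sensor_depth) if s is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two valid sensor readings")
    mean_m = sum(m for m, _ in pairs) / n
    mean_s = sum(s for _, s in pairs) / n
    var_m = sum((m - mean_m) ** 2 for m, _ in pairs)
    if var_m == 0.0:
        raise ValueError("mono depth is constant; scale is unobservable")
    cov = sum((m - mean_m) * (s - mean_s) for m, s in pairs)
    scale = cov / var_m
    shift = mean_s - scale * mean_m
    return scale, shift

# Hypothetical patch: the sensor drops one reading (None), e.g. on glass.
mono = [1.0, 2.0, 3.0, 4.0]
sensor = [2.5, None, 6.5, 8.5]
scale, shift = fit_affine(mono, sensor)
grounded = [scale * m + shift for m in mono]
```

The fitted transform is applied to every pixel in the patch, including the one the sensor missed, which is how the monocular structure fills in regions where raw depth is unavailable.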
5 h=35
2026-05-04 cs.CV cs.AI cs.RO Abhinav Valada · h=35
Yao Lu, Rohit Mohan, Florian Drews, Yakov Miron, Abhinav Valada
Core Contributions
  • Exploits the natural hierarchy of semantic categories (e.g., "dog" → "animal" → "object") by learning embeddings in hyperbolic space, enabling unknown objects to remain close to higher-level concepts even when their fine-grained category was never seen during training
  • Does not require explicit modeling of unknowns during training — unlike prior open-set methods that need proxy unknown classes or out-of-distribution sampling
  • Achieves the best balance between unknown object discovery and in-distribution robustness across MS COCO, Cityscapes, and Lost&Found benchmarks
  • The hierarchical embedding structure provides interpretable failure modes: an unknown animal will cluster near "animal" rather than "electronics," giving downstream reasoning systems a meaningful semantic neighborhood for the detected object
Abstract
Recognizing unknown objects is crucial for safety-critical applications such as autonomous driving and robotics. Open-Set Panoptic Segmentation (OPS) aims to segment known thing and stuff classes while identifying valid unknown objects as separate instances. Prior OPS approaches largely treat known categories as a flat label set, ignoring the semantic hierarchy that provides valuable structural priors for distinguishing unknown objects from in-distribution classes. In this work, we propose Hyp2Former, an end-to-end framework for OPS that does not require explicit modeling of unknowns during training, and instead learns hierarchical semantic similarities continuously in hyperbolic space.
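To make the hyperbolic-embedding idea concrete, the sketch below computes the geodesic distance in the Poincaré ball, one standard model of hyperbolic space. Which model Hyp2Former uses is not stated in the excerpt above, so the choice of the Poincaré ball, the function name, and the sample points are assumptions.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball."""
    norm_u = sum(x * x for x in u)
    norm_v = sum(x * x for x in v)
    if norm_u >= 1.0 or norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    diff = sum((x - y) ** 2 for x, y in zip(u, v))
    return math.acosh(1.0 + 2.0 * diff / ((1.0 - norm_u) * (1.0 - norm_v)))

# Hypothetical layout: a general concept near the origin, specific ones near the boundary.
root = (0.0, 0.0)
leaf_a = (0.9, 0.0)
leaf_b = (0.0, 0.9)
d_root_leaf = poincare_distance(root, leaf_a)
d_leaf_leaf = poincare_distance(leaf_a, leaf_b)
```

Distances grow rapidly toward the boundary, so two specific concepts placed there are far from each other while each remains comparatively close to a shared general concept near the origin. That geometry is what lets tree-like category hierarchies embed with low distortion.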
17 h=6
2026-05-04 cs.RO cs.CV Médéric Fourmy · h=6
Kateryna Zorina, Vojtech Priban, Médéric Fourmy, Josef Sivic, Vladimir Petrik
Core Contributions
  • Addresses the critical gap between single-frame pose estimation accuracy and the temporal consistency required for stable robot feedback control — off-the-shelf pose estimators produce frame-to-frame jitter that destabilizes controllers
  • Develops a factor graph approach incorporating object motion models, explicit pose measurement uncertainty estimation, and online optimization-based smoothing with outlier rejection
  • Significantly improves results on standardized pose estimation benchmarks while enabling stable visual servoing on a torque-controlled manipulator — bridging the perception-to-control gap
  • The explicit uncertainty estimation allows the controller to adapt its response based on pose estimate confidence, rather than treating all estimates as equally reliable
Abstract
Single-view RGB object pose estimators have reached a level of precision and efficiency that makes them good candidates for vision-based robot control. However, off-the-shelf methods lack temporal consistency and robustness that are mandatory for a stable feedback control. In this work, we develop a factor graph approach to enforce temporal consistency of the object pose estimates. In particular, the proposed approach: (i) incorporates object motion models, (ii) explicitly estimates the object pose measurement uncertainty, and (iii) integrates the above two components in an online optimization-based estimator.
18 h=6
2026-05-04 cs.RO cs.AI Shizhao Sun · h=6
Meisheng Zhang, Shizhao Sun, Yang Zhao, Ziyuan Liu, Zhijun Gao
Core Contributions
  • Shifts indoor scene synthesis from object-centric to zone-graph orchestration — translating high-level semantic intent into functional zones and topological constraints that handle non-convex rooms where prior methods fail
  • Constructs Zone-Scene-10K, a large-scale dataset with explicit zone-graph annotations, and releases SCALE, a stress-test benchmark specifically for irregular indoor scenarios with complex spatial relations
  • An Alternating Alignment Strategy cycles between reasoning internalization and Zone-Aware Group Relative Policy Optimization, reconciling semantic richness with geometric validity without external physics engines
  • Resolves the density-safety dichotomy: places furniture densely enough to be functional while avoiding physically invalid configurations, a tradeoff that object-level generators consistently mishandle in irregular rooms
Abstract
Autonomous 3D indoor scene synthesis breaks down in non-convex rooms with tightly coupled spatial constraints. Data-driven generators lack topological priors for long-horizon planning, while iterative agents fragment semantics and become geometrically brittle. We present ZoneMaestro, a unified framework that shifts the paradigm from object-centric synthesis to Zone-Graph Orchestration.
Human-Robot Interaction & Assistive Robotics
8 h=19
2026-05-04 cs.HC cs.RO J. Gerken · h=19
Ali Askari, Jens Gerken
Core Contributions
  • Proposes a neurobiology-inspired multi-model architecture that decomposes affective touch into distinct specialized subtask models, treating it as a distributed closed-loop perceptual task rather than a monolithic motoric movement
  • Introduces a peer-to-peer, state-sharing framework designed to overcome the "haptic uncanny valley" — the phenomenon where near-human but imperfect touch feels more disturbing than clearly robotic touch
  • Outlines a Sim-to-Real pipeline for affective touch that enables haptics, AI, and robotics researchers to contribute independently yet coherently to the same system
  • Position paper that provides a structured roadmap for a largely underexplored area — while robotic grasping and dexterity have advanced rapidly, social touch capabilities remain primitive
Abstract
Despite the advancement in robotic grasping and dexterity through haptic information, affective social touch, such as handshaking or reassuring stroking, remains a major challenge in Human-Robot-Interaction. This position paper examines current progress and limitations across artificial intelligence, haptics and robotics research, and proposes a novel multi-model architecture to address these gaps.
9 h=18
2026-05-04 cs.RO nlin.AO D. Dotov · h=18
Jingxian Gu, Joanna Spyra, Andrew Walski, Lyla Elsaesser, Samuel Bierner
Core Contributions
  • Designs a biologically inspired tensegrity crutch using a pre-stressed self-tensile two-cell structure that provides compliance without compromising stability — unlike existing spring-loaded designs that reduce perceived stability
  • Human trials (N=18) show the tensegrity design improves effort, comfort, pain, and usability outcomes relative to rigid crutches, whereas spring-loaded crutches reduce perceived stability and walking speed
  • Achieves favorable nonlinear stiffness, ground-following, and force feedback through the tensegrity structure — properties that emerge from the pre-stressed geometry rather than requiring active control
  • Addresses a significant accessibility problem: six million people in the US use crutches, and rigid designs can cause secondary injuries to the upper joints; presented as the first tensegrity-based crutch validated with human participants
Abstract
Purpose: Six million people use crutches as mobile aids in the US. Rigid designs with no axial mobility limit sensory feedback and lead to secondary injury on the upper joints. Spring-loaded designs offer compliance but may compromise stability. We designed a biologically inspired tensegrity crutch with a compliant module aiming to achieve favorable mechanical properties.
12 h=13
2026-05-04 cs.RO S. Tortora · h=13
Edoardo Trombin, Miroljub Mihailovic, Matheus Henrique Ferreira Moura, Luca Tonin, Emanuele Menegatti
Core Contributions
  • Proposes a Kernelized Movement Primitives framework for adaptive gait generation across multiple indoor terrains (flat, slopes, stairs, obstacles) — current exoskeletons are limited to flat, even surfaces
  • Learns a probabilistic gait representation in both joint and task spaces from a limited number of human demonstrations, adapted in real-time using via-points from onboard RGB-D environmental sensing
  • Formulates adaptive gait generation as a linearly constrained optimization problem, ensuring kinematic feasibility while adapting to terrain detected by the onboard camera
  • Validated on a commercial lower limb exoskeleton in real-world scenarios including stair climbing and obstacle crossing — demonstrating feasibility of environment-aware gait planning for assistive robotics
Abstract
Lower limb exoskeletons (LLEs) present the potential to make motor-impaired individuals walk again. Their application in real-world environments is still limited by the lack of effective adaptive gait planning. Indeed, current exoskeletons are meant to walk only on a flat and even terrain. Generating environment-aware, physiologically consistent gait trajectories in real-time is an open challenge. To overcome this, we propose a novel Kernelized Movement Primitives (KMP)-based framework for adaptive gait generation (AGG) across multiple indoor terrains.
21 h=6
2026-05-04 cs.RO cs.HC Yupu Lu · h=6
Sihan Chen, Hang Xu, Yupu Lu, Chen Wang, Benfang Duan
Core Contributions
  • Addresses the mutual understanding gap in shared autonomy: while most research focuses on robots inferring human intent, this work enables humans to understand the robot's intent through impedance-based physical communication
  • Adaptively modulates the robot's dynamic response to human input via an anisotropic guidance field, providing continuous, physically grounded communication without requiring additional visual or auditory interfaces
  • User studies across three scenarios and two teleoperation interfaces show improvements in task performance, human-robot agreement, and subjective experience
  • Inspired by impedance control principles, the approach is more intuitive than prior explicit intent communication methods (screen overlays, haptic displays) because the communication channel is embedded in the task interaction itself
Abstract
Shared autonomy (SA) enables robots to infer human intent and assist in its achievement. While most research focuses on improving intent inference, it overlooks whether humans can understand the robot's intent in return. Without such mutual understanding, collaboration becomes less effective, degrading user experience and task performance. Inspired by impedance control, we propose Impedance-Driven Anisotropic Guidance Field Enhanced Shared Autonomy (IAGF-SA), a novel paradigm that extends SA with an embodied, physically-grounded communication channel.
Robot Learning & Multi-Robot Systems
10 h=14
2026-05-04 cs.RO John Tadrous · h=14
John Tadrous
Core Contributions
  • Derives per-node and network-wide Age of Information (AoI) lower bounds that cleanly decompose into a sensing term (mean group sensing times) and a propagation term (shortest-path distances) — providing theoretical foundations for multi-robot monitoring system design
  • Shows that minimizing the sensing component yields a separable, discretely convex resource allocation problem that a greedy water-filling algorithm solves optimally, avoiding combinatorial explosion
  • Constructs a shortest-path-tree conveyor architecture with Euler-walk deployment that provably attains the AoI lower bound in the full-conveyor regime
  • Captures both stochastic parallel sensing delays and hop-based propagation in a unified framework — measuring AoI from sensing start rather than just transport, providing a more realistic model for monitoring applications
Abstract
A team of mobile robots monitors spatially distributed processes and delivers measurements to a base, where AoI is measured from sensing start, capturing both stochastic parallel sensing delays and hop-based propagation. At each non-base node, multiple robots may collaborate, yielding node-dependent geometric group sensing times, while other robots act as mobile conveyors that transport samples along unit-time edges.
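The greedy allocation for a separable, discretely convex objective can be sketched as follows: start with one robot per node, then repeatedly give the next robot to the node where it reduces the mean sensing time the most. The sensing model here is an assumption made for illustration, not taken from the paper: each robot at node i succeeds independently with a hypothetical per-slot probability `p_i`, so a group of k robots has a geometric sensing time with mean 1 / (1 - (1 - p_i)^k).

```python
import heapq

def mean_sensing_time(p, k):
    """Mean of a geometric time when k robots each succeed with probability p (0 < p <= 1)."""
    return 1.0 / (1.0 - (1.0 - p) ** k)

def greedy_allocate(probs, total_robots):
    """Assign robots to nodes by largest marginal reduction in mean sensing time."""
    n = len(probs)
    if total_robots < n:
        raise ValueError("need at least one robot per node")
    alloc = [1] * n
    # Max-heap (via negated gains) of the benefit of adding one more robot to each node.
    heap = [(-(mean_sensing_time(p, 1) - mean_sensing_time(p, 2)), i)
            for i, p in enumerate(probs)]
    heapq.heapify(heap)
    for _ in range(total_robots - n):
        _, i = heapq.heappop(heap)
        alloc[i] += 1
        k = alloc[i]
        gain = mean_sensing_time(probs[i], k) - mean_sensing_time(probs[i], k + 1)
        heapq.heappush(heap, (-gain, i))
    return alloc

# Hypothetical instance: one hard-to-sense node and one easy node, five robots.
allocation = greedy_allocate([0.2, 0.9], 5)
```

Greedy selection is optimal here because each node's marginal gain shrinks as its group grows, so the largest remaining gain is always the best next choice. Under this assumed model the hard node receives most of the robots.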
19 h=6
2026-05-04 cs.RO S. Aravecchia · h=6
Luis F. W. Batista, Stéphanie Aravecchia, Cédric Pradalier
Core Contributions
  • Presents a complete field-validated system combining polarimetric camera perception with DRL-based control for autonomous floating-waste detection and capture — deployed on a retrofitted ASV platform
  • Introduces a systematic sim-to-real testing methodology with a two-stage simulation protocol and perception abstraction module that mimics real camera behavior, enabling reproducible field trials
  • Evaluates robustness across 14 disturbance regimes in matched simulation and field experiments, identifying actuation-model fidelity as the primary source of degradation — not perception or control policy
  • Demonstrates search-and-capture over areas up to 450 m² with centimeter-level terminal accuracy, distilling practical lessons including the importance of latency and timestamp management across modules
Abstract
Autonomous surface vessels for floating-waste removal operate under varying hydrodynamics, external disturbances, and challenging water-surface perception. We present a field-validated system that combines camera-based polarimetric perception with a lightweight DRL-based controller for floating-waste detection and capture.
25 h=4
2026-05-04 cs.LG cs.AI cs.RO O. Beyan · h=4
Lingxiao Kong, Cong Yang, Oya Deniz Beyan, Zeyd Boukhers
Core Contributions
  • First systematic quantitative decomposition of how specific RL algorithm and hyperparameter configurations contribute to the generalization gap — using Shapley values to move beyond aggregate performance metrics
  • Establishes a theoretical foundation connecting Shapley values to RL generalizability, then uses SHAP-guided configuration selection to improve cross-environment transfer
  • Reveals consistent configuration impact patterns across diverse robotic tasks and environments, suggesting that SHAP insights transfer and can serve as practical guidance for RL practitioners
  • Provides actionable recommendations: rather than extensive hyperparameter sweeps per environment, practitioners can use SHAP rankings from reference environments to select configurations likely to generalize
Abstract
Despite significant advances in Reinforcement Learning (RL), model performance remains highly sensitive to algorithm and hyperparameter configurations, while generalization gaps across environments complicate real-world deployment. Although prior work has studied RL generalization, the relative contribution of specific configurations to the generalization gap has not been quantitatively decomposed and systematically leveraged for configuration selection.
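As a reminder of what a Shapley decomposition computes, the sketch below enumerates every ordering of a small set of configuration factors and averages each factor's marginal contribution. It is exact enumeration, which only scales to a handful of factors, and is not the SHAP approximation or the paper's pipeline. The factor names and the toy `gap_reduction` value function are hypothetical.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values by averaging marginal contributions over all orderings."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        previous = value(coalition)
        for p in order:
            coalition = coalition | {p}
            current = value(coalition)
            totals[p] += current - previous
            previous = current
    return {p: t / len(orders) for p, t in totals.items()}

def gap_reduction(coalition):
    """Toy value: reduction in generalization gap from tuning a set of factors."""
    score = 0.0
    if "algorithm" in coalition:
        score += 0.3
    if "learning_rate" in coalition:
        score += 0.1
    if {"algorithm", "learning_rate"} <= coalition:
        score += 0.2  # interaction term, split equally between the two factors
    return score

factors = ["algorithm", "learning_rate", "batch_size"]
attribution = shapley_values(factors, gap_reduction)
```

The attributions sum to the value of the full configuration, which is what allows a generalization gap to be split into per-factor contributions; a factor that never changes the value, like `batch_size` in this toy example, receives zero.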
30 h=2
2026-05-04 cs.RO math.AG Georg Nawratil · h=2
Yudi Zhao, Georg Nawratil
Core Contributions
  • Investigates singular configurations of planar 3-RPR parallel manipulators arising from averaging solution pairs of the direct kinematic problem — a mathematical approach to understanding manipulator flexibility
  • Parametrizes input pairs and determines their relative orientation to increase the flexion order of averaged configurations without computing zeros of the degree-6 forward kinematics polynomial
  • The methodology extends to spherical and spatial analogues of planar 3-RPR manipulators, providing a unified algebraic-geometric framework for studying parallel mechanism singularities
Abstract
This paper investigates singular configurations of planar 3-RPR parallel manipulators, which result from applying the averaging technique to solution pairs of their direct kinematic problem. Without computing the zeros of the corresponding degree 6 polynomial we parametrize the input pairs and determine their relative orientation in a way that the flexion order of the averaged configurations increases. Moreover, the obtained results are visualized for concrete examples.