🤖 Robotics arXiv Digest

Wednesday, May 6, 2026

📄 27 papers 📂 7 research areas Generated by Claude

🔭 Research Landscape

Today's batch of 27 papers reveals a field converging on two complementary fronts: closing the sim-to-real transfer gap and making learned policies robust enough for deployment. The autonomous driving cluster is particularly striking: five papers collectively cover the post-training pipeline from scenario generation (Conditional Flow-VAE) through closed-loop fine-tuning (CRAFT, ReflectDrive-2) to driver behavior modeling (Driver-WM) and validation (Practical validation of synthetic pre-crash scenarios). What unites most of them is a shared dissatisfaction with open-loop evaluation: the papers introduce mechanisms to stress-test or improve policies under closed-loop feedback, whether through distribution-matched safety-critical rollouts, RL-aligned self-editing of discrete trajectory tokens, or Bayesian equivalence testing of synthetic scenarios.

A second dominant theme is the maturation of model-based RL and planning for contact-rich, long-horizon tasks. ELVIS tackles the brittleness of deep imagination with ensemble-calibrated uncertainty gating, HDFlow separates strategic subgoal diffusion from fast flow-based trajectory generation, and Dream-MPC rehabilitates gradient-based planning with amortized optimization. Meanwhile, the tactile manipulation papers (Reduced-order Neural Modeling, From Reach to Insert, Active Contact Sensing) demonstrate that touch-aware control is moving beyond proof-of-concept: sub-millimeter insertion at 0.05 mm clearance, neural tactile simulation at 65% speedup, and 97.5% handover success via active perturbation all point toward industrial-grade tactile capability.

Cross-cutting these themes is a growing emphasis on computational efficiency under resource constraints. ConsisVLA-4D reports up to 2.4× inference speedup over OpenVLA through consistency-based 3D reasoning without additional sensors, the dual-barrier CBF safety filter runs on a Raspberry Pi via closed-form linear solves, and the cascaded-fidelity MPC for bipedal walking mixes model fidelities across the prediction horizon to hit real-time rates. The message is clear: the field is moving past "does it work in simulation?" toward "does it run fast enough, safely enough, on real hardware?"

Autonomous Driving & Safety

Scenario generation, post-training for driving policies, driver modeling, and safety validation.

#1 Conditional Flow-VAE for Safety-Critical Traffic Scenario Ge...
#6 ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing ...
#20 CRAFT: Counterfactual-to-Interactive Reinforcement Fine-Tuni...
#13 Driver-WM: A Driver-Centric Traffic-Conditioned Latent World...
#27 Practical validation of synthetic pre-crash scenarios

VLA Models & Robot Learning from Demonstrations

Vision-language-action architectures, latent action supervision, and offline-to-online policy learning.

#5 ConsisVLA-4D: Advancing Spatiotemporal Consistency in Effici...
#15 From Pixels to Tokens: A Systematic Study of Latent Action S...
#7 When Life Gives You BC, Make Q-functions: Extracting Q-value...

Model-Based RL & Long-Horizon Planning

Latent imagination, hierarchical diffusion-flow planning, gradient-based MPC, and multi-agent learning.

#3 ELVIS: Ensemble-Calibrated Latent Imagination for Long-Horiz...
#2 Modular Reinforcement Learning For Cooperative Swarms
#19 HDFlow: Hierarchical Diffusion-Flow Planning for Long-horizo...
#21 Dream-MPC: Gradient-Based Model Predictive Control with Late...

Tactile & Contact-Rich Manipulation

High-fidelity tactile simulation, precision assembly under sub-mm tolerances, and active handover sensing.

#4 Reduced-order Neural Modeling with Differentiable Simulation...
#10 From Reach to Insert: Tactile-Augmented Precision Assembly u...
#26 Active Contact Sensing for Robust Robot-to-Human Object Hand...

Navigation, SLAM & Localization

Radar SLAM, underwater navigation, safety-critical control on occupancy maps, and space rendezvous.

#9 Dr-PoGO: Direct Radar Pose-Graph Optimization
#17 AI-Aided Advancements in Autonomous Underwater Vehicle Navig...
#24 A Closed-Form Dual-Barrier CBF Safety Filter for Holonomic R...
#22 Tightly-Coupled Estimation and Guidance for Robust Low-Thrus...

Control, Calibration & Dynamics

Koopman operators, hand-eye calibration, bipedal walking MPC, and agile robot locomotion.

#11 Koopman Identification of Nonlinear Systems via Reservoir Li...
#14 Optimal Uncertainty-Aware Calibration for the AX=YB Problem
#16 Right Model, Right Time: Real-Time Cascaded-Fidelity MPC for...
#8 LineRides: Line-Guided Reinforcement Learning for Bicycle Ro...

Hardware, Surgical Robotics & HRI

Self-folding robots, autonomous laparoscope control, gaze estimation benchmarks, and embodied AI privacy.

#12 3D Printing of Passively Actuated Self-Folding Robots with I...
#23 Autonomous Laparoscope Control through Unified Mechanics-Bas...
#25 Gaze4HRI: Zero-shot Benchmarking Gaze Estimation Neural-Netw...
#18 Position: Embodied AI Requires a Privacy-Utility Trade-off

Autonomous Driving & Safety

#1 h=116

Conditional Flow-VAE for Safety-Critical Traffic Scenario Generation

2026-05-06 cs.RO, cs.LG R. Urtasun (h=116)

Zimu Gong, Brian Zhaoning Zhang, Chris Zhang, Kelvin Wong, Raquel Urtasun

Core Contributions

Introduces conditional latent flow matching that transforms nominal driving scenes into safety-critical rollouts via distribution matching, avoiding the unrealistic adversarial behaviors produced by standard optimization-based methods
Combines both simulation and real-world data within the same generative framework, enabling the model to capture diverse failure modes that either source alone would miss
Generates novel safety-critical scenarios more consistently and realistically than prior methods in the reported experiments, keeping trajectories plausible while still covering rare collision-relevant configurations
Provides a direct training and benchmarking pipeline for AV systems: generated scenarios can stress-test perception and planning stacks at scale without manual scenario authoring

Abstract

Safety-critical scenarios are essential for the development of autonomous vehicles (AVs) but are rare in real-world driving data. While simulation offers a way to generate such scenarios, manually designed test cases lack scalability, and adversarial optimization often produces unrealistic behaviors. In this work, we introduce a conditional latent flow matching approach for scalable and realistic safety-critical scenario generation. Our method uses distribution matching to transform nominal scenes into safety-critical rollouts. Furthermore, we demonstrate that incorporating both simulation and real-world data enables our framework to efficiently generate diverse, data-driven scenarios. Experimental results highlight that our approach is able to more consistently and realistically generate novel safety-critical scenarios, making it a valuable tool for training and benchmarking AV systems.

#6 h=16

ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving

2026-05-06 cs.RO Kun Zhan (h=16)

Huimin Wang, Yue Wang, Bihao Cui, Pengxiang Li, Ben Lu

Core Contributions

Represents driving plans as discrete trajectory tokens generated via masked parallel decoding, enabling in-place token revision (AutoEdit) without an auxiliary refinement network — the same model both drafts and edits
Full-rollout RL is the key enabler: under supervised training alone, AutoEdit improves PDMS by at most 0.3, but RL training across the complete decision–draft–reflect pipeline increases the gain to 1.9, indicating that coupling drafting and editing requires end-to-end credit assignment
Achieves 91.0 PDMS with camera-only input on NAVSIM and 94.8 PDMS in a best-of-6 oracle setting, while running at 31.8 ms average latency on NVIDIA Thor, which supports real-time use
Co-designs an efficient reflective decoding stack combining shared-prefix KV reuse, alternating step decode, and fused on-device unmasking to minimize the computational overhead of iterative refinement

Abstract

We introduce ReflectDrive-2, a masked discrete diffusion planner with separate action expert for autonomous driving that represents plans as discrete trajectory tokens and generates them through parallel masked decoding. This discrete token space enables in-place trajectory revision: AutoEdit rewrites selected tokens using the same model, without requiring an auxiliary refinement network. To train this capability, we use a two-stage procedure. First, we construct structure-aware perturbations of expert trajectories along longitudinal progress and lateral heading directions and supervise the model to recover the original expert trajectory. We then fine-tune the full decision–draft–reflect rollout with reinforcement learning (RL), assigning terminal driving reward to the final post-edit trajectory and propagating policy-gradient credit through full-rollout transitions. Full-rollout RL proves crucial for coupling drafting and editing: under supervised training alone, inference-time AutoEdit improves PDMS by at most 0.3, whereas RL increases its gain to 1.9. We also co-design an efficient reflective decoding stack for the decision–draft–reflect pipeline, combining shared-prefix KV reuse, Alternating Step Decode, and fused on-device unmasking. On NAVSIM, ReflectDrive-2 achieves 91.0 PDMS with camera-only input and 94.8 PDMS in a best-of-6 oracle setting, while running at 31.8 ms average latency on NVIDIA Thor.

#20 h=5

CRAFT: Counterfactual-to-Interactive Reinforcement Fine-Tuning for Driving Policies

2026-05-06 cs.LG, cs.RO Wenchao Sun (h=5)

Keyu Chen, Nanfei Ye, Yida Wang, Wenchao Sun, Danqi Zhao

Core Contributions

Formulates closed-loop post-training as proxy-residual optimization: dense counterfactual advantages serve as a proxy for closed-loop advantages, with grounded residual correction from interaction-critical events reducing proxy bias
Achieves strongest closed-loop gains on Bench2Drive across three architecture families (hierarchical planning, VLA, and vocabulary-scoring), demonstrating architecture-agnostic applicability
Uses asymmetric KL self-distillation against an EMA teacher to stabilize online adaptation, preventing the catastrophic forgetting that plagues standard on-policy fine-tuning of pretrained driving models
Theoretically decomposes the real policy gradient into proxy and residual terms under the same visited-state distribution, providing formal justification for the complementary roles of counterfactual and interactive feedback

Abstract

Open-loop imitation learning has advanced modern autonomous driving policy architectures, but closed-loop deployment remains vulnerable to policy-induced distribution shift. Existing post-training paradigms exhibit fundamental trade-offs: closed-loop RL fine-tuning provides grounded feedback from executed actions but is constrained by the sparsity of informative events, whereas counterfactual fine-tuning provides dense supervision over candidate futures but inherits bias from imperfect future estimates. We introduce Counterfactual-to-Interactive Reinforcement Fine-Tuning (CRAFT), an on-policy framework that formulates closed-loop post-training as proxy-residual optimization. CRAFT uses group-normalized counterfactual advantages as a dense proxy for real closed-loop advantages and aligns this proxy with the closed-loop world through grounded residual correction from interaction-critical events. To stabilize adaptation, CRAFT regularizes the online policy toward an EMA teacher via asymmetric KL self-distillation. Theoretically, CRAFT decomposes the real closed-loop policy gradient into proxy and residual terms under the same visited-state distribution, reducing residual variance with an aligned proxy while mitigating proxy bias through grounded residual approximation. Empirically, CRAFT achieves the strongest closed-loop gains on Bench2Drive across hierarchical planning, vision-language-action, and vocabulary-scoring architectures. Ablations, scaling behavior, stability analyses, and transfer results further validate the complementary roles of dense counterfactual proxy and grounded residual correction. Project page: https://currychen77.github.io/CRAFT.
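The proxy-residual idea in this abstract can be illustrated with a toy calculation. The sketch below is not CRAFT's implementation: the function names, scores, and residual value are hypothetical, and it only shows group normalization of counterfactual scores followed by the identity "real advantage = proxy + (real − proxy)" applied where a grounded residual is available.

```python
from statistics import mean, pstdev


def group_normalize(scores, eps=1e-8):
    """Center and scale scores within one group of candidate futures."""
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]


def corrected_advantages(proxy, residual_estimates):
    """Add a grounded residual (real - proxy) where one is available.

    residual_estimates maps candidate index -> estimated residual.
    Candidates without an interaction-critical event keep the proxy value.
    """
    return [p + residual_estimates.get(i, 0.0) for i, p in enumerate(proxy)]


# Hypothetical counterfactual scores for four candidate trajectories.
counterfactual_scores = [0.2, 0.5, 0.9, 0.4]
proxy = group_normalize(counterfactual_scores)

# Hypothetical: candidate 2 looked best to the proxy but led to a
# near-collision in closed loop, so its residual pulls the estimate down.
advantages = corrected_advantages(proxy, {2: -1.5})
```

The dense proxy gives every candidate a learning signal, while the sparse residual only touches the candidates for which closed-loop evidence exists.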

#13 h=7

Driver-WM: A Driver-Centric Traffic-Conditioned Latent World Model for In-Cabin Dynamics Rollout

2026-05-06 cs.RO, cs.AI, cs.CV Haoruo Zhang (h=7)

Haozhuang Chi, Daosheng Qiu, Hao Su, Haochen Liu, Zirui Li

Core Contributions

Introduces a driver-centric latent world model that rolls out in-cabin driver dynamics causally conditioned on external traffic context, bridging the gap between environment forecasting and driver behavior prediction for L2/L3 shared control
Uses a gated causal injection mechanism with learned vector gates to modulate how external traffic perturbations influence internal driver state predictions while enforcing temporal causality
Operates in a compact latent space from frozen vision-language features via a dual-stream architecture, keeping inference lightweight while unifying physical kinematics forecasting with behavioral and emotional recognition
Enables controlled test-time interventions: the explicit external-to-internal conditioning allows systematic analysis of how traffic context changes affect predicted driver reactions

Abstract

Safe L2/L3 driving automation requires anticipating human-in-the-loop reactions during shared-control transitions. While most driving world models forecast the external environment, in-cabin intelligence remains strictly recognition-oriented and lacks multi-step rollout capabilities for driver dynamics. We introduce Driver-WM, a driver-centric latent world model that rolls out in-cabin dynamics causally conditioned on out-cabin traffic context. This formulation unifies physical kinematics forecasting with auxiliary behavioral and emotional semantic recognition. Operating in a compact latent space constructed from frozen vision-language features, Driver-WM adopts a dual-stream architecture to separately encode external traffic and internal driver states. These streams are directionally coupled via a gated causal injection mechanism, which uses a learned vector gate to modulate external contextual perturbations while strictly enforcing temporal causality. Evaluations on a multi-task assistive driving benchmark demonstrate that Driver-WM yields robust long-horizon geometric forecasting for reactive high-motion maneuvers and improves semantic alignment for both driver and traffic states. Finally, the explicit external-to-internal conditioning allows for controlled test-time interventions to systematically analyze mechanism responses.

#27 h=3

Practical validation of synthetic pre-crash scenarios

2026-05-06 cs.RO Carol A. C. Flannagan (h=3)

Jian Wu, Ulrich Sander, Carol Flannagan, Jonas Bärgman

Core Contributions

Extends Bayesian ROPE-based equivalence testing to validate whether synthetic pre-crash scenarios are practically equivalent to real-world data for AV safety assessment, going beyond conventional significance testing
Introduces two binning-based statistics that measure practically meaningful distributional differences in the context of safety impact assessment, rather than arbitrary statistical distances
Demonstrates the framework on rear-end pre-crash datasets for AEB system evaluation, providing both quantitative equivalence scores and diagnostic insights into where synthetic and real datasets diverge
The generic framework extends beyond rear-end scenarios to broader validation contexts, offering a principled basis for trusting synthetic data in safety-critical AV development

Abstract

The representativeness of synthetic pre-crash scenarios is crucial for assessing the safety impact of Driving Automation Systems through virtual simulations. However, a gap remains in the robust evaluation of synthetic pre-crash scenarios' practical equivalence to their real-world counterparts; that is, whether they are similar enough for the intended assessment purpose. Conventional significance testing is inadequate, as it focuses on detecting differences rather than establishing practical equivalence. This study addresses the research gap by extending our previous work on a Bayesian Region of Practical Equivalence (ROPE)-based equivalence testing framework by introducing a binning-based approach to define appropriate statistics and equivalence criteria. Two binning-based statistics are proposed to measure practically meaningful distributional differences between datasets in the context of safety impact assessment. The framework's applicability is demonstrated through a case study, which tests the practical equivalence of two synthetic rear-end pre-crash datasets with a previously developed reference dataset in the context of the safety impact assessment of an Automatic Emergency Braking system. The results show that the framework provides informative quantitative assessments of practical equivalence as well as diagnostic insights into the divergence of datasets. Although the demonstration focuses on rear-end pre-crash scenarios, the framework is generic and extensible to broader validation contexts, providing an interpretable and principled basis for practical equivalence assessment across diverse synthetic data applications.
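For readers unfamiliar with ROPE-based equivalence testing, the following is a minimal generic sketch. It does not reproduce the paper's binning-based statistics: the counts, the ROPE limits of ±0.10, the 0.95 threshold, and all function names are assumptions chosen for illustration. It compares the share of cases falling into one bin for a synthetic and a reference dataset.

```python
import random


def rope_probability(posterior_samples, rope_low, rope_high):
    """Fraction of posterior samples that fall inside the ROPE."""
    inside = sum(1 for s in posterior_samples if rope_low <= s <= rope_high)
    return inside / len(posterior_samples)


def rope_decision(prob_inside, threshold=0.95):
    """Declare practical equivalence only when enough mass lies in the ROPE."""
    return "equivalent" if prob_inside >= threshold else "undecided"


# Hypothetical counts: cases landing in one bin, out of 200 per dataset.
rng = random.Random(0)
synthetic_hits, synthetic_n = 48, 200
reference_hits, reference_n = 50, 200

# Beta(1 + hits, 1 + misses) posteriors under a uniform prior;
# each sample is one draw of the difference in bin proportions.
diff_samples = [
    rng.betavariate(1 + synthetic_hits, 1 + synthetic_n - synthetic_hits)
    - rng.betavariate(1 + reference_hits, 1 + reference_n - reference_hits)
    for _ in range(5000)
]

prob = rope_probability(diff_samples, -0.10, 0.10)
print(rope_decision(prob), round(prob, 3))
```

Unlike a significance test, a large share of posterior mass inside the ROPE is positive evidence of similarity. A full framework would also distinguish "not equivalent" from "undecided".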

VLA Models & Robot Learning from Demonstrations

#5 h=17

ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation

2026-05-06 cs.RO Liqiang Nie (h=17)

Wei Li, Jizhihui Liu, Li Yixing, Junwen Tong, Rui Shao

Core Contributions

Introduces three complementary consistency modules — CV-Aligner for cross-view semantic consistency, CO-Fuser for cross-object spatial geometric consistency, and CS-Thinker for cross-scene temporal consistency — forming a unified spatiotemporal reasoning framework without additional sensors
CV-Aligner filters instruction-relevant regions and aligns object identities across viewpoints, eliminating the ambiguity that degrades multi-camera VLA performance on cluttered scenes
Achieves 21.6% and 41.5% performance improvements over OpenVLA on the LIBERO benchmark and real-world platforms, respectively, with 2.3× and 2.4× inference speedups, indicating that the consistency design improves both accuracy and efficiency
CO-Fuser resolves spatial relation ambiguities between objects across views using compact latent representations rather than explicit depth sensors, keeping the architecture lightweight

Abstract

Current Vision-Language-Action (VLA) models primarily focus on mapping 2D observations to actions, but exhibit notable limitations in spatiotemporal perception and reasoning: 1) spatial representations often rely on additional sensors, introducing substantial computational overhead; 2) visual reasoning is typically limited to future-frame prediction, lacking alignment with the instruction-grounded scene and thus compromising spatiotemporal consistency. To address these challenges, we propose ConsisVLA-4D, a unified and efficient framework that enhances spatiotemporal consistency in 3D perception and 4D reasoning. Specifically, we design: 1) CV-Aligner, which ensures cross-view object semantic consistency by filtering instruction-relevant regions and aligning object identities across multiple viewpoints; 2) CO-Fuser, which guarantees cross-object spatial geometric consistency by eliminating spatial relation ambiguities between objects across views using compact latent representations. Building upon these, we introduce 3) CS-Thinker to achieve cross-scene spatiotemporal consistency as actions unfold. It learns implicit knowledge of local dynamics from object-semantic tokens of CV-Aligner and global depth from geometric tokens of CO-Fuser, thereby enhancing efficient visual reasoning under scene variations. Extensive experiments demonstrate that, benefiting from its efficient spatiotemporal consistency design, ConsisVLA-4D achieves 21.6% and 41.5% performance improvements, along with 2.3-fold and 2.4-fold inference speedups compared to OpenVLA on the LIBERO benchmark and real-world platforms, respectively. ConsisVLA-4D is open-sourced and publicly available at

#15 h=7

From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models

2026-05-06 cs.RO, cs.CV Haoyang Li (h=7)

Yihan Lin, Haoyang Li, Yang Li, Haitao Shen, Yihan Zhao

Core Contributions

Provides the first structured comparison of latent action supervision strategies for VLAs, decomposing the design space into image-based (trajectory regularization) and action-based (target space unification) approaches
Reveals a formulation-task correspondence: image-based latent actions benefit long-horizon reasoning and scene generalization, while action-based latent actions excel at complex motor coordination — explaining why no single approach dominates all benchmarks
Finds that directly supervising VLMs with discrete latent action tokens yields the strongest performance, suggesting that discretization preserves the VLM's language modeling strengths better than continuous action regression
Provides initial evidence that latent action supervision improves mixed-data training, pointing toward a scalable path for VLA pretraining across heterogeneous robot datasets

Abstract

Latent actions serve as an intermediate representation that enables consistent modeling of vision-language-action (VLA) models across heterogeneous datasets. However, approaches to supervising VLAs with latent actions are fragmented and lack a systematic comparison. This work structures the study of latent action supervision from two perspectives: (i) regularizing the trajectory via image-based latent actions, and (ii) unifying the target space with action-based latent actions. Under a unified VLA baseline, we instantiate and compare four representative integration strategies. Our results reveal a formulation-task correspondence: image-based latent actions benefit long-horizon reasoning and scene-level generalization, whereas action-based latent actions excel at complex motor coordination. Furthermore, we find that directly supervising the VLM with discrete latent action tokens yields the most effective performance. Finally, our experiments offer initial insights into the benefits of latent action supervision in mixed-data, suggesting a promising direction for VLA training. Code is available at https://github.com/RUCKBReasoning/From_Pixels_to_Tokens.

#7 h=14

When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning

2026-05-06 cs.RO, cs.AI Stephen Hart (h=14)

Lakshita Dodeja, Ondrej Biza, Shivam Vats, Stephen Hart, Stefanie Tellex

Core Contributions

Q-Estimation extracts a Q-function directly from a pre-trained BC policy using only a few environment interaction steps, bypassing the need for offline RL training on the demonstration dataset
Q-Gating dynamically switches between BC and RL actions based on their respective Q-values during data collection, preventing the policy replacement problem where online learning overwrites previously good behaviors
Achieves up to 100% success rate and 3.75× improvement over the original BC policy on contact-rich tasks like pipe assembly and kitting, with only 1–2 hours of on-robot interaction
Outperforms state-of-the-art offline-to-online baselines on both success rate and convergence time across D4RL and robomimic benchmarks, demonstrating that Q-value extraction from BC is a practical alternative to offline RL pretraining

Abstract

Behavior Cloning (BC) has emerged as a highly effective paradigm for robot learning. However, BC lacks a self-guided mechanism for online improvement after demonstrations have been collected. Existing offline-to-online learning methods often cause policies to replace previously learned good actions due to a distribution mismatch between offline data and online learning. In this work, we propose Q2RL, Q-Estimation and Q-Gating from BC for Reinforcement Learning, an algorithm for efficient offline-to-online learning. Our method consists of two parts: (1) Q-Estimation extracts a Q-function from a BC policy using a few interaction steps with the environment, followed by online RL with (2) Q-Gating, which switches between BC and RL policy actions based on their respective Q-values to collect samples for RL policy training. Across manipulation tasks from D4RL and robomimic benchmarks, Q2RL outperforms SOTA offline-to-online learning baselines on success rate and time to convergence. Q2RL is efficient enough to be applied in an on-robot RL setting, learning robust policies for contact-rich and high precision manipulation tasks such as pipe assembly and kitting, in 1-2 hours of online interaction, achieving success rates of up to 100% and up to 3.75x improvement against the original BC policy. Code and video are available at https://pages.rai-inst.com/q2rl_website/
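The Q-Gating step described here can be sketched in a few lines. This is a minimal illustration, not the Q2RL code: the 1-D task, both policies, and both Q-functions are hypothetical stand-ins, and the only point is the switch between BC and RL actions based on their respective Q-values.

```python
def q_gate(state, bc_policy, rl_policy, q_bc, q_rl):
    """Pick whichever policy's proposed action has the higher Q estimate."""
    a_bc = bc_policy(state)
    a_rl = rl_policy(state)
    if q_rl(state, a_rl) > q_bc(state, a_bc):
        return a_rl, "rl"
    return a_bc, "bc"


# Hypothetical 1-D task: the goal is at 0 and actions are displacements.
def bc_policy(s):
    return -0.5 * s  # cautious cloned behavior: move halfway to the goal


def rl_policy(s):
    return -1.0 * s  # learned policy: jump straight to the goal


def q_bc(s, a):
    return -abs(s + a)  # value = negative distance to the goal after acting


def q_rl(s, a):
    # Same value, but the RL critic is pessimistic far from the goal,
    # standing in for states it has not yet learned to handle.
    penalty = 3.0 if abs(s) > 2 else 0.0
    return -abs(s + a) - penalty


print(q_gate(1.0, bc_policy, rl_policy, q_bc, q_rl))  # RL action is used
print(q_gate(4.0, bc_policy, rl_policy, q_bc, q_rl))  # falls back to BC
```

Because the BC action remains available whenever its Q-value is higher, data collection does not discard behavior that already works while the RL policy is still learning.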

Model-Based RL & Long-Horizon Planning

#3 h=24

ELVIS: Ensemble-Calibrated Latent Imagination for Long-Horizon Visual MPC

2026-05-06 cs.LG, cs.RO, eess.SY R. Detry (h=24)

Yurui Du, Pinhao Song, Yutong Hu, Renaud Detry

Core Contributions

Replaces standard unimodal MPPI with Gaussian-mixture MPPI that maintains multiple coherent trajectory hypotheses, preventing the destructive mode-averaging that collapses planning quality under branching futures
Introduces a shared uncertainty-aware lambda-return where an ensemble of latent critics gates a time-varying lambda, adaptively switching between bootstrapping and model rollout to limit compounding error — a principled solution to the deep imagination brittleness problem
Unifies the actor-critic training objective with the MPC scoring function through a single UCB-based return, eliminating the common misalignment between learned policy priors and planner cost functions
Achieves state-of-the-art performance on 14 DeepMind Control Suite visual tasks compared with TD-MPC2 and DreamerV3, and transfers zero-shot to a real-world sand-spraying task with severe occlusions

Abstract

A central challenge of visual control with model-based reinforcement learning (RL) is reliable long-horizon planning: long rollouts with learned latent dynamics exhibit branching futures and multi-modal action-value distributions. In addition, compounding model errors amplified by visual occlusions make deep imagination brittle. We present ELVIS, a latent model predictive controller (MPC) designed to make long-horizon planning practical. ELVIS plans in a Dreamer-style recurrent state space model (RSSM) and replaces standard unimodal model predictive path integral (MPPI) with a Gaussian-mixture MPPI that maintains multiple coherent hypotheses over long horizons, avoiding mode averaging under branching rollouts. In parallel, ELVIS stabilizes deep imagination with a shared uncertainty-aware lambda-return: an ensemble of latent critics defines an upper-confidence-bound (UCB) score that gates a time-varying lambda, adaptively trading off bootstrapping versus look-ahead to limit compounding error during planning. The same return is used both to train an actor-critic prior from imagined rollouts and to score candidate trajectories inside GMM-MPPI, aligning RL objectives with the planner's long-horizon optimization. On fourteen DeepMind Control Suite visual tasks, ELVIS establishes state-of-the-art performance compared with TD-MPC2 and DreamerV3. Finally, ELVIS transfers zero-shot to a real-world sand-spraying task with severe occlusions, improving surface-quality metrics and demonstrating robustness beyond simulation.
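A lambda-return with a time-varying lambda can be written as a short backward recursion. The sketch below is an illustration under stated assumptions, not ELVIS itself: the paper gates lambda through a UCB score, whereas the gate here is a hypothetical stand-in that shrinks lambda when the critic ensemble disagrees, and all names and numbers are made up.

```python
from statistics import mean, pstdev


def gated_lambda_return(rewards, next_values, lambdas, gamma=0.99):
    """G_t = r_t + gamma * ((1 - lam_t) * V_{t+1} + lam_t * G_{t+1})."""
    g = next_values[-1]  # bootstrap beyond the horizon with the last value
    returns = []
    for r, v_next, lam in zip(
        reversed(rewards), reversed(next_values), reversed(lambdas)
    ):
        g = r + gamma * ((1.0 - lam) * v_next + lam * g)
        returns.append(g)
    return returns[::-1]


def lambda_from_disagreement(ensemble_values, lam_max=0.95, scale=1.0):
    """Hypothetical gate: shrink lambda as the critic ensemble disagrees."""
    spread = pstdev(ensemble_values)
    return lam_max * max(0.0, 1.0 - spread / scale)


# Hypothetical three-step imagined rollout with three critics per step.
rewards = [1.0, 1.0, 1.0]
ensemble_next_values = [[5.0, 5.1, 4.9], [4.0, 6.0, 2.0], [3.0, 3.0, 3.0]]

next_values = [mean(e) for e in ensemble_next_values]
lambdas = [lambda_from_disagreement(e) for e in ensemble_next_values]
returns = gated_lambda_return(rewards, next_values, lambdas)
```

With lambda near 1 the return trusts the imagined rollout. With lambda near 0 it cuts the rollout short and bootstraps from the critic, which is the lever for limiting compounding model error. The direction of the gate used here is an assumption.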

#2 h=43

Modular Reinforcement Learning For Cooperative Swarms

2026-05-06 cs.RO, cs.AI G. Kaminka (h=43)

Erel Shtossel, Gal A. Kaminka

Core Contributions

Proposes a decomposed state representation where each feature dimension is handled by a separate learning module, avoiding the combinatorial explosion of joint interaction states that overwhelms memory-limited swarm robots
The modular approach aggregates per-feature Q-values, sidestepping the need for robots to enumerate all possible neighbor configurations — a bottleneck in standard tabular and deep MARL for large swarms
Validated across multiple simulated foraging scenarios, showing that modular decomposition matches or exceeds monolithic policies while requiring significantly less memory per agent
Each robot learns independently using only local observations, preserving the decentralized execution constraint critical for real swarm deployments where global communication is impractical

Abstract

A cooperative robot swarm is a collective of computationally-limited robots that share a common goal. Each robot can only interact with a small subset of its peers, without knowing how this affects the collective utility. Recent advances in distributed multi-agent reinforcement learning have demonstrated that it is possible for robots to learn how to interact effectively with others, in a manner that is aligned with the common goal, despite each robot learning independently of others. However, this requires each robot to represent a potentially combinatorial number of interaction states, challenging the memory capabilities of the robots. This paper proposes an alternative approach for representing spatial interaction states for multi-robot reinforcement learning in swarms. A modular (decomposed) representation is used, where each feature of the state is handled by a separate learning procedure, and the results aggregated. We demonstrate the efficacy of the approach in numerous experiments with simulated robot swarms carrying out foraging.
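The memory argument behind a decomposed representation can be made concrete with a small tabular sketch. The abstract does not specify the learning rule or how module outputs are aggregated, so the per-feature Q-learning update and the summation below are assumptions, and the class and feature encoding are hypothetical.

```python
class ModularQ:
    """One small Q-table per state feature; action values are summed."""

    def __init__(self, n_features, actions, alpha=0.1, gamma=0.9):
        self.actions = list(actions)
        self.alpha, self.gamma = alpha, gamma
        # tables[i][(feature_value, action)] -> Q estimate of module i
        self.tables = [{} for _ in range(n_features)]

    def q(self, state, action):
        """Aggregate (here: sum) the per-feature estimates."""
        return sum(t.get((f, action), 0.0) for t, f in zip(self.tables, state))

    def best_action(self, state):
        return max(self.actions, key=lambda a: self.q(state, a))

    def update(self, state, action, reward, next_state):
        """Each module runs its own Q-learning update on its own feature."""
        for t, f, f_next in zip(self.tables, state, next_state):
            best_next = max(t.get((f_next, a), 0.0) for a in self.actions)
            old = t.get((f, action), 0.0)
            target = reward + self.gamma * best_next
            t[(f, action)] = old + self.alpha * (target - old)


# Table sizes for 4 features with 5 values each and 3 actions.
joint_entries = 5 ** 4 * 3    # one table over all feature combinations
modular_entries = 4 * 5 * 3   # one small table per feature
```

In this example the joint table needs 1875 entries against 60 for the modular one, and the gap widens exponentially with the number of features. The price is that interactions between features are only captured through the aggregation.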

#19 h=5

HDFlow: Hierarchical Diffusion-Flow Planning for Long-horizon Tasks

2026-05-06 cs.RO Nandiraju Gireesh (h=5)

Nandiraju Gireesh, Yuanliang Ju, Chaoyi Xu, Weiheng Liu, Yuxuan Wan

Core Contributions

Decouples strategic subgoal generation (diffusion model in latent space) from dense trajectory generation (rectified flow model), exploiting diffusion's exploration strength and flow's ODE-based speed
The high-level planner generates subgoal sequences in a learned latent space, avoiding the curse of long-horizon sampling that makes single-level diffusion planners brittle on assembly tasks
Significantly outperforms state-of-the-art methods on four challenging furniture assembly tasks in both simulation and the real world, demonstrating that hierarchical generative planning scales to practical multi-step manipulation
Shows generalizability beyond assembly on two long-horizon benchmarks spanning locomotion and manipulation tasks

Abstract

Recent advances in generative models have shown promise in generating behavior plans for long-horizon, sparse reward tasks. While these approaches have achieved promising results, they often lack a principled framework for hierarchical decomposition and struggle with the computational demands of real-time execution, due to their iterative denoising process. In this work, we introduce Hierarchical Diffusion-Flow (HDFlow), a novel hierarchical planning framework that optimally leverages the strengths of diffusion and rectified flow models to overcome the limitations of single-paradigm generative planners. HDFlow employs a high-level diffusion planner to generate sequences of strategic subgoals in a learned latent space, capitalizing on diffusion's powerful exploratory capabilities. These subgoals then guide a low-level rectified flow planner that generates smooth and dense trajectories, exploiting the speed and efficiency of ordinary differential equation (ODE)-based trajectory generation. We evaluate HDFlow on four challenging furniture assembly tasks in both simulation and real-world, where it significantly outperforms state-of-the-art methods. Furthermore, we also showcase our method's generalizability on two long-horizon benchmarks comprising diverse locomotion and manipulation tasks. Project website: https://hdflow-page.github.io/
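The speed advantage of ODE-based generation comes from needing only a few integration steps when the learned paths are close to straight lines. The sketch below illustrates this with a hand-written velocity field standing in for a learned rectified flow model; it is not HDFlow's planner, and the 2-D start point, subgoal, and function names are hypothetical.

```python
def euler_integrate(velocity, x0, n_steps=4):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def straight_line_field(subgoal):
    """Hypothetical stand-in for a learned flow conditioned on a subgoal."""
    def velocity(x, t):
        # On a straight path x_t = (1 - t) * x0 + t * goal, the velocity
        # equals (goal - x_t) / (1 - t), which is constant along the path.
        return [(g - xi) / (1.0 - t) for g, xi in zip(subgoal, x)]
    return velocity


start = [0.0, 0.0]
subgoal = [1.0, 2.0]  # would come from the high-level planner
field = straight_line_field(subgoal)

one_step = euler_integrate(field, start, n_steps=1)
four_steps = euler_integrate(field, start, n_steps=4)
```

For a perfectly straight flow, one Euler step already lands on the subgoal. Curved flows need more steps, but typically far fewer than an iterative denoising chain.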

#21 h=4

Dream-MPC: Gradient-Based Model Predictive Control with Latent Imagination

2026-05-06 cs.LG, cs.AI, cs.RO Sven Behnke (h=4)

Jonathan Spieler, Sven Behnke

Core Contributions

Generates a few candidate trajectories from a policy rollout, then optimizes each via gradient ascent through a learned world model, combining the sample efficiency of gradients with the robustness of multiple initializations
Introduces uncertainty regularization and action amortization across time steps, addressing the two main failure modes of gradient-based planning: overconfident model exploitation and cold-start optimization
Outperforms gradient-free MPC and state-of-the-art baselines on 24 continuous control tasks, challenging the prevailing view that gradient-based planning is inherently inferior to sampling-based alternatives
A planned open-source release (the authors state that the code will be made available) should enable direct comparison and further development of gradient-based latent MPC methods

Show abstract

State-of-the-art model-based Reinforcement Learning (RL) approaches either use gradient-free, population-based methods for planning, learned policy networks, or a combination of policy networks and planning. Hybrid approaches that combine Model Predictive Control (MPC) with a learned model and a policy prior to leverage the advantages of both paradigms have shown promising results. However, these approaches typically rely on gradient-free optimization methods, which can be computationally expensive for high-dimensional control tasks. While gradient-based methods are a promising alternative, recent works have empirically shown that gradient-based methods often perform worse than their gradient-free counterparts. We propose Dream-MPC, a novel approach that generates few candidate trajectories from a rolled-out policy and optimizes each trajectory by gradient ascent using a learned world model, uncertainty regularization and amortization of optimization iterations over time by reusing previously optimized actions. Our results on 24 continuous control tasks show that Dream-MPC can significantly improve the performance of the underlying policy and can outperform gradient-free MPC and state-of-the-art baselines. We will open source our code and more at https://dream-mpc.github.io.
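A minimal sketch of the planning loop described in the abstract, under loud assumptions: the "world model" is a toy 1-D integrator, an ensemble of three gains stands in for model uncertainty, finite differences stand in for backpropagation through a learned model, and random action sequences stand in for policy rollouts. None of these names or constants come from the paper.

```python
import random

DT, HORIZON, GOAL = 0.1, 10, 1.0
GAINS = [0.9, 1.0, 1.1]  # toy "ensemble" of world models (assumed)


def rollout(x0, actions, gain):
    """Toy world model: x' = x + gain * a * DT."""
    x = x0
    for a in actions:
        x = x + gain * a * DT
    return x


def objective(x0, actions, beta=1.0):
    """Return of the mean prediction minus an ensemble-disagreement penalty."""
    finals = [rollout(x0, actions, g) for g in GAINS]
    mean = sum(finals) / len(finals)
    var = sum((f - mean) ** 2 for f in finals) / len(finals)
    effort = 0.01 * sum(a * a for a in actions)
    return -(mean - GOAL) ** 2 - effort - beta * var


def grad(x0, actions, eps=1e-5):
    """Central finite differences, a stand-in for autodiff through the model."""
    g = []
    for i in range(len(actions)):
        up = actions[:i] + [actions[i] + eps] + actions[i + 1:]
        dn = actions[:i] + [actions[i] - eps] + actions[i + 1:]
        g.append((objective(x0, up) - objective(x0, dn)) / (2 * eps))
    return g


def plan(x0, candidates, iters=50, lr=2.0):
    """Refine every candidate by gradient ascent and keep the best one."""
    best = None
    for actions in candidates:
        actions = list(actions)
        for _ in range(iters):
            step = grad(x0, actions)
            actions = [a + lr * gi for a, gi in zip(actions, step)]
        score = objective(x0, actions)
        if best is None or score > best[0]:
            best = (score, actions)
    return best


random.seed(0)
policy_candidates = [[random.uniform(-1, 1) for _ in range(HORIZON)]
                     for _ in range(3)]
score, actions = plan(0.0, policy_candidates)

# Amortization: shift the optimized plan by one step and reuse it as a
# candidate at the next control step instead of starting from scratch.
warm_start = actions[1:] + [actions[-1]]
```

The warm-start line mirrors the idea of reusing previously optimized actions; in a real controller it would be added to the candidate set at the next time step.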

Tactile & Contact-Rich Manipulation

#4 h=19

Reduced-order Neural Modeling with Differentiable Simulation for High-Detail Tactile Perception

2026-05-06 cs.RO, cs.CV Guoxin Fang (h=19)

Yuhu Guo, Zhikai Shen, Jiasheng Qu, Chenghao Qian, Yuming Huang

Core Contributions

Couples coarse-grained MPM dynamics with an implicit neural decoder that reconstructs sub-particle tactile detail from compact latent states — achieving high-resolution output without high-resolution simulation cost
Delivers over 65% faster simulation and 40% lower memory usage compared to TacIPC while maintaining better geometric fidelity, directly addressing the computational bottleneck blocking tactile sim-to-real transfer
Learns a continuous deformation manifold from paired high/low-resolution simulations, enabling differentiable end-to-end optimization through the tactile sensing pipeline
Improves tactile rendering and 3D surface reconstruction accuracy by 25%, producing realistic depth images and meshes suitable for downstream manipulation policy training

Show abstract

Tactile perception is key to dexterous manipulation, yet simulating high-resolution elastomer deformation remains computationally prohibitive. Finite element methods (FEM) deliver high fidelity but demand costly remeshing, while Material Point Methods (MPM) suffer from heavy particle-memory tradeoffs. We propose a reduced-order neural simulation framework that couples coarse-grained MPM dynamics with an implicit neural decoder to reconstruct sub-particle tactile details from compact latent states. The framework learns a continuous deformation manifold from paired high- and low-resolution simulations, enabling physically consistent, differentiable inference. Compared to TacIPC, our method achieves over 65% faster simulation and 40% lower memory usage, while maintaining better geometric fidelity. In tactile rendering and 3D surface reconstruction, our method further improves accuracy by 25% and produces realistic depth images and surface meshes with faster inference. These results demonstrate that the proposed reduced-order neural model enables high-detail, physically grounded tactile simulation with substantial efficiency gains for robotic interaction and optimization.

#10 h=9

From Reach to Insert: Tactile-Augmented Precision Assembly under Sub-Millimeter Tolerances

2026-05-06 cs.RO Houcheng Li (h=9)

Xinpan Meng, Siyao Huang, JingPu Yang, Muyuan Ma, Zhenghua Ma

Core Contributions

Combines IL for reaching with RL for insertion in a two-stage pipeline, where the RL stage enables recovery from contact failures that pure imitation cannot handle under sub-millimeter clearances
Introduces tactile group sampling that increases coverage of critical contact segments during training, ensuring the policy encounters the informative force patterns needed for tight-tolerance insertion
A tactile critic more accurately evaluates policy values during contact-rich interactions, improving insertion success while keeping forces low — reducing maximum interaction force by 60% and torque by 44%
Achieves a 67% success rate at the most challenging 0.05 mm clearance in experiments spanning five hole geometries and three clearance settings, supporting the case for tactile feedback in assembly tasks where sub-millimeter pose errors cause jamming

Show abstract

High-precision assembly frequently involves tight-tolerance insertions, where even slight pose errors can cause jamming or excessive interaction forces, making robust and safe insertion policies difficult to obtain. This paper proposes a tactile-augmented two-stage method that combines Imitation Learning (IL) and Reinforcement Learning (RL) for precision insertion tasks. In the first stage, IL learns a reaching policy with position generalization that grasps the peg and brings it to the vicinity of the target region. In the second stage, RL executes the insertion and enables recovery from failures during contact-rich interactions. To better exploit tactile feedback, we introduce tactile group sampling to increase coverage of critical contact segments during training, and design a tactile critic to more accurately evaluate policy values, improving insertion performance while maintaining low contact forces. We conduct systematic experiments across five hole geometries and three clearance settings. Results show that our method substantially improves insertion performance across all settings; under the most challenging 0.05 mm clearance, it achieves a 67% success rate while keeping contact forces low, reducing the maximum interaction force by 60% and torque by 44%, thereby validating both effectiveness and safety for precision assembly.

#26 h=3

Active Contact Sensing for Robust Robot-to-Human Object Handover

2026-05-06 cs.RO Linfeng Li (h=3)

Linfeng Li, Lin Shao, David Hsu

Core Contributions

Proposes active information-gathering motions during handover: the robot applies deliberate perturbations and senses resulting forces to distinguish firm grasps from incidental touches — a paradigm shift from passive sensing
Models contact state with a Bayesian linear model over piecewise-linear mappings from robot motions to human forces, enabling principled firm-grasp detection with uncertainty quantification
Achieves a 97.5% success rate across 12 participants and 30 diverse rigid objects, over 30% higher than two common baselines, indicating that active sensing generalizes across objects and human behaviors where passive sensing struggles
The key insight is that a firm grasp produces force responses in multiple directions while accidental touches do not, and active perturbation is necessary to elicit this distinguishing signature

Show abstract

Robot-to-human object handover is an essential skill for robot assistants, from serving drinks at home to passing surgical tools in the operating room. We expect robots to perform handover robustly: to release the object only after a firm human grasp while ignoring incidental touches. Existing passive-sensing methods struggle to generalize across diverse objects and human behaviors, as they lack informative perturbations to disambiguate different contact conditions, such as firm grasp versus incidental touch. We propose an active sensing approach for robust handovers: the robot applies information-gathering motions and senses the resulting human-applied forces to infer the contact state. A firm grasp produces forces in multiple directions, while an accidental touch does not. To capture this distinction, we model the contact state with a Bayesian linear model: a distribution over piecewise-linear mappings from robot motions to human-applied forces. This model enables firm grasp detection and active information gathering. In experiments with 12 participants and 30 diverse rigid objects, our method achieved a 97.5% success rate, over 30% higher than two common baselines.
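The Bayesian idea in the abstract can be sketched with a much simpler model than the paper's: one scalar slope per probing axis (force = k * motion + noise) with a Gaussian prior, instead of a distribution over piecewise-linear mappings. The prior, noise level, thresholds, and units (mm, N) below are all assumptions chosen for illustration.

```python
import math


def posterior(motions, forces, prior_var=10.0, noise_var=0.05):
    """Posterior mean and variance of k in force = k * motion + noise,
    under a zero-mean Gaussian prior on k."""
    precision = 1.0 / prior_var + sum(m * m for m in motions) / noise_var
    mean = sum(m * f for m, f in zip(motions, forces)) / noise_var / precision
    return mean, 1.0 / precision


def prob_above(mean, var, threshold):
    """P(k > threshold) when k ~ N(mean, var)."""
    z = (mean - threshold) / math.sqrt(var)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def is_firm_grasp(data_by_axis, stiffness_threshold=1.0, confidence=0.95):
    """Firm grasp: the response is confidently stiff along every probed axis."""
    for motions, forces in data_by_axis.values():
        mean, var = posterior(motions, forces)
        if prob_above(mean, var, stiffness_threshold) < confidence:
            return False
    return True


def next_probe_axis(data_by_axis):
    """Active sensing heuristic: probe the axis with the largest uncertainty."""
    return max(data_by_axis, key=lambda ax: posterior(*data_by_axis[ax])[1])


# Perturbations (mm) and measured forces (N) per axis; made-up numbers.
firm = {"x": ([1.0, -1.0, 2.0], [2.1, -1.9, 4.0]),
        "y": ([1.0, -1.0, 2.0], [1.9, -2.0, 4.1])}
touch = {"x": ([1.0, -1.0, 2.0], [2.0, -2.0, 4.1]),
         "y": ([1.0, -1.0, 2.0], [0.05, -0.02, 0.1])}
```

The incidental touch responds along one axis only, so the multi-direction test rejects it, which is the distinction the abstract describes. The posterior variance also gives a natural signal for where to perturb next.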

Navigation, SLAM & Localization

#9 h=9

Dr-PoGO: Direct Radar Pose-Graph Optimization

2026-05-06 cs.RO C. Gentil (h=9)

Cedric Le Gentil, Weican Li, Leonardo Brizi, Timothy D. Barfoot

Core Contributions

Leverages direct registration on raw radar scans for both odometry (DRO) and loop-closure registration rather than relying on extracted point clouds or features, avoiding information loss from feature extraction pipelines
Introduces a coarse-to-fine loop-closure registration that uses visual features for initial alignment followed by direct transformation refinement, bridging the gap where RaPlace provides candidates without relative transforms
Demonstrates state-of-the-art SLAM performance over 300 km of real-world automotive data across various environments; the radar modality is motivated by its ability to see through dust, falling snow, and rain, which degrade cameras and lidars
Publicly released implementation enables reproducibility and provides a strong radar SLAM baseline for the community

Show abstract

This paper introduces Dr-PoGO, a method for Simultaneous Localization And Mapping (SLAM) using a 2D spinning radar. Unlike cameras or lidars that require line-of-sight, millimetre-wave radars can 'see' through dust, falling snow, rain, etc. Accordingly, it is a great modality for robust perception regardless of the weather conditions. While most existing radar-based SLAM methods rely on the extraction of point clouds or features to perform ego-motion estimation, Dr-PoGO leverages direct registration techniques for odometry (DRO) and loop-closure registration. An off-the-shelf radar-focused place recognition algorithm, RaPlace, provides loop-closure candidates. As RaPlace does not provide relative transformations, Dr-PoGO introduces a coarse-to-fine registration that uses visual features and descriptors to obtain an initial guess for the direct transformation refinement. The global trajectory is optimized in a pose-graph optimization. Dr-PoGO demonstrates state-of-the-art performance over 300 km of data in various real-world automotive environments. Our implementation is publicly available: https://github.com/utiasASRL/dr_pogo.
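To make the pose-graph step concrete, here is a toy 1-D version: odometry constraints between consecutive poses plus one loop-closure constraint, solved by plain gradient descent on the sum of squared residuals. Real radar SLAM operates on SE(2) poses with a nonlinear solver and weighted residuals; the function name, equal weights, and measurements below are assumptions for illustration only.

```python
def optimize_pose_graph(odometry, loop_closure, iters=500, step=0.1):
    """Least-squares 1-D pose graph with pose 0 fixed at the origin.

    odometry[i] measures x[i+1] - x[i]; loop_closure measures x[-1] - x[0].
    """
    n = len(odometry)
    x = [0.0]
    for z in odometry:                 # dead-reckoning initial guess
        x.append(x[-1] + z)
    for _ in range(iters):
        grad = [0.0] * (n + 1)
        for i, z in enumerate(odometry):
            r = (x[i + 1] - x[i]) - z  # odometry residual
            grad[i + 1] += 2 * r
            grad[i] -= 2 * r
        r = (x[n] - x[0]) - loop_closure  # loop-closure residual
        grad[n] += 2 * r
        grad[0] -= 2 * r
        for i in range(1, n + 1):      # pose 0 stays fixed to remove gauge freedom
            x[i] -= step * grad[i]
    return x


# Odometry sums to 3.0 but the loop closure says the end pose is at 2.6.
poses = optimize_pose_graph([1.1, 1.0, 0.9], 2.6)
```

With equal weights the 0.4 disagreement is spread evenly across the four constraints, so the optimized poses end up near 0.0, 1.0, 1.9, and 2.7.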

#17 h=6

AI-Aided Advancements in Autonomous Underwater Vehicle Navigation

2026-05-06 cs.RO A. Sahoo (h=6)

Guy Damari, Zeev Yampolsky, Nadav Cohen, Arup Kumar Sahoo, Jeryes Danial

Core Contributions

Surveys the landscape of AI-aided AUV positioning with a focus on sensor fusion architectures integrating INS, DVL, and cameras for GPS-denied underwater environments
Reviews the emergence of learning-based approaches for inertial dead-reckoning correction and adaptive fusion algorithms, mapping how neural networks complement traditional model-based Kalman filtering
Addresses the fundamental challenge that rapid electromagnetic attenuation makes satellite radio signals unavailable underwater, and examines how AI techniques improve navigation accuracy under that constraint
Provides a roadmap for high-precision underwater navigation, identifying key research directions for achieving fully autonomous deep-sea missions

Show abstract

Autonomous underwater vehicles (AUVs) have become indispensable for deep-sea exploration, spanning critical scientific research and commercial applications. The rapid attenuation of electromagnetic waves renders satellite radio signals unavailable, while the dynamic unpredictability of the marine environment presents formidable navigation challenges. This chapter explores recent advancements in AI-aided AUV positioning, specifically focusing on advanced sensor fusion architectures that integrate inertial navigation systems with Doppler velocity logs and cameras. Beyond traditional model-based filtering, we examine the transformative emergence of AI-driven learning approaches in enhancing inertial dead-reckoning tasks and adaptive fusion algorithms. By addressing these recent milestones, this chapter provides a comprehensive roadmap for achieving the high-precision navigation essential for autonomous underwater missions.

#24 h=3

A Closed-Form Dual-Barrier CBF Safety Filter for Holonomic Robots on Incrementally Built Occupancy Grid Maps

2026-05-06 cs.RO, eess.SY B. Joshi (h=3)

Himanshu Paudel, Basanta Joshi, Dhirendra Raj Madai, Alina Bartaula, Biman Rimal

Core Contributions

Enforces two simultaneous safety constraints — obstacle avoidance and unexplored-region avoidance — derived analytically from the signed distance field of an incrementally built occupancy grid
The closed-form filter requires only a small linear system solve per cycle, making it feasible on resource-constrained platforms like Raspberry Pi where SLAM and planning already consume significant compute
An adaptive gain schedule relaxes the frontier constraint in information-rich regions and tightens it in well-mapped areas, balancing exploration efficiency with safety rather than treating unexplored space uniformly
Hardware flight experiments on a PX4-controlled quadrotor demonstrate zero collisions across multiple indoor runs, supporting the filter's real-time safety claims on actual hardware

Show abstract

We present a dual-barrier control barrier function (CBF) safety filter for real-time, safety-critical velocity control of holonomic robots operating in incrementally built occupancy grid maps. As a robot explores an unknown environment, unmapped regions introduce irreducible uncertainty, since obstacle geometry beyond the explored frontier is unknown, making entry into such regions a source of collision risk, especially with front-facing sensors. To address this, we enforce two constraints: avoidance of mapped obstacles and restriction from unexplored regions. Both constraints are derived analytically from the occupancy grid's signed distance field, yielding a closed-form safety filter that requires only a small linear system solve per cycle. On resource-constrained platforms such as the Raspberry Pi, where SLAM and planning already consume significant compute, the low overhead of the proposed filter preserves resources. An adaptive gain schedule relaxes the frontier constraint in information-rich regions and tightens it in well-mapped areas, improving exploration efficiency while maintaining safety. The filter operates in velocity space as a minimally invasive correction and composes with arbitrary nominal controllers, including learning-based methods. Hardware flight experiments on a PX4-controlled quadrotor demonstrate zero collisions across multiple indoor runs.
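A hedged sketch of what a closed-form two-constraint velocity filter can look like in 2-D: each barrier h contributes a half-space constraint grad_h . u >= -alpha * h, and the minimally invasive correction is found by checking the possible active sets, ending with a 2x2 linear solve when both constraints bind. This is a generic construction, not the paper's derivation; the function names, fixed gains, and stop fallback are assumptions, and the adaptive gain schedule is not modeled.

```python
def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]


def project(u, a, b):
    """Closest point to u on the line a . u = b (a must be non-zero)."""
    s = (b - dot(a, u)) / dot(a, a)
    return (u[0] + s * a[0], u[1] + s * a[1])


def dual_barrier_filter(u_nom, h_obs, grad_obs, h_front, grad_front,
                        alpha_obs=1.0, alpha_front=1.0, tol=1e-9):
    """Smallest change to u_nom such that grad_h . u >= -alpha * h holds
    for both the obstacle barrier and the frontier barrier."""
    cons = [(grad_obs, -alpha_obs * h_obs),
            (grad_front, -alpha_front * h_front)]

    def ok(u, c):
        return dot(c[0], u) >= c[1] - tol

    if all(ok(u_nom, c) for c in cons):
        return u_nom                        # nominal command is already safe
    for i in (0, 1):                        # exactly one constraint active
        if not ok(u_nom, cons[i]):
            u = project(u_nom, *cons[i])
            if ok(u, cons[1 - i]):
                return u
    (a1, b1), (a2, b2) = cons               # both active: 2x2 linear solve
    det = a1[0] * a2[1] - a1[1] * a2[0]
    if abs(det) < tol:
        return (0.0, 0.0)                   # assumed fallback: stop
    return ((b1 * a2[1] - b2 * a1[1]) / det,
            (a1[0] * b2 - a2[0] * b1) / det)


# Obstacle 0.5 m ahead in +x, frontier 1.0 m away in -y (made-up geometry).
u_one = dual_barrier_filter((1.0, 0.0), 0.5, (-1.0, 0.0), 1.0, (0.0, 1.0))
u_two = dual_barrier_filter((1.0, -2.0), 0.5, (-1.0, 0.0), 1.0, (0.0, 1.0))
u_safe = dual_barrier_filter((0.2, 0.0), 0.5, (-1.0, 0.0), 1.0, (0.0, 1.0))
```

The gradients would come from the signed distance field of the occupancy grid; here they are hard-coded unit vectors. Since the whole filter is a few dot products and at most one 2x2 solve, the per-cycle cost stays small.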

#22 h=4

Tightly-Coupled Estimation and Guidance for Robust Low-Thrust Rendezvous via Adaptive Homotopy

2026-05-06 cs.RO, eess.SY Batu Candan (h=4)

Batu Candan, Simone Servadio

Core Contributions

Navigation confidence directly modulates the homotopy parameter of an indirect optimal control solver, creating a tight coupling between estimation quality and guidance aggressiveness that standard decoupled architectures lack
A Multiple Tuning Factors (MTF) covariance inflation mechanism suppresses suspicious innovation directions in the Kalman filter; a composite score built from the normalized innovation and MTF activity drives the adaptive homotopy, so the controller relaxes toward smoother regimes when sensing degrades
Reduces terminal miss from hundreds of meters to sub-meter levels under severe measurement degradation, roughly two orders of magnitude improvement over fixed bang-bang guidance
Demonstrates that adaptive homotopy is the dominant robustness mechanism while MTF provides additional accuracy gains, with consistently fast solution times supporting online viability

Show abstract

Minimum-fuel low-thrust rendezvous guidance yields bang-bang control structures highly sensitive to estimation errors, sensor anomalies, and solver regularization, making aggressive closed-loop execution brittle for uncooperative proximity operations. This paper proposes a tightly-coupled estimation and guidance architecture where navigation confidence directly modulates the homotopy parameter of a receding-horizon indirect optimal control solver. Relative motion is modeled in the Clohessy-Wiltshire frame. The translational state is estimated via a linear Kalman filter augmented by a Multiple Tuning Factors (MTF) covariance inflation mechanism that suppresses suspicious innovation directions. A composite score from the normalized innovation and MTF activity is mapped online to the homotopy parameter, allowing the controller to relax toward a smoother, conservative regime when confidence degrades, and recover fuel-efficient bang-bang control as sensing improves. Numerical results under severe measurement degradation show fixed bang-bang guidance remains brittle; both plain-KF and MTF-KF fixed-epsilon controllers yield large terminal miss distances. Conversely, the proposed MTF-adaptive homotopy controller reduces terminal miss by roughly two orders of magnitude, from hundreds of meters to sub-meter levels, requiring only a moderate increase in control effort versus the open-loop fuel-optimal benchmark. A comparison indicates adaptive homotopy is the dominant robustness mechanism, while MTF provides additional accuracy and efficiency improvements. The receding-horizon implementation exhibits consistently fast and reliable solution times, supporting the practical online viability of the proposed method.
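The coupling between confidence and control smoothness can be illustrated with a common quadratic homotopy for fuel-optimal problems, where the throttle u in [0, 1] minimizes S*u - eps*u*(1 - u) for a switching function S. Setting the derivative to zero gives u = (eps - S) / (2*eps), clipped to [0, 1], which approaches bang-bang control as eps goes to zero. Whether the paper uses this exact smoothing is an assumption, and the linear score-to-epsilon map and its bounds are hypothetical.

```python
def clip(x, lo, hi):
    return max(lo, min(hi, x))


def epsilon_from_score(score, eps_min=0.01, eps_max=1.0):
    """Map a degradation score in [0, 1] to the homotopy parameter.

    0 means high navigation confidence (near bang-bang control);
    1 means badly degraded sensing (smooth, conservative control).
    The linear map is a hypothetical choice.
    """
    return eps_min + (eps_max - eps_min) * clip(score, 0.0, 1.0)


def throttle(switching, eps):
    """Throttle in [0, 1] that minimizes S*u - eps*u*(1 - u)."""
    return clip((eps - switching) / (2.0 * eps), 0.0, 1.0)


# Same switching-function value, different confidence levels.
sharp = throttle(0.2, epsilon_from_score(0.0))   # confident: engine off
smooth = throttle(0.2, epsilon_from_score(1.0))  # degraded: partial thrust
```

For the same switching-function value, the confident case commands a hard off while the degraded case commands an intermediate thrust level, which is the relaxation behavior the abstract describes.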

Control, Calibration & Dynamics

#11 h=8

Koopman Identification of Nonlinear Systems via Reservoir Liftings

2026-05-06 cs.LG, cs.RO Lu Shi (h=8)

Weibin Gu, Chen Yang, Lu Shi

Core Contributions

Reinterprets echo state networks as stateful Koopman dictionaries whose temporal depth is controlled by spectral radius, unifying reservoir computing and Koopman operator theory in a principled framework
The Echo State Property guarantees well-posedness and favorable numerical conditioning of the lifted approximation, addressing the ill-conditioning problems that plague standard EDMD with large dictionaries
A correlation-based spectral radius selection algorithm automatically aligns reservoir memory with dominant system timescales, removing a critical manual tuning step from Koopman identification
Achieves a favorable balance between reconstruction accuracy and dynamical stability on synthetic benchmarks compared to EDMD and Hankel-based lifting, with publicly available code

Show abstract

Learning tractable linear representations of nonlinear dynamical systems via Koopman operator theory is often hindered by dictionary selection, temporal memory encoding, and numerical ill-conditioning. Inspired by the Reservoir Computing (RC) paradigm, this paper introduces the RC-Koopman framework, which interprets the reservoir as a stateful, finite-dimensional Koopman dictionary whose temporal depth is explicitly controlled by its spectral radius. We show that the Echo State Property (ESP) guarantees well-posedness and favorable numerical conditioning of the lifted Koopman approximation. A correlation-based spectral radius selection algorithm aligns reservoir memory with dominant system timescales. Analysis reveals how the finite memory of the reservoir determines which Koopman eigenfunctions remain observable from the lifted features. Evaluation on synthetic benchmarks demonstrates that RC-Koopman achieves a favorable balance between reconstruction accuracy of the underlying nonlinear dynamics and dynamical stability, compared to Extended Dynamic Mode Decomposition (EDMD) and Hankel-based lifting approaches. Code available at: https://github.com/NEAR-the-future/RC-Koopman.git
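A small sketch of a reservoir used as a stateful lifting, and of the Echo State Property in action: two different initial reservoir states driven by the same input converge to the same lifted features. For simplicity the recurrent matrix is rescaled by its infinity norm, which upper-bounds the spectral radius and gives a conservative sufficient condition for contraction; the paper instead selects the spectral radius with a correlation-based algorithm that is not reproduced here. All names and sizes are assumptions.

```python
import math
import random


def make_reservoir(n, scale, seed=0):
    """Random recurrent matrix rescaled so its infinity norm equals `scale`."""
    rng = random.Random(seed)
    w = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
    norm = max(sum(abs(v) for v in row) for row in w)
    return [[scale * v / norm for v in row] for row in w]


def step(w, w_in, r, x):
    """Reservoir update r' = tanh(W r + w_in * x) for a scalar input x."""
    return [math.tanh(sum(wij * rj for wij, rj in zip(row, r)) + wi * x)
            for row, wi in zip(w, w_in)]


def lift(w, w_in, xs, r0):
    """Lifted features [x_t, r_t] along an input sequence."""
    r, feats = list(r0), []
    for x in xs:
        r = step(w, w_in, r, x)
        feats.append([x] + r)
    return feats


xs = [math.sin(0.3 * t) for t in range(40)]
w = make_reservoir(5, scale=0.5)
w_in = [0.5] * 5
a = lift(w, w_in, xs, [1.0] * 5)
b = lift(w, w_in, xs, [-1.0] * 5)
gap = max(abs(p - q) for p, q in zip(a[-1], b[-1]))  # shrinks like 0.5 ** t
```

A linear Koopman approximation would then be fit by least squares between consecutive lifted feature vectors; that step is omitted to keep the sketch dependency-free.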

#14 h=7

Optimal Uncertainty-Aware Calibration for the AX=YB Problem

2026-05-06 cs.RO Han Ding (h=7)

Yanjia Chen, Xiangfei Li, Huan Zhao, Yiyuan Hong, Guanxiao Xia

Core Contributions

Develops an iterative Lie-algebra optimizer for hand-eye calibration that strictly preserves SE(3) structural constraints and synchronizes updates between calibration parameters, unlike methods that decouple rotation and translation
Introduces an uncertainty metric that evaluates relative data quality between sources and dynamically refines the iterative process, avoiding the need for explicit noise modeling that is inherently difficult in industrial settings
Improves estimation accuracy by at least 67% under high-uncertainty conditions compared to existing methods on synthetic datasets, directly addressing the over-loading and large-workspace scenarios common in industrial robots
Includes an effective initial solution generation method that improves convergence stability, making the approach practical for real-world deployment without careful initialization

Show abstract

This article proposes a general optimization framework for solving the hand-eye calibration problem. Unlike traditional methods, an iterative algorithm based on Lie algebra that achieves approximately global optimal solutions is developed. During the optimization process, the method strictly preserves the structural constraints of the calibration parameters and enables synchronized updates between calibration parameters. Recognizing that data used in real-world hand-eye calibration often contain uncertainty, especially in over-loading and large-workspace industrial robot scenarios, which can significantly degrade accuracy, and that accurately modeling such uncertainty is inherently difficult, this article avoids explicit uncertainty modeling. Instead, an uncertainty metric to evaluate the relative uncertainty between data sources is introduced and used to dynamically refine the iterative process. To further enhance convergence efficiency, an effective initial solution generation method that improves overall stability and accuracy is designed. Numerical simulations and real-world experiments validate the effectiveness of the proposed approach, and in synthetic datasets, the proposed approach improves the estimation accuracy by at least 67% under high-uncertainty conditions compared with the existing methods.

#16 h=7

Right Model, Right Time: Real-Time Cascaded-Fidelity MPC for Bipedal Walking

2026-05-06 cs.RO Dennis Mronga (h=7)

Franek Stark, Felix Wiebe, Shubham Vyas, Dennis Mronga, Frank Kirchner

Core Contributions

Combines whole-body dynamics in the near horizon with a simplified single-rigid-body model in later prediction steps, reducing computational complexity while retaining the prediction accuracy needed for stable walking
Solves the resulting nonlinear OCP via sequential quadratic programming in acados, optimizing joint torques directly without depending on pre-selected footstep locations
Requires only a target walking speed and contact schedule as inputs, letting the optimizer discover appropriate foot placements — a more flexible approach than trajectory-library or footstep-planning methods
Validated in MuJoCo simulation on the 18-DoF HyPer-2 bipedal robot, indicating that cascaded-fidelity MPC is viable for real-time bipedal control

Show abstract

This paper presents a multi-phase whole-body model predictive control approach for bipedal walking, combining a detailed whole-body model in the near horizon with a simplified single-rigid-body model in the later prediction steps. This reduces computational complexity while retaining prediction capabilities. The resulting nonlinear optimal control problem is solved using sequential quadratic programming (SQP) in acados. Using a pre-specified contact schedule and a target walking speed, the controller optimizes joint torques without depending on pre-selected footstep locations. The controller is validated in MuJoCo simulation on the 18-DoF bipedal robot HyPer-2.

#8 h=14

LineRides: Line-Guided Reinforcement Learning for Bicycle Robot Stunts

2026-05-06 cs.RO, cs.AI G. Nelson (h=14)

Seungeun Rho, Shamel Fahmi, Jeonghwan Kim, Arianna Ilvonen, Sehoon Ha

Core Contributions

Replaces demonstration-based reward design with user-drawn spatial guidelines and sparse key-orientations, enabling stunt learning on platforms where reference motions are unavailable or physically impossible to capture
Handles physically infeasible guidelines via a tracking margin that permits controlled deviation, and resolves temporal ambiguity by measuring progress via traveled distance rather than wall-clock time
Demonstrates five distinct stunts (MiniHop, LargeHop, ThreePointTurn, Backflip, DriftTurn) on a custom bicycle robot (Ultra Mobility Vehicle), with seamless transitions between normal driving and stunt execution
Position- and sequence-based key-orientations disambiguate motion details that a spatial path alone cannot specify, providing a minimal but sufficient interface for non-expert stunt design

Show abstract

Designing reward functions for agile robotic maneuvers in reinforcement learning remains difficult, and demonstration-based approaches often require reference motions that are unavailable for novel platforms or extreme stunts. We present LineRides, a line-guided learning framework that enables a custom bicycle robot to acquire diverse, commandable stunt behaviors from a user-provided spatial guideline and sparse key-orientations, without demonstrations or explicit timing. LineRides handles physically infeasible guidelines using a tracking margin that permits controlled deviation, resolves temporal ambiguity by measuring progress via traveled distance along the guideline, and disambiguates motion details through position- and sequence-based key-orientations. We evaluate LineRides on the Ultra Mobility Vehicle (UMV) and show that the policy trained with our methods supports seamless transitions between normal driving and stunt execution, enabling five distinct stunts on command: MiniHop, LargeHop, ThreePointTurn, Backflip, and DriftTurn.
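The two mechanisms named in the abstract, progress measured by traveled distance along the guideline and a tracking margin that tolerates deviation, can be sketched with basic polyline geometry. The reward shape below (progress gain minus out-of-margin distance) is a hypothetical illustration rather than the paper's reward, and the guideline is assumed to have distinct consecutive vertices.

```python
import math


def project_to_polyline(p, line):
    """Return (arc_length, distance) of the point on the polyline closest to p."""
    best = (0.0, float("inf"))
    travelled = 0.0
    for (ax, ay), (bx, by) in zip(line, line[1:]):
        dx, dy = bx - ax, by - ay
        seg = math.hypot(dx, dy)
        t = ((p[0] - ax) * dx + (p[1] - ay) * dy) / (seg * seg)
        t = max(0.0, min(1.0, t))          # clamp to the segment
        d = math.hypot(p[0] - (ax + t * dx), p[1] - (ay + t * dy))
        if d < best[1]:
            best = (travelled + t * seg, d)
        travelled += seg
    return best


def tracking_reward(p, prev_progress, line, margin=0.2):
    """Reward progress along the guideline; penalize only deviation that
    exceeds the tracking margin. Returns (reward, new_progress)."""
    progress, dist = project_to_polyline(p, line)
    penalty = max(0.0, dist - margin)
    return (progress - prev_progress) - penalty, progress


guideline = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]   # user-drawn line (made up)
s0, d0 = project_to_polyline((0.5, 0.1), guideline)
reward, s1 = tracking_reward((1.2, 0.5), s0, guideline, margin=0.2)
```

Because progress is an arc length rather than a timestamp, the policy is free to choose its own timing, and deviations inside the margin cost nothing, which leaves room to depart from a guideline that is not physically feasible.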

Hardware, Surgical Robotics & HRI

#12 h=8

3D Printing of Passively Actuated Self-Folding Robots with Integrated Functional Modules

2026-05-06 cs.RO, cs.HC M. Nisser (h=8)

Gaolin Ge, Qifeng Yang, Haoran Lu, Tingyu Cheng, Martin Nisser

Core Contributions

Elastic bands routed through printed hooks store energy that folds flat 3D-printed PLA sheets into programmed 3D geometries without external stimuli — enabling deployment-ready fabrication where electronics are placed before folding
Derives a closed-form folding model balancing hinge stiffness with elastic band moment to predict equilibrium fold angles, validated experimentally to produce a practical design map linking hinge thickness, band size, and hook spacing
The conductive PLA substrate doubles as capacitive touch electrodes and supports a reusable I/O palette with Hall sensors and ERM motors, integrating sensing and actuation into the manufacturing process
Demonstrates three applications — a modular cube for scalable collectives, a deployable gripper, and a tendon-driven finger — showcasing the versatility of the low-cost, stimulus-free approach

Show abstract

We introduce an elastic-driven self-folding approach that fabricates robots directly from flat 3D-printed conductive PLA nets. Elastic bands routed through printed hooks store energy that folds the sheet into programmed 3D geometries, while the flat state allows accurate placement of electronics and magnets before deployment. The same substrate doubles as electrodes for capacitive touch and supports a reusable platform I/O palette with Hall sensors and eccentric rotating mass (ERM) motors for docking detection and vibration actuation. We also derive a closed-form folding model that balances hinge stiffness with elastic band moment to predict equilibrium fold angles; experiments validate the model and yield a design map linking hinge thickness, band size, and hook spacing to target angles. Using this workflow we realize multiple polyhedral modules and demonstrate three applications: a cube that highlights the potential of self-folding for scalable modular robot collectives, a deployable gripper, and a tendon-driven finger. The method is low cost, stimulus-free, and integrates actuation and sensing.

#23 h=4

Autonomous Laparoscope Control through Unified Mechanics-Based Representation of Multimodal Intraoperative Information

2026-05-06 cs.RO Kai Yan (h=4)

Xiaojian Li, Jin Fang, Yudong Shi, Xilin Xiao, Kai Yan

Core Contributions

Unifies position, force/torque, and image signals into an equivalent-wrench representation in operational space, eliminating the unit-mismatch problem that makes multimodal fusion for surgical robotics ad hoc
Uses task-priority projection to inject wrenches into task space and null space, enabling simultaneous RCM constraint enforcement, compliant laparoscope manipulation, and autonomous instrument tracking
Reduces sustained trocar-site loading while maintaining the RCM constraint — directly addressing a surgical safety concern where excessive forces risk patient injury at the incision site
Validated on surgical phantom and in vivo porcine trials, demonstrating multi-task operation in clinically relevant conditions

Show abstract

Laparoscope-holding robots can provide surgeons with a stable laparoscopic field of view (FOV) and reduce the burden on human assistants. To maintain an ideal intraoperative FOV, the robot must continuously adjust the laparoscope pose according to intraoperative information. However, intraoperative multimodal signals, such as position, force/torque, and images, differ markedly in physical meaning and units, making it difficult to build a unified representation and to generate control commands that can be used directly for laparoscope control. To address this issue, we propose a laparoscope-holding robot control method based on unified mechanics modeling of multimodal information. First, we design mapping strategies for multiple intraoperative sources, including position, force/torque, and images, and unify them into an equivalent-wrench representation in the operational space. Then, using a task-priority scheme, we inject the wrenches into the task space and the null space, respectively, and synthesize laparoscope control commands via task-priority projection, thereby achieving consistent representation and coordinated fusion of multimodal information within a single framework. Finally, taking the intraoperative remote center of motion (RCM) position, force/torque sensor readings, and laparoscopic images as examples, we construct an RCM-constraint wrench to enforce the RCM geometric constraint and reduce the contact force at the trocar site, a laparoscope-manipulation wrench to enable compliant dragging, and an instrument-tracking wrench to achieve autonomous visual tracking of the instruments. Experiments on a surgical phantom and in vivo porcine trials demonstrate that the proposed method supports multi-task operation, including compliant laparoscope manipulation and autonomous instrument tracking, while maintaining the RCM constraint and reducing sustained trocar-site loading.
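Task-priority projection can be illustrated at the velocity level with a single-row primary Jacobian: the primary task is met exactly, and the secondary command is filtered through the null-space projector I - J1^+ J1 so it cannot disturb it. The paper applies the same principle to equivalent wrenches in operational space; this kinematic analogue, its function name, and its numbers are assumptions for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def prioritized_command(j1, xdot1, qdot2):
    """Joint velocity that achieves the 1-D primary task j1 . qdot = xdot1
    and follows the secondary command qdot2 only within the null space of j1."""
    jj = dot(j1, j1)
    primary = [ji * xdot1 / jj for ji in j1]                  # J1^+ xdot1
    leak = dot(j1, qdot2) / jj
    secondary = [q - ji * leak for q, ji in zip(qdot2, j1)]   # (I - J1^+ J1) qdot2
    return [p + s for p, s in zip(primary, secondary)]


j1 = [1.0, 2.0, 0.0]      # primary task row, e.g. one constraint direction
qdot2 = [0.3, -0.1, 0.4]  # secondary command, e.g. a tracking motion
qdot = prioritized_command(j1, 0.5, qdot2)
```

The third joint does not affect the primary task, so the secondary command passes through unchanged there, while the first two joints are adjusted so the primary task rate stays at exactly 0.5.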

#25 h=3

Gaze4HRI: Zero-shot Benchmarking Gaze Estimation Neural-Networks for Human-Robot Interaction

2026-05-06 cs.CV, cs.HC, cs.LG, cs.RO S. Kalkan (h=3)

Berk Sezer, Ali Görkem Küçük, Erol Şahin, Sinan Kalkan

Core Contributions

Introduces a large-scale benchmark (50+ subjects, 600K+ frames) specifically designed for HRI conditions with dynamic cameras, moving targets, and varying illumination — filling a critical evaluation gap
Reveals that steeply-downward gaze is a universal failure point for all evaluated methods, identifying a concrete blind spot for HRI deployments where users look down at tabletop tasks
Finds that data diversity (ETH-X-Gaze dataset) is the primary driver of zero-shot robustness, challenging the recent literature's emphasis on complex spatiotemporal architectures and Transformers
Shows that PureGaze trained on ETH-X-Gaze is the only evaluated method that stays resilient across all tested conditions except steeply-downward gaze, with its self-adversarial loss for gaze feature purification adding a substantial further improvement on top of data diversity

Abstract

While zero-shot appearance-based 3D gaze estimation offers significant cost-efficiency by directly mapping RGB images to gaze vectors, its reliability in Human-Robot Interaction (HRI) settings remains uncertain. Existing benchmarks frequently overlook fundamental HRI conditions, such as dynamic camera viewpoints and moving targets in video. Furthermore, current cross-dataset evaluations often suffer from a complexity gap, where methods trained on diverse datasets are tested on significantly smaller and less varied sets, failing to assess true robustness. To bridge these gaps, we introduce Gaze4HRI, a large-scale dataset (50+ subjects, 3,000+ videos, 600,000+ frames) designed to evaluate state-of-the-art performance against critical HRI variables: illumination, head-gaze conflict, as well as the motion of camera and gaze target in video. Our benchmark reveals that all evaluated methods fail in at least one condition, identifying steeply-downward gaze as a universal failure point. Notably, PureGaze trained on the ETH-X-Gaze dataset uniquely maintains resilience across all other conditions. These results challenge the recent focus in the literature on complex spatial-temporal modeling and Transformer-based architectures. Instead, our findings suggest that extensive data diversity, as exemplified by the ETH-X-Gaze dataset, serves as the primary driver of zero-shot robustness in unconstrained environments, while resilience-enhancing frameworks, such as PureGaze's self-adversarial loss for gaze feature purification, provide a substantial further improvement. Ultimately, this study establishes a rigorous benchmark that provides practical guidelines for practitioners as well as reshaping future research. The dataset and codes are available at https://gazeforhri.github.io.
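The abstract does not state how robustness is scored. Appearance-based 3D gaze estimation is commonly evaluated by the angular error between predicted and ground-truth gaze vectors, so the sketch below assumes that metric and averages it per test condition. The function names and condition labels are hypothetical and are not taken from the Gaze4HRI code release.

```python
import numpy as np


def angular_error_deg(pred, true):
    """Angle in degrees between predicted and ground-truth 3D gaze vectors."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    # Normalize both inputs so the dot product equals the cosine of the angle.
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    true = true / np.linalg.norm(true, axis=-1, keepdims=True)
    # Clip to guard against rounding slightly outside [-1, 1].
    cos = np.clip(np.sum(pred * true, axis=-1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))


def mean_error_by_condition(pred, true, conditions):
    """Mean angular error for each condition label (labels are hypothetical)."""
    errors = angular_error_deg(pred, true)
    conditions = np.asarray(conditions)
    return {
        str(c): float(errors[conditions == c].mean())
        for c in np.unique(conditions)
    }
```

Grouping errors by condition in this way is what lets a per-condition failure, such as the steeply-downward gaze case reported above, show up rather than being averaged away in a single aggregate number.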

#18 h=5

Position: Embodied AI Requires a Privacy-Utility Trade-off

2026-05-06 cs.AI, cs.RO Ziqi Yang (h=5)

Xiaoliang Fan, Jiarui Chen, Zhuodong Liu, Ziqi Yang, Peixuan Xu

Core Contributions

Argues that privacy in embodied AI is a lifecycle-level architectural constraint rather than a per-stage feature, since optimizing instruction, perception, planning, and interaction independently creates systemic privacy vulnerabilities
Proposes SPINE (Secure Privacy Integration in Next-generation Embodied AI), treating privacy as a dynamic control signal that governs cross-stage coupling throughout the entire EAI pipeline
Introduces a multi-criterion privacy classification matrix to orchestrate contextual sensitivity across stage boundaries, enabling context-dependent privacy enforcement rather than blanket restrictions
Uses preliminary simulation and real-world case studies to conceptually validate how privacy constraints propagate downstream and reshape system behavior, illustrating that fragmented privacy patches are insufficient

Abstract

Embodied AI (EAI) systems are rapidly transitioning from simulations into real-world domestic and other sensitive environments. However, recent EAI solutions have largely demonstrated advancements within isolated stages such as instruction, perception, planning and interaction, without considering their coupled privacy implications in high-frequency deployments where privacy leakage is often irreversible. This position paper argues that optimizing these components independently creates a systemic privacy crisis when deployed in sensitive settings, thereby advancing the position that privacy in EAI is a life cycle-level architectural constraint rather than a stage-local feature. To address these challenges, we propose Secure Privacy Integration in Next-generation Embodied AI (SPINE), a unified privacy-aware framework that treats privacy as a dynamic control signal governing cross-stage coupling throughout the entire EAI life cycle. SPINE decomposes the EAI pipeline into various stages and establishes a multi-criterion privacy classification matrix to orchestrate contextual sensitivity across stage boundaries. We conduct preliminary simulation and real-world case studies to conceptually validate how privacy constraints propagate downstream to reshape system behavior, illustrating the insufficiency of fragmented privacy patches and motivating future research directions into secure yet functional embodied AI systems. We detail the SPINE framework and case studies at https://github.com/rminshen03/EAI_Privacy_Position.