🤖 Robotics arXiv Digest

📅 April 13, 2026 📄 30 Papers 🗂️ 7 Research Areas ✨ Generated by Claude

Research Landscape

The robotics field is experiencing a remarkable convergence of three major paradigms: vision-language-action (VLA) foundation models enabling semantic reasoning, learning-based control approaches replacing hand-crafted policies, and embodied AI systems that ground abstract reasoning in physical interaction. Papers like StarVLA-α and Grounded World Model exemplify how large-scale vision-language pretraining is moving from pure perception toward actionable planning, while simultaneous advances in robot learning through simulation (from UGE-TO's uncertainty-guided trajectories to ComSim's compositional simulation) are systematically closing the sim-to-real gap. Manipulation tasks benefit from this alignment: ViserDex combines differentiable rendering with reinforcement learning for dexterous in-hand tasks, while AffordSim builds the open-vocabulary affordance benchmark needed for zero-shot generalization. The field shows particular maturity in hybrid classical-learning approaches (complementarity-by-construction solvers, neural+classical simulation) where interpretability and performance are no longer in tension.

Cross-cutting innovations suggest robotics is entering a phase of practical autonomy at scale. Multi-robot coordination papers (Dynamic Multi-Robot Task Allocation, Multi-ORFT) now handle the realistic constraints (communication delays, uncertainty, cooperative objectives) that determine real-world deployment viability. Human-robot interaction is rapidly shifting from scripted gestures toward intent-aware collaboration: Safe Human-to-Humanoid Motion Imitation uses control barrier functions to fuse vision-based human understanding with safety guarantees, while M2HRI validates personality-driven multi-agent interaction with persistent memory in a 105-participant user study. On the embodied cognition side, papers like Minimal Embodiment Enables Efficient Learning show that robots can develop compact, biologically plausible number representations from minimal interaction, a finding with implications for how embodiment constraints shape learning. Meanwhile, specialized domains (medical robotics, underwater reconstruction, racing perception) are applying these foundation-level innovations, with ReefMapGS closing the loop between SLAM and Gaussian splatting for large-scale underwater exploration and EagleVision establishing cross-domain benchmarks for perception in high-speed contexts.

A unifying theme is the shift toward systems that **learn to forget** and **adapt incrementally**. H²-EMV's hierarchical episodic memory with selective forgetting achieves 45% memory reduction while improving query accuracy, suggesting that scaling embodied AI demands new data management paradigms beyond standard replay buffers. Similarly, WM-DAgger uses world models as priors for efficient imitation learning, while RAPO tackles fundamental distributional shift under dynamics uncertainty via Boltzmann reweighting. Temporal reasoning and formal verification (Ternary Logic Encodings of Temporal BTs) are gaining prominence as systems become safety-critical. The papers collectively point toward a near-term future where robotics applications will be constrained not by algorithmic capability but by data efficiency, sim-to-real generalization, and the ability to ground abstract reasoning in heterogeneous sensor modalities and embodied constraints.

VLA & Foundation Models

Language-aligned vision-action systems

4 papers

Robot Learning & Sim-to-Real

Bridging simulation and physical deployment

6 papers

Manipulation & Grasping

Dexterous control and affordance learning

4 papers

Planning & Control

Motion synthesis and optimization methods

5 papers

Navigation & Perception

SLAM, odometry, and environmental understanding

3 papers

Human-Robot Interaction & Embodied Cognition

Collaboration, gesture, and learning through embodiment

6 papers

Multi-Robot Systems

Coordination and decentralized deployment

2 papers

VLA & Foundation Models

Paper 12: Grounded World Model (GWM)
cs.RO cs.AI
Authors: Quanyi Li et al.
Core Contributions
  • Proposes GWM operating in vision-language-aligned latent space, enabling semantic understanding of scene objectives without task-specific fine-tuning
  • Achieves 87% success on WISER benchmark vs. 22% for traditional vision-language-action baselines, demonstrating 4x improvement in semantic generalization
  • Novel approach bridges low-level visual features and high-level semantic concepts through joint embedding space
  • Enables zero-shot planning for unseen task combinations by leveraging grounded semantic relationships
World models learn latent representations of environment dynamics that support planning. However, they typically learn low-level spatiotemporal patterns and struggle with high-level semantic understanding. We propose Grounded World Model (GWM), which grounds world models in vision-language representations, enabling semantic understanding of scene properties and task requirements. GWM learns a unified latent space that aligns visual features, language embeddings, and action representations. This enables semantically generalizable planning across diverse manipulation tasks. We evaluate GWM on the WISER benchmark with unseen task combinations and demonstrate significant improvements in semantic generalization and robustness.
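
To make the joint embedding idea concrete, here is a minimal sketch of projecting vision, language, and action features into one shared latent space and aligning matched pairs contrastively. All module names, dimensions, and the InfoNCE loss are illustrative assumptions, not GWM's actual architecture.

```python
# Sketch of a joint vision-language-action latent space (sizes assumed).
import torch
import torch.nn.functional as F

class JointLatentSpace(torch.nn.Module):
    def __init__(self, vis_dim=512, lang_dim=768, act_dim=32, latent_dim=256):
        super().__init__()
        self.vis_proj = torch.nn.Linear(vis_dim, latent_dim)
        self.lang_proj = torch.nn.Linear(lang_dim, latent_dim)
        self.act_proj = torch.nn.Linear(act_dim, latent_dim)

    def forward(self, vis, lang, act):
        # Project each modality into the shared space and L2-normalize.
        z_v = F.normalize(self.vis_proj(vis), dim=-1)
        z_l = F.normalize(self.lang_proj(lang), dim=-1)
        z_a = F.normalize(self.act_proj(act), dim=-1)
        return z_v, z_l, z_a

def infonce(z_a, z_b, temperature=0.07):
    # Symmetric InfoNCE: matched pairs in the batch attract, mismatched repel.
    logits = z_a @ z_b.T / temperature
    labels = torch.arange(z_a.shape[0])
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
```

Presumably a full model adds a dynamics or action head on top of the aligned latents; the alignment is what makes the resulting planning semantic rather than pixel-level.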

Paper 18: StarVLA-α
cs.RO cs.AI cs.CV
Authors: Jinhui Ye et al.
Core Contributions
  • Presents StarVLA-α, a minimalist VLA architecture that simplifies prior work without sacrificing performance or generalization
  • Outperforms π₀.₅ by 20% on the RoboChallenge benchmark, suggesting that simplicity and interpretability do not require sacrificing state-of-the-art performance
  • Demonstrates that effective VLA systems can emerge from straightforward design principles rather than complex hierarchical structures
  • Provides strong baseline for future VLA research and industrial deployment where computational efficiency matters
Vision-Language-Action (VLA) models have shown remarkable progress in robot learning, but architectural complexity remains a barrier to understanding and deployment. We introduce StarVLA-α, a simplified VLA architecture that achieves competitive performance while maintaining clarity and computational efficiency. Through systematic ablation studies, we identify core design principles that enable effective vision-language-action alignment. StarVLA-α outperforms larger and more complex baselines on standard benchmarks, suggesting that the field may have overengineered VLA systems.

Paper 20: AIM (Intent-Aware Unified World Action Modeling)
cs.RO cs.LG
Authors: Liaoyuan Fan et al.
Core Contributions
  • Introduces spatial value maps as a unified representation for intent-aware action prediction, bridging value estimation and spatial localization
  • Achieves 94% success rate on RoboTwin 2.0 benchmark, demonstrating practical effectiveness for complex manipulation scenarios
  • Spatial value representation allows model to jointly reason about where to act and what action values are possible at each location
  • Framework generalizes across diverse task types by factoring world dynamics, intent encoding, and spatial reasoning
Understanding task intent and predicting appropriate actions are critical for robots. We propose AIM (Intent-Aware Unified world action Modeling), which uses spatial value maps to jointly model world dynamics and intent-conditioned action prediction. Spatial value maps provide a unified representation where each location in the workspace has an associated value under the current task intent. This allows the model to reason spatially about where and how to act. Experiments on complex manipulation benchmarks show significant improvements in action prediction accuracy and task success.
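
The spatial value map idea reduces action selection to a spatial argmax. A toy sketch follows; the grid size, normalization, and random map are stand-ins (the paper learns this map from observations and task intent).

```python
# Minimal sketch of a spatial value map: a dense grid assigning an
# intent-conditioned value to each workspace location (shapes assumed).
import numpy as np

H, W = 64, 64                        # discretized workspace
value_map = np.random.rand(H, W)     # stand-in for a learned, intent-conditioned map

# Acting reduces to a spatial argmax: pick the highest-value cell,
# then convert the grid index back to workspace coordinates.
idx = np.unravel_index(np.argmax(value_map), value_map.shape)
workspace_xy = (idx[1] / W, idx[0] / H)   # normalized (x, y) target
print(f"act at {workspace_xy}, value={value_map[idx]:.3f}")
```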

Paper 28: LARY (Latent Action Representation Benchmark)
cs.CV cs.RO
Authors: Dujun Nie et al.
Core Contributions
  • Introduces LARY benchmark for evaluating learned latent action representations, a critical component of embodied AI but underexplored empirically
  • Reveals surprising finding that general visual models (vision transformers, foundation models) outperform specialized embodied models on latent action learning
  • Systematic evaluation across multiple domains shows gap between general and specialized architectures, suggesting misdirected research effort
  • Provides standardized evaluation protocol enabling future research to build more effective action representations
Learning effective action representations is crucial for embodied AI, yet evaluation remains fragmented across domains. We introduce LARY, a comprehensive benchmark for latent action representation learning that spans manipulation, navigation, and locomotion. Through systematic evaluation, we compare general-purpose vision models (e.g., ViT, foundation models) against robot-specialized architectures. Surprisingly, general models consistently outperform specialized alternatives, suggesting the embodied AI community may be overspecializing. LARY provides a standardized protocol for future research on action representations.

Robot Learning & Sim-to-Real

Paper 2: RAPO (Robust Adversarial Policy Optimization)
cs.LG cs.RO
Authors: Mintae Kim, Koushil Sreenath
Core Contributions
  • Proposes RAPO (Robust Adversarial Policy Optimization) with dual formulation addressing distributional shift from model mismatch during deployment
  • Combines trajectory-level temperature and model-level Boltzmann reweighting to adaptively focus training on worst-case dynamics shifts
  • Outperforms standard robust RL baselines by maintaining performance under significant model uncertainty without pessimistic value updates
  • Addresses fundamental challenge of sim-to-real transfer: policies must generalize to dynamics not seen during training
Sim-to-real transfer in robotics requires policies robust to model errors and unmodeled dynamics. We introduce RAPO, a robust reinforcement learning approach based on adversarial policy optimization that explicitly handles dynamics uncertainty. RAPO uses a dual formulation with trajectory-level and model-level temperature parameters to balance exploration and robustness. The model-level Boltzmann reweighting focuses learning on challenging dynamics shifts. Empirical results show RAPO significantly outperforms baseline robust RL methods on high-dimensional control tasks with substantial model mismatch.
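
The model-level Boltzmann reweighting can be sketched in a few lines: dynamics models under which the current policy earns low return receive exponentially more training weight. The return inputs and temperature below are illustrative assumptions; the paper pairs this with a trajectory-level temperature as well.

```python
# Hedged sketch of model-level Boltzmann reweighting over sampled dynamics.
import numpy as np

def boltzmann_model_weights(returns_per_model, beta=1.0):
    # Lower return under a model => larger weight on that model.
    neg = -np.asarray(returns_per_model, dtype=float)
    logits = beta * (neg - neg.max())          # subtract max for stability
    w = np.exp(logits)
    return w / w.sum()

returns = [120.0, 45.0, 98.0]   # policy returns under 3 sampled dynamics models
print(boltzmann_model_weights(returns, beta=0.05))  # weight concentrates on model 2
```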

Paper 5: Physics Simulators as Self-Supervision for LLM Physical Reasoning
cs.LG cs.AI cs.CV cs.RO
Authors: Mihir Prabhudesai, Aryan Satpathy, Yangmin Li, Zheyang Qin, Nikash Bhardwaj
Core Contributions
  • Novel approach uses physics simulators as supervision signal for training LLMs on physical reasoning, avoiding need for human-annotated solutions
  • Achieves zero-shot sim-to-real transfer: robots trained in simulation solve novel physics tasks without modification
  • Demonstrates 5-10 percentage point improvement over baseline approaches on International Physics Olympiad problems
  • Bridges gap between abstract physical reasoning and embodied problem-solving by grounding symbolic computation in simulator feedback
Large language models struggle with physics reasoning despite strong performance on text-based tasks. We propose using physics simulators as a source of self-supervision for training LLMs. An RL agent explores physics simulators to find solutions to problem scenarios, generating trajectories that demonstrate physical principles. These trajectories serve as demonstrations for language model fine-tuning. Our approach achieves competitive performance on International Physics Olympiad problems and enables zero-shot transfer to real robotic systems.

Paper 11: PDM (Proprioceptive Distribution Matching)
cs.RO
Authors: Jeremy Dao, Alan Fern
Core Contributions
  • Proposes proprioceptive distribution matching for simulator adaptation—matching only internal state distributions, not sensory observations
  • Achieves sim-to-real transfer without requiring time alignment or external sensing (cameras, IMUs), using only onboard proprioceptive data
  • Requires less than 5 minutes of hardware data for effective adaptation, dramatically reducing real-world sample complexity
  • Fundamental insight: proprioceptive consistency is sufficient for successful transfer; visual appearance and timing need not match
Sim-to-real transfer for legged robots typically requires either extensive real-world data or domain randomization over a large space of unknown parameters. We propose proprioceptive distribution matching (PDM), which adapts the simulator by matching only the distribution of proprioceptive (internal state) observations between sim and real, rather than matching sensory observations. PDM does not require time-synchronized trajectories, external sensors, or explicit parameter identification. We demonstrate successful sim-to-real transfer for quadruped locomotion using less than 5 minutes of hardware data.
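
A minimal sketch of the distribution-matching idea, using an RBF-kernel maximum mean discrepancy (MMD) between sim and real proprioceptive batches. The choice of MMD and how simulator parameters are then updated are our assumptions, not necessarily PDM's objective.

```python
# Distribution matching on proprioceptive states via RBF-kernel MMD (assumed loss).
import torch

def mmd_rbf(x, y, sigma=1.0):
    # x: sim proprioceptive states [n, d]; y: real states [m, d]
    def k(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

sim = torch.randn(256, 12)           # e.g., joint positions + velocities from sim
real = torch.randn(128, 12) + 0.3    # a few minutes of hardware data
loss = mmd_rbf(sim, real)            # minimized w.r.t. simulator parameters
print(loss.item())
```

Note that no time alignment is needed: the loss compares state distributions, not trajectories, which is the core insight behind requiring so little hardware data.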

Paper 19: WM-DAgger
cs.RO
Authors: Anlan Yu et al.
Core Contributions
  • Leverages world models to generate out-of-distribution recovery data, addressing fundamental challenge in imitation learning: recovering from distribution shift
  • Achieves 93.3% task success with only 5 demonstrations, reducing labeling burden compared to standard DAgger which requires expert re-labeling online
  • World models enable efficient synthesis of failure cases without expert interaction, making imitation learning practical for expensive-to-label domains
  • Combines model-based reasoning with behavior cloning, showing complementary benefits of learned and demonstrated trajectories
Imitation learning can solve robotics tasks with few demonstrations, but requires expert feedback when the policy encounters out-of-distribution states—a critical bottleneck for practical deployment. We propose WM-DAgger, which uses learned world models to predict failure modes and generate recovery trajectories, reducing dependence on expert annotations. World model predictions identify states where the current policy will likely fail; we then synthesize demonstrations for these states without requiring expert interaction. Our approach achieves high success rates with minimal demonstrations on complex manipulation tasks.
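
The described loop is easy to sketch with toy stubs (everything below is a placeholder, not the paper's implementation): roll the policy inside the learned world model, flag states with high predicted failure probability, and synthesize recovery data there instead of querying an expert.

```python
# Structural sketch of a WM-DAgger-style round with toy components.
import numpy as np

class ToyWorldModel:
    def step(self, s, a):      return s + 0.1 * a + 0.01 * np.random.randn(*s.shape)
    def failure_prob(self, s): return float(np.linalg.norm(s) > 2.0)  # crude OOD proxy
    def recovery(self, s):     return [(s, -0.5 * s)]  # steer back toward the data

def wm_dagger_round(policy, wm, horizon=50, risk=0.5):
    data, s = [], np.zeros(3)
    for _ in range(horizon):
        a = policy(s)
        s = wm.step(s, a)
        if wm.failure_prob(s) > risk:
            data.extend(wm.recovery(s))   # synthetic recovery pairs, no expert
    return data                            # appended to the behavior-cloning dataset

extra = wm_dagger_round(lambda s: np.random.randn(3), ToyWorldModel())
print(len(extra), "synthesized recovery transitions")
```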

Paper 22: ComSim
cs.RO cs.CV
Authors: Yiran Qin et al.
Core Contributions
  • Proposes hybrid classical+neural simulation for compositional data generation: classical physics for dynamics, neural networks for rendering
  • Directly generates action-video pairs without requiring intermediate 3D supervision, enabling scalable synthetic data production
  • Demonstrates significant reduction in sim-to-real gap compared to standard rendering-only simulators by closing the sensorimotor loop
  • Compositional approach allows reuse of components across different simulation scenarios, enabling efficient scaling to diverse tasks
Generating large-scale robot training data in simulation is essential for learning-based approaches, but sim-to-real transfer remains challenging. We introduce ComSim, a compositional simulation framework that combines classical physics engines with learned neural components to generate realistic action-conditioned videos. The hybrid approach leverages classical simulators for accurate dynamics while using neural networks to model visual complexity. Compositional design allows reusing and combining components across tasks. We demonstrate that ComSim generates data that transfers effectively to real robotic systems with minimal fine-tuning.

Paper 25: ScoRe-Flow
cs.RO
Authors: Xiaotian Qiu et al.
Core Contributions
  • Introduces score-based RL fine-tuning for flow matching policies, enabling distributional control beyond single trajectory optimization
  • Achieves 2.4x faster convergence compared to standard RL fine-tuning by leveraging probabilistic structure of flow models
  • Maintains diversity of solutions while optimizing for task objectives, important for exploration and robustness
  • Bridges generative modeling and reinforcement learning: combines expressiveness of flow models with reward optimization
Flow matching models can generate diverse trajectories, but applying them to reward optimization requires careful design. We propose ScoRe-Flow, which performs score-based RL fine-tuning on flow matching policies to optimize task-specific rewards while maintaining distributional properties. By working in score space, we enable efficient gradient-based optimization of the flow model without compromising generative diversity. Experiments show significant speedup in convergence and improved performance on complex manipulation tasks.
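
As a rough illustration of reward-driven fine-tuning of a flow-matching policy, the sketch below reweights the standard flow-matching regression loss by trajectory reward. This is a generic reward-weighted variant for intuition only; the paper's contribution is performing the fine-tuning in score space, which this sketch does not implement.

```python
# Reward-weighted flow-matching fine-tuning (simplified illustration).
import torch

def weighted_flow_matching_loss(v_net, trajs, rewards, beta=1.0):
    # trajs: [B, D] flattened trajectories sampled from the current flow policy
    B, D = trajs.shape
    w = torch.softmax(beta * rewards, dim=0)   # higher reward => higher weight
    t = torch.rand(B, 1)
    x0 = torch.randn_like(trajs)               # noise endpoints
    xt = (1 - t) * x0 + t * trajs              # linear interpolant
    target = trajs - x0                        # conditional velocity target
    per_sample = ((v_net(xt, t) - target) ** 2).mean(dim=1)
    return (w * per_sample).sum()

v = lambda x, t: torch.zeros_like(x)  # stand-in velocity network
print(weighted_flow_matching_loss(v, torch.randn(8, 16), torch.randn(8)).item())
```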

Manipulation & Grasping

Paper 13: ViserDex
cs.RO cs.CV
Authors: Arjun Bhardwaj et al.
Core Contributions
  • Applies 3D Gaussian Splatting to monocular RGB observations for high-fidelity in-hand object understanding, replacing expensive multi-view setups
  • Domain randomization in Gaussian space enables effective transfer from simulation to real dexterous manipulation without domain adaptation
  • Achieves robust reorientation control using only single onboard camera, demonstrating practical deployment for five-fingered hands
  • Demonstrates that differentiable rendering enables better visual sim-to-real transfer than traditional pixel-space matching
Dexterous in-hand manipulation requires precise control based on visual feedback, but sim-to-real transfer for monocular observations is challenging. We introduce ViserDex, which uses 3D Gaussian Splatting to reconstruct hand-object scenes from single RGB images. Gaussian splatting provides differentiable rendering for domain randomization, enabling robust transfer from simulation. Our approach combines differentiable rendering with reinforcement learning to learn dexterous manipulation policies from image observations. Experiments on real five-fingered hands demonstrate successful reorientation of diverse objects.

Paper 15: Diffusion-Based RL for 3D Bin Packing
cs.RO
Authors: Jie Han et al.
Core Contributions
  • Proposes diffusion-based actor network for bin packing, treating grasp pose generation as iterative refinement problem
  • Leverages generative model's ability to capture multimodal grasp distributions, improving exploration in action space
  • Outperforms standard DRL baselines in number of successfully packed items, showing importance of diverse action sampling
  • Integrates uncertainty quantification through diffusion, enabling risk-aware packing decisions
3D bin packing is a challenging manipulation task requiring both geometric reasoning and sequential decision-making. We propose a diffusion-based reinforcement learning approach where the policy generates grasp poses through iterative refinement. The diffusion model captures the distribution of valid grasp poses, providing diverse action samples for exploration. Combined with RL optimization, this approach achieves better packing efficiency than traditional methods. We demonstrate the approach on both simulated and real robotic systems.

Paper 21: AffordSim
cs.RO cs.AI
Authors: Mingyang Li et al.
Core Contributions
  • Introduces AffordSim for open-vocabulary 3D affordance learning, moving beyond class-specific grasp points to semantic object properties
  • Provides a benchmark of 50 tasks across 7 object categories, revealing significant performance gaps on affordance-demanding tasks and pointing to an important research frontier
  • Scalable simulator-based data generation enables training affordance models on diverse objects without hand-annotation
  • Affordance representations enable zero-shot transfer to novel objects by grounding task requirements in spatial properties
Robotic manipulation requires understanding object affordances—what actions are possible on different parts of objects. We present AffordSim, a data generation framework and benchmark for learning open-vocabulary 3D affordances. Unlike traditional grasp detection, affordances capture semantic properties like "graspable," "screwable," and "deformable." AffordSim provides 50 manipulation tasks across 7 object categories. Our analysis reveals significant performance gaps on affordance-demanding tasks, highlighting an important research direction for embodied AI systems.

Paper 24: CLASP
cs.RO
Authors: Yiran Ling et al.
Core Contributions
  • Proposes dual-pathway perception combining language-guided detection and spatial property reasoning for open-vocabulary grasping
  • Asynchronous closed-loop architecture enables continuous perception refinement as robot executes grasps, recovering from failures
  • Achieves 87% success rate on real robot grasping without requiring task-specific training or object models
  • Demonstrates practical system that generalizes to arbitrary objects specified by natural language description
Grasping novel objects requires flexible perception systems that can understand arbitrary object categories. We propose CLASP, which combines language-guided object detection with spatial reasoning for open-vocabulary grasping. CLASP uses a dual-pathway architecture: one pathway detects objects specified by natural language descriptions, while another pathway reasons about grasp feasibility based on spatial properties. An asynchronous closed-loop controller continuously refines perceptions and adjusts grasp strategies during execution. Experiments show robust grasping of diverse objects without task-specific training.

Planning & Control

Paper 1: Ternary Logic Encodings of Temporal Behavior Trees
cs.RO eess.SY
Authors: Ryan Matheu, John S. Baras, Calin Belta
Core Contributions
  • Reformulates Temporal Behavior Trees using ternary-valued STL (Signal Temporal Logic) to enable formal verification and synthesis
  • Develops mixed-integer linear program encodings that guarantee correct-by-construction control strategies with formal safety guarantees
  • Bridges hierarchical planning (behavior trees) and formal verification (temporal logic), enabling automated synthesis with certification
  • Applicable to systems where correctness is critical: autonomous vehicles, medical robotics, industrial automation
Behavior Trees are widely used for hierarchical control in robotics, but lack formal guarantees. We present a novel approach to encode Temporal Behavior Trees using ternary-valued Signal Temporal Logic (STL), enabling formal verification and automated synthesis. The encoding reformulates tree semantics as ternary logic operations (true, false, unknown) rather than binary. We then develop mixed-integer linear program (MILP) encodings to synthesize control strategies that are guaranteed to satisfy temporal requirements. This approach enables correct-by-construction controllers suitable for safety-critical applications.
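
For intuition, ternary semantics of this kind typically follow (strong) Kleene logic over {⊤, ⊥, U}, where U propagates exactly when the binary value is undetermined. The MILP trick shown below is one common encoding of a three-valued variable with two binaries; the paper's exact construction may differ.

```latex
% Strong Kleene three-valued connectives, using the order \bot < U < \top:
\neg \top = \bot, \quad \neg \bot = \top, \quad \neg U = U,
\qquad
a \wedge b = \min(a, b), \qquad a \vee b = \max(a, b)

% A common MILP encoding of a ternary value v \in \{0, \tfrac{1}{2}, 1\}
% uses two binaries:
v = \tfrac{1}{2}(b_1 + b_2), \qquad b_1 \ge b_2, \qquad b_1, b_2 \in \{0, 1\}
```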

Paper 3: UGE-TO (Uncertainty Guided Exploratory Trajectory Optimization)
cs.RO
Authors: O. Goktug Poyrazoglu, Yukang Cao, Rahul Moorthy, Volkan Isler
Core Contributions
  • Proposes UGE-TO that generates well-separated trajectory samples via uncertainty ellipsoids, improving exploration efficiency in MPC
  • Uses Hellinger distance between trajectory distributions to quantify diversity, ensuring samples cover different regions of solution space
  • Achieves 72.1% faster convergence and 66% faster execution time with 6.7% higher success rate on complex planning tasks
  • Uncertainty-guided approach is more sample-efficient than uniform sampling or diversity-only methods
Sampling-based Model Predictive Control requires generating diverse trajectory candidates for optimization. Standard approaches sample uniformly or use prior information, but can miss promising regions. We propose Uncertainty Guided Exploratory Trajectory Optimization (UGE-TO), which generates trajectory samples by considering both model uncertainty and trajectory diversity. Uncertainty ellipsoids from the dynamics model guide sampling toward informative regions, while Hellinger distance ensures samples remain well-separated. This approach significantly accelerates convergence and improves success rates on complex manipulation and navigation tasks.
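
The Hellinger distance referenced above is bounded in [0, 1] and has a closed form for Gaussian distributions, which is what makes it cheap to use as a separation score during sampling. The Gaussian trajectory-distribution assumption below is ours for illustration.

```latex
% Squared Hellinger distance, and its closed form for Gaussians (standard result):
H^2(P, Q) = 1 - \int \sqrt{p(x)\,q(x)}\,dx,
\qquad
H^2\!\big(\mathcal{N}(\mu_1,\Sigma_1), \mathcal{N}(\mu_2,\Sigma_2)\big)
= 1 - \frac{\det(\Sigma_1)^{1/4}\,\det(\Sigma_2)^{1/4}}
           {\det\!\big(\tfrac{\Sigma_1+\Sigma_2}{2}\big)^{1/2}}
  \exp\!\Big(-\tfrac{1}{8}\,(\mu_1-\mu_2)^{\top}
  \big(\tfrac{\Sigma_1+\Sigma_2}{2}\big)^{-1}(\mu_1-\mu_2)\Big)
```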

Paper 23: Marble (Complementarity-by-Construction Solver)
cs.RO
Authors: Arun L. Bishop et al.
Core Contributions
  • Develops Lie-group-based framework for solving quadratic programs with complementarity constraints, fundamental for contact-rich manipulation
  • Guarantees complementarity satisfaction by construction rather than post-hoc checking, ensuring physically consistent solutions
  • Open-source Marble solver in C++ enables practical adoption in trajectory optimization and control pipelines
  • Unified mathematical framework handles diverse contact scenarios: friction, contact breaks, and inequality constraints
Contact-rich manipulation requires solving optimization problems with complementarity constraints that ensure physical consistency. We present a novel approach using Lie group geometry to structure these optimization problems. By leveraging the group structure, we develop algorithms that satisfy complementarity constraints by construction, avoiding the numerical issues of penalty methods. We provide an open-source solver (Marble) in C++ for practical trajectory optimization and control synthesis in contact-rich scenarios. The approach handles friction cones, contact breaks, and diverse constraint types.
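
For readers new to complementarity: the constraints in question say that contact separation and normal force can never be simultaneously positive. A solver that satisfies these "by construction" never visits iterates that violate them, unlike penalty methods.

```latex
% Contact complementarity for a gap function \phi(q) and normal force \lambda_n:
0 \le \phi(q) \;\perp\; \lambda_n \ge 0
\quad\Longleftrightarrow\quad
\phi(q) \ge 0, \quad \lambda_n \ge 0, \quad \phi(q)\,\lambda_n = 0
```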

Paper 26: Gait Optimization for a Compliant Worm Robot
cs.RO eess.SY
Authors: Xinyu Zhou et al.
Core Contributions
  • Develops hybrid dynamic model for compliant worm robot navigating constrained environments (corrugated pipes, narrow spaces)
  • Multi-objective gait optimization balances speed, stability, and energy efficiency, addressing trade-offs in novel locomotion
  • Compliant mechanisms enable adaptability to irregular terrain, important for inspection and search-and-rescue applications
  • Physics-based optimization provides interpretable gaits suitable for real-world deployment
Worm-like robots offer unique capabilities for navigation in confined spaces, but their compliant dynamics are complex. We develop a hybrid dynamic model capturing both rigid-body and elastic deformation behaviors. Using this model, we formulate multi-objective gait optimization to find movement patterns that are fast, stable, and energy-efficient. Experiments with a soft-robotic worm prototype demonstrate that optimized gaits significantly outperform baseline approaches in navigating pipes and irregular terrain. This work provides tools for designing effective locomotion strategies for compliant robots.

Paper 27: Multi-ORFT
cs.RO cs.AI
Authors: Haojie Bai et al.
Core Contributions
  • Combines scene-conditioned diffusion models with multi-agent reinforcement learning for cooperative autonomous driving
  • VG-GRPO post-training provides stable fine-tuning without mode collapse, crucial for diverse behavior in traffic scenarios
  • Reduces collision rate from 2.04% to 1.89% through multi-agent coordination, demonstrating safety improvements from cooperation
  • Addresses challenge of maintaining diversity in generative models while optimizing for safety objectives
Autonomous driving requires generating diverse behaviors while maintaining safety guarantees. We propose Multi-ORFT, which applies online reinforcement fine-tuning to multi-agent diffusion planning. A scene-conditioned diffusion model generates diverse trajectories, which are then refined using group reward policy optimization (VG-GRPO). This approach maintains the diversity of diffusion models while optimizing for collision-free trajectories. We demonstrate improvements in both safety and realism on complex driving scenarios with multiple vehicles.

Navigation & Perception

Paper 6: EagleVision
cs.RO cs.CV
Authors: Zakhar Yagudin et al.
Core Contributions
  • Introduces LiDAR-based benchmark for 3D object detection and trajectory prediction in high-speed racing context
  • Emphasizes cross-domain transfer: models trained on one racing venue must generalize to different tracks, vehicle types, and weather
  • Simultaneously addresses perception tasks (detection) and prediction (future trajectory), critical for real-time decision-making
  • High-speed racing provides particularly challenging testbed with brief decision windows and safety-critical consequences
High-speed autonomous racing demands robust perception under challenging conditions: limited reaction time, dynamic environments, and safety criticality. We present EagleVision, a multi-task benchmark combining 3D object detection and trajectory prediction using LiDAR data. The benchmark emphasizes cross-domain generalization—models must transfer across different racing venues, weather conditions, and vehicle configurations. We analyze domain gaps and propose methods for cross-domain adaptation. EagleVision provides the first large-scale benchmark for perception in high-speed autonomous racing.

Paper 16: 3DRO (3D Radar Odometry)
cs.RO
Authors: Cedric Le Gentil et al.
Core Contributions
  • Extends Direct Radar Odometry to full SE(3) estimation by integrating gyroscope information for rotation estimation
  • Achieves LiDAR-level accuracy using only 2D imaging radar and gyroscope—much cheaper sensors than 3D LiDAR
  • Validated on 643km Boreas-RT dataset with reliable performance across weather conditions (rain, snow, sun glare)
  • Radar odometry provides weather robustness advantage over LiDAR, important for autonomous vehicles in adverse conditions
Radar offers a cost-effective alternative to LiDAR for autonomous vehicles, with superior performance in adverse weather. Previous radar odometry methods produced 2D estimates; full 3D localization required additional sensors. We present 3DRO, which combines 2D radar imaging with inertial measurement (gyroscope) to estimate full SE(3) poses. Our direct optimization approach achieves accuracy comparable to LiDAR-based methods while maintaining radar's weather robustness. Validation on the Boreas-RT dataset (643km) demonstrates reliable performance across rain, snow, and changing lighting.

Paper 29: ReefMapGS
cs.RO cs.CV
Authors: Daniel Yang et al.
Core Contributions
  • Combines multimodal SLAM (fusion of sonar, visual, and inertial data) with incremental 3D Gaussian Splatting for large-scale underwater reconstruction
  • COLMAP-free approach enables real-time processing without offline structure-from-motion, critical for long-term deployments
  • Validated on 700m AUV trajectories in challenging underwater environments with turbidity and dynamic lighting
  • Enables high-fidelity 3D maps for underwater robotics applications: coral monitoring, archaeological surveying, infrastructure inspection
Large-scale underwater mapping with autonomous vehicles requires robust SLAM and efficient 3D reconstruction. We present ReefMapGS, which integrates multimodal SLAM (sonar, camera, inertial) with incremental 3D Gaussian Splatting. Unlike prior methods requiring offline structure-from-motion (e.g., COLMAP), ReefMapGS processes data in real-time, enabling long-term deployments. Gaussian splatting provides both efficient rendering and compact memory footprint suitable for onboard computation. Experiments on 700-meter AUV trajectories in coral reef environments demonstrate successful mapping despite challenging underwater conditions.

Human-Robot Interaction & Embodied Cognition

Paper 4: Safe Human-to-Humanoid Motion Imitation
cs.RO eess.SY
Authors: Wenqi Cai, John Abanes, Nikolaos Evangeliou, Anthony Tzes
Core Contributions
  • Proposes vision-based framework combining human pose estimation with control barrier functions for safety-aware imitation
  • CBF-QP layer enforces collision avoidance constraints during policy execution, ensuring humanoid remains safe even if human demonstrates collision
  • Bridges perception (human pose from vision) and control (trajectory constraints), enabling robust imitation without explicit dynamics models
  • Particularly important for humanoid robots working near humans, where safety constraints cannot be compromised
Teaching humanoid robots through human demonstration requires ensuring the robot remains safe even when imitating potentially unsafe human motions. We propose a vision-based imitation learning framework that combines human pose tracking with control barrier functions. A CBF-QP controller filters the imitation policy outputs to enforce collision avoidance constraints in real-time. This approach enables humanoids to learn complex motions while maintaining safety guarantees. We demonstrate the method on manipulation tasks involving human-robot interaction in shared workspaces.
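
The CBF-QP filter has a standard shape: stay as close as possible to the imitation command while enforcing the barrier condition dh/dt ≥ -α·h. Below is a minimal single-barrier sketch on toy single-integrator dynamics; the humanoid version is higher-dimensional, and the cvxpy formulation here is our illustration, not the paper's code.

```python
# Minimal CBF-QP safety filter: minimally modify the imitation command u_imit
# so the barrier h(x) >= 0 is preserved (single barrier, toy dynamics).
import cvxpy as cp
import numpy as np

def cbf_qp_filter(u_imit, h, grad_h, f, g, alpha=1.0):
    # min ||u - u_imit||^2  s.t.  grad_h @ (f + g u) >= -alpha * h
    u = cp.Variable(u_imit.shape[0])
    constraints = [grad_h @ (f + g @ u) >= -alpha * h]
    cp.Problem(cp.Minimize(cp.sum_squares(u - u_imit)), constraints).solve()
    return u.value

# Toy 2D single integrator keeping h(x) = ||x||^2 - 1 >= 0 (stay outside unit disk)
x = np.array([1.5, 0.0])
u_safe = cbf_qp_filter(u_imit=np.array([-2.0, 0.0]),   # human motion pushes inward
                       h=x @ x - 1.0, grad_h=2 * x,
                       f=np.zeros(2), g=np.eye(2))
print(u_safe)   # inward velocity is clipped to what the barrier permits
```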

Paper 8: Dyadic Partnership in Medical Robotics
cs.RO
Authors: Nassir Navab, Zhongliang Jiang
Core Contributions
  • Proposes Dyadic Partnership model treating robots and clinicians as equal collaborative agents with complementary capabilities
  • AI-driven collaboration enables seamless task switching and shared decision-making rather than rigid automation or tele-operation
  • Particularly important for medical applications where human expertise and judgment remain critical despite automation potential
  • Framework addresses fundamental challenge: many medical tasks are inherently collaborative and cannot be fully automated
Medical robotics has pursued two paths: full automation or tele-operation. We propose a third paradigm: Dyadic Partnership, where robots and clinicians collaborate as equal partners. Rather than autonomous systems making independent decisions, dyadic systems enable continuous human-AI collaboration with seamless role switching. Each agent—robot and clinician—contributes complementary expertise: robots provide precision, consistency, and access to imaging data; humans provide judgment, adaptability, and responsibility. We present framework design principles and applications to surgical assistance.

Paper 9: Minimal Embodiment Enables Efficient Learning
cs.RO cs.AI
Authors: Zhegong Shangguan, Alessandro Di Nuovo, Angelo Cangelosi
Core Contributions
  • Demonstrates that minimal physical embodiment (robot arm with gripper) enables rapid learning of number concepts—core abstract reasoning
  • Achieves 96.8% accuracy with only 10% of typical supervised training data, showing embodiment dramatically improves sample efficiency
  • Robot develops biologically-plausible number representations (approximate number sense) through interaction with objects
  • Findings suggest embodied constraints actually facilitate abstract reasoning, challenging view that embodiment is limitation
How do abstract concepts like "number" emerge from physical interaction with the world? We investigate how embodied robots learn counting and cardinality through manipulation. A Franka Panda robot interacts with objects, counting groups and comparing quantities. We find that minimal embodiment—a simple gripper—enables efficient learning of numerical concepts with small amounts of training data. The robot develops approximate number sense representations similar to biological systems. Our results suggest embodied learning is not a limitation but actually accelerates acquisition of abstract reasoning.

Paper 10: Lightweight Transformer for Iconic Gesture Prediction
cs.RO cs.AI
Authors: Edwin C. Montiel-Vazquez et al.
Core Contributions
  • Proposes lightweight transformer for predicting iconic gestures (robot hand shapes) conditioned on text and emotion
  • Outperforms GPT-4o on gesture prediction task, showing specialized models can exceed general-purpose LLMs on embodied tasks
  • Emotion conditioning enables natural co-speech behavior where gesture expressiveness matches conversational tone
  • Computational efficiency critical for real-time robot interaction, distinguishing approach from heavy foundation models
Natural human-robot interaction requires robots to produce appropriate gestures coordinated with speech and emotion. We propose a lightweight transformer architecture for predicting iconic gestures (hand shapes indicating objects or actions) from text and emotion labels. Despite its simplicity, our model outperforms GPT-4o on this task, suggesting domain-specialized models remain valuable for embodied AI. The efficient design enables real-time inference on robot platforms. We demonstrate gesture prediction for a humanoid robot engaged in multimodal communication.

Paper 14: H²-EMV (Hierarchical Episodic Memory)
cs.RO cs.AI
Authors: Leonard Bärmann et al.
Core Contributions
  • Introduces H²-EMV: hierarchical episodic memory with selective forgetting for long-term robot operation
  • Achieves 45% reduction in memory footprint and 35% reduction in computational cost while improving accuracy by 70% on repeated queries
  • Demonstrates that forgetting policies (consolidating old experiences, removing redundant memories) are as important as encoding for lifelong learning
  • Hierarchical structure enables efficient organization of experiences at multiple timescales
Long-term robot deployment accumulates experience, but storing all memories becomes computationally intractable. We introduce H²-EMV (Hierarchical Episodic Memory with selective forgetting), which intelligently consolidates and prunes memories to maintain performance while reducing storage and computation. The approach uses hierarchical organization at multiple timescales (recent, intermediate, long-term) with selective consolidation of valuable memories. Experiments show that strategic forgetting—removing redundant or low-value experiences—improves both efficiency and accuracy on downstream tasks. This paradigm is essential for embodied agents operating for months or years.
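
A toy sketch of selective forgetting: each episode carries a utility score mixing recency and access frequency, and consolidation keeps only the most useful entries. The scoring rule, weights, and keep ratio below are illustrative assumptions, not H²-EMV's actual policy, which additionally organizes memories hierarchically across timescales.

```python
# Toy episodic memory with utility-scored pruning ("selective forgetting").
import time

class EpisodicMemory:
    def __init__(self, capacity=1000):
        self.items, self.capacity = [], capacity

    def add(self, episode):
        self.items.append({"ep": episode, "t": time.time(), "hits": 0})
        if len(self.items) > self.capacity:
            self.consolidate()

    def recall(self, idx):
        # A real system would do similarity search; here we just touch an item.
        self.items[idx]["hits"] += 1
        return self.items[idx]["ep"]

    def utility(self, item, now):
        recency = 1.0 / (1.0 + now - item["t"])
        return 0.5 * recency + 0.5 * item["hits"]

    def consolidate(self):
        # Keep the most useful half; forgetting frees memory and speeds queries.
        now = time.time()
        self.items.sort(key=lambda it: self.utility(it, now), reverse=True)
        self.items = self.items[: self.capacity // 2]
```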

Paper 30: M2HRI
cs.RO
Authors: Shaid Hasan et al.
Core Contributions
  • Proposes multi-robot system where each agent has distinct LLM personality and persistent long-term memory of user interactions
  • Framework enables personalized experience: robots remember user preferences, adapt communication style, and coordinate behavior
  • User study with n=105 participants validates that multi-robot teams with personality and memory create more engaging interactions
  • Demonstrates practical scaling of human-robot interaction concepts to multiple robots with coordinated personalization
Personalized human-robot interaction requires robots to understand individual user preferences and maintain consistent long-term relationships. We propose M2HRI, a multimodal multi-agent framework where each robot has distinct personality (driven by LLMs) and persistent memory of user interactions. Robots coordinate their behaviors and share personalization information. A user study with 105 participants demonstrates that personality and persistent memory significantly enhance engagement and user satisfaction. The framework enables robot teams to provide personalized, consistent experiences across extended interactions.

Multi-Robot Systems

Paper 7: Viscoelastic Passive-Dynamic Walker
cs.RO
Authors: Fumihiko Asano et al.
Core Contributions
  • Introduces novel passive-dynamic walker design combining cross-shaped frames with viscoelastic elements
  • Viscoelasticity enables rhythmic oscillation and energy return during walking, improving efficiency without active control
  • Demonstrates stable walking gaits emerge from interaction between mechanics (frame geometry) and materials (elasticity)
  • Simplified design suggests biologically-inspired passive approaches may achieve locomotion more efficiently than fully actuated systems
Passive dynamic walking—where robots walk downhill using gravity without active control—is a proven path to efficient locomotion. We extend this paradigm by introducing viscoelastic elements into rimless wheel designs. Cross-shaped frames combined with strategically-placed springs enable stable gaits with minimal actuation. The viscoelasticity stores and releases energy during walking, improving efficiency. We provide mathematical analysis of these systems and experimentally validate walking stability. This work suggests that clever mechanical design can achieve efficient locomotion without heavy computation or actuation.
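
For context, the classic rimless-wheel model that this design builds on alternates a pendulum-like stance phase with an energy-dissipating impact at each spoke touchdown; the viscoelastic elements modify these dynamics to recover some of that energy. The baseline equations below are the standard textbook model, not the paper's extended one.

```latex
% Rimless wheel: inverted-pendulum stance dynamics and the impact map at
% each spoke touchdown (2\alpha is the inter-spoke angle, \ell the leg length):
\ddot{\theta} = \frac{g}{\ell}\sin\theta,
\qquad
\dot{\theta}^{+} = \cos(2\alpha)\,\dot{\theta}^{-}
```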

Paper 17: IBR (Dynamic Multi-Robot Task Allocation)
cs.RO eess.SY cs.GT
Authors: Maria G. Mendoza et al.
Core Contributions
  • Proposes IBR (Iterative Best Response) decentralized policy for multi-robot task allocation without centralized coordination
  • Handles realistic constraints: communication delays, partial task information, dynamic environment changes
  • Demonstrates scalability to large systems: successfully allocates tasks for 100+ drones in package delivery scenarios
  • Game-theoretic approach provides convergence guarantees despite communication limitations
Multi-robot systems require efficient task allocation, especially when communication bandwidth is limited and task information is uncertain. We propose IBR, an iterative best response algorithm where each robot independently optimizes its task assignments based on local information. Through game-theoretic analysis, we prove convergence even with communication constraints. The approach handles dynamic environments where new tasks arrive continuously. Experiments with up to 100 drones on package delivery tasks demonstrate scalability and efficiency compared to centralized allocation methods.
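
A compact sketch of iterative best response for task allocation: robots take turns re-picking the task that minimizes their own cost given everyone else's current picks, until no one wants to switch. The congestion penalty and one-task-per-robot setup are illustrative assumptions; the paper additionally handles communication delays and dynamically arriving tasks.

```python
# Iterative best response for decentralized task allocation (toy version).
import numpy as np

def ibr_allocate(cost, rounds=20):
    # cost[i, j]: cost for robot i to perform task j; one task per robot here
    n_robots, n_tasks = cost.shape
    assign = np.random.randint(n_tasks, size=n_robots)
    for _ in range(rounds):
        changed = False
        for i in range(n_robots):
            taken = np.bincount(np.delete(assign, i), minlength=n_tasks)
            # Best response using only local info plus a congestion penalty
            best = int(np.argmin(cost[i] + 10.0 * taken))
            if best != assign[i]:
                assign[i], changed = best, True
        if not changed:
            break   # no robot wants to deviate: an equilibrium allocation
    return assign

print(ibr_allocate(np.random.rand(5, 8)))
```

With a congestion-style cost like this, the game is a potential game, which is the usual route to the convergence guarantees the paper's analysis provides under weaker communication assumptions.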