🤖 Robotics arXiv Digest

📅 April 13, 2026 📄 30 Papers 🗂️ 7 Research Areas ✨ Generated by Claude

Research Landscape

The robotics field is experiencing a remarkable convergence of three major paradigms: vision-language-action (VLA) foundation models enabling semantic reasoning, learning-based control approaches replacing hand-crafted policies, and embodied AI systems that ground abstract reasoning in physical interaction. Papers like StarVLA-α and Grounded World Model exemplify how large-scale vision-language pretraining is moving from pure perception toward actionable planning, while simultaneous advances in robot learning through simulation (from UGE-TO's uncertainty-guided trajectories to ComSim's compositional simulation) are systematically closing the sim-to-real gap. Manipulation tasks benefit from this alignment: ViserDex combines differentiable rendering with reinforcement learning for dexterous in-hand tasks, while AffordSim builds the open-vocabulary affordance benchmark needed for zero-shot generalization. The field shows particular maturity in hybrid classical-learning approaches (complementarity-by-construction solvers, neural+classical simulation) where interpretability and performance are no longer in tension.

Cross-cutting innovations suggest robotics is entering a phase of practical autonomy at scale. Multi-robot coordination papers (Dynamic Multi-Robot Task Allocation, Multi-ORFT) now handle the realistic constraints (communication delays, uncertainty, cooperative objectives) that determine real-world deployment viability. Human-robot interaction is rapidly shifting from scripted gestures toward intent-aware collaboration: Safe Human-to-Humanoid Motion Imitation uses control barrier functions to fuse vision-based human understanding with safety guarantees, while M2HRI validates personality-driven multi-agent interaction with persistent memory in a 105-participant user study. On the embodied cognition side, papers like Minimal Embodiment Enables Efficient Learning show that robots can develop compact, biologically plausible number representations from minimal interaction, a finding with implications for how embodiment constraints shape learning. Meanwhile, specialized domains (medical robotics, underwater reconstruction, racing perception) are applying these foundation-level innovations, with ReefMapGS closing the loop between SLAM and Gaussian splatting for large-scale underwater exploration and EagleVision establishing cross-domain benchmarks for perception in high-speed contexts.

A unifying theme is the shift toward systems that **learn to forget** and **adapt incrementally**. H²-EMV's hierarchical episodic memory with selective forgetting achieves 45% memory reduction while improving query accuracy, suggesting that scaling embodied AI demands new data management paradigms beyond standard replay buffers. Similarly, WM-DAgger uses world models as priors for efficient imitation learning, while RAPO tackles fundamental distributional shift under dynamics uncertainty via Boltzmann reweighting. Temporal reasoning and formal verification (Ternary Logic Encodings of Temporal BTs) are gaining prominence as systems become safety-critical. The papers collectively point toward a near-term future where robotics applications will be constrained not by algorithmic capability but by data efficiency, sim-to-real generalization, and the ability to ground abstract reasoning in heterogeneous sensor modalities and embodied constraints.

VLA & Foundation Models

Language-aligned vision-action systems

4 papers

Robot Learning & Sim-to-Real

Bridging simulation and physical deployment

6 papers

Manipulation & Grasping

Dexterous control and affordance learning

4 papers

Planning & Control

Motion synthesis and optimization methods

5 papers

Navigation & Perception

SLAM, odometry, and environmental understanding

3 papers

Human-Robot Interaction & Embodied Cognition

Collaboration, gesture, and learning through embodiment

6 papers

Multi-Robot Systems

Coordination and decentralized deployment

2 papers

VLA & Foundation Models

Paper 12: Grounded World Model (GWM)
cs.RO cs.AI
Authors: Quanyi Li et al.
Core Contributions
  • Proposes GWM operating in vision-language-aligned latent space, enabling semantic understanding of scene objectives without task-specific fine-tuning
  • Achieves 87% success on WISER benchmark vs. 22% for traditional vision-language-action baselines, demonstrating 4x improvement in semantic generalization
  • Novel approach bridges low-level visual features and high-level semantic concepts through joint embedding space
  • Enables zero-shot planning for unseen task combinations by leveraging grounded semantic relationships
World models learn latent representations of environment dynamics that support planning. However, they typically learn low-level spatiotemporal patterns and struggle with high-level semantic understanding. We propose Grounded World Model (GWM), which grounds world models in vision-language representations, enabling semantic understanding of scene properties and task requirements. GWM learns a unified latent space that aligns visual features, language embeddings, and action representations. This enables semantically generalizable planning across diverse manipulation tasks. We evaluate GWM on the WISER benchmark with unseen task combinations and demonstrate significant improvements in semantic generalization and robustness.
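
To make the joint embedding idea concrete, here is a minimal sketch of projecting vision, language, and action features into one shared latent space and aligning matched pairs contrastively. All module names, dimensions, and the InfoNCE loss are illustrative assumptions, not GWM's actual architecture.

```python
# Sketch of a joint vision-language-action latent space (sizes assumed).
import torch
import torch.nn.functional as F

class JointLatentSpace(torch.nn.Module):
    def __init__(self, vis_dim=512, lang_dim=768, act_dim=32, latent_dim=256):
        super().__init__()
        self.vis_proj = torch.nn.Linear(vis_dim, latent_dim)
        self.lang_proj = torch.nn.Linear(lang_dim, latent_dim)
        self.act_proj = torch.nn.Linear(act_dim, latent_dim)

    def forward(self, vis, lang, act):
        # Project each modality into the shared space and L2-normalize.
        z_v = F.normalize(self.vis_proj(vis), dim=-1)
        z_l = F.normalize(self.lang_proj(lang), dim=-1)
        z_a = F.normalize(self.act_proj(act), dim=-1)
        return z_v, z_l, z_a

def infonce(z_a, z_b, temperature=0.07):
    # Symmetric InfoNCE: matched pairs in the batch attract, mismatched repel.
    logits = z_a @ z_b.T / temperature
    labels = torch.arange(z_a.shape[0])
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
```

Presumably a full model adds a dynamics or action head on top of the aligned latents; the alignment is what makes the resulting planning semantic rather than pixel-level.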

Paper 18: StarVLA-α
cs.RO cs.AI cs.CV
Authors: Jinhui Ye et al.
Core Contributions
  • Presents StarVLA-α, a minimalist VLA architecture that simplifies prior work without sacrificing performance or generalization
  • Outperforms π₀.₅ by 20% on the RoboChallenge benchmark, suggesting that simplicity and interpretability do not require sacrificing state-of-the-art performance
  • Demonstrates that effective VLA systems can emerge from straightforward design principles rather than complex hierarchical structures
  • Provides strong baseline for future VLA research and industrial deployment where computational efficiency matters
Vision-Language-Action (VLA) models have shown remarkable progress in robot learning, but architectural complexity remains a barrier to understanding and deployment. We introduce StarVLA-α, a simplified VLA architecture that achieves competitive performance while maintaining clarity and computational efficiency. Through systematic ablation studies, we identify core design principles that enable effective vision-language-action alignment. StarVLA-α outperforms larger and more complex baselines on standard benchmarks, suggesting that the field may have overengineered VLA systems.

Paper 20: AIM (Intent-Aware Unified World Action Modeling)
cs.RO cs.LG
Authors: Liaoyuan Fan et al.
Core Contributions
  • Introduces spatial value maps as a unified representation for intent-aware action prediction, bridging value estimation and spatial localization
  • Achieves 94% success rate on RoboTwin 2.0 benchmark, demonstrating practical effectiveness for complex manipulation scenarios
  • Spatial value representation allows model to jointly reason about where to act and what action values are possible at each location
  • Framework generalizes across diverse task types by factoring world dynamics, intent encoding, and spatial reasoning
Understanding task intent and predicting appropriate actions are critical for robots. We propose AIM (Intent-Aware Unified world action Modeling), which uses spatial value maps to jointly model world dynamics and intent-conditioned action prediction. Spatial value maps provide a unified representation where each location in the workspace has an associated value under the current task intent. This allows the model to reason spatially about where and how to act. Experiments on complex manipulation benchmarks show significant improvements in action prediction accuracy and task success.
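
The spatial value map idea reduces action selection to a spatial argmax. A toy sketch follows; the grid size, normalization, and random map are stand-ins (the paper learns this map from observations and task intent).

```python
# Minimal sketch of a spatial value map: a dense grid assigning an
# intent-conditioned value to each workspace location (shapes assumed).
import numpy as np

H, W = 64, 64                        # discretized workspace
value_map = np.random.rand(H, W)     # stand-in for a learned, intent-conditioned map

# Acting reduces to a spatial argmax: pick the highest-value cell,
# then convert the grid index back to workspace coordinates.
idx = np.unravel_index(np.argmax(value_map), value_map.shape)
workspace_xy = (idx[1] / W, idx[0] / H)   # normalized (x, y) target
print(f"act at {workspace_xy}, value={value_map[idx]:.3f}")
```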

Paper 28: LARY (Latent Action Representation Benchmark)
cs.CV cs.RO
Authors: Dujun Nie et al.
Core Contributions
  • Introduces LARY benchmark for evaluating learned latent action representations, a critical component of embodied AI but underexplored empirically
  • Reveals surprising finding that general visual models (vision transformers, foundation models) outperform specialized embodied models on latent action learning
  • Systematic evaluation across multiple domains shows gap between general and specialized architectures, suggesting misdirected research effort
  • Provides standardized evaluation protocol enabling future research to build more effective action representations
Learning effective action representations is crucial for embodied AI, yet evaluation remains fragmented across domains. We introduce LARY, a comprehensive benchmark for latent action representation learning that spans manipulation, navigation, and locomotion. Through systematic evaluation, we compare general-purpose vision models (e.g., ViT, foundation models) against robot-specialized architectures. Surprisingly, general models consistently outperform specialized alternatives, suggesting the embodied AI community may be overspecializing. LARY provides a standardized protocol for future research on action representations.

Robot Learning & Sim-to-Real

Paper 2: RAPO (Robust Adversarial Policy Optimization)
cs.LG cs.RO
Authors: Mintae Kim, Koushil Sreenath
Core Contributions
  • Proposes RAPO (Robust Adversarial Policy Optimization) with dual formulation addressing distributional shift from model mismatch during deployment
  • Combines trajectory-level temperature and model-level Boltzmann reweighting to adaptively focus training on worst-case dynamics shifts
  • Outperforms standard robust RL baselines by maintaining performance under significant model uncertainty without pessimistic value updates
  • Addresses fundamental challenge of sim-to-real transfer: policies must generalize to dynamics not seen during training
Sim-to-real transfer in robotics requires policies robust to model errors and unmodeled dynamics. We introduce RAPO, a robust reinforcement learning approach based on adversarial policy optimization that explicitly handles dynamics uncertainty. RAPO uses a dual formulation with trajectory-level and model-level temperature parameters to balance exploration and robustness. The model-level Boltzmann reweighting focuses learning on challenging dynamics shifts. Empirical results show RAPO significantly outperforms baseline robust RL methods on high-dimensional control tasks with substantial model mismatch.
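
The model-level Boltzmann reweighting can be sketched in a few lines: dynamics models under which the current policy earns low return receive exponentially more training weight. The return inputs and temperature below are illustrative assumptions; the paper pairs this with a trajectory-level temperature as well.

```python
# Hedged sketch of model-level Boltzmann reweighting over sampled dynamics.
import numpy as np

def boltzmann_model_weights(returns_per_model, beta=1.0):
    # Lower return under a model => larger weight on that model.
    neg = -np.asarray(returns_per_model, dtype=float)
    logits = beta * (neg - neg.max())          # subtract max for stability
    w = np.exp(logits)
    return w / w.sum()

returns = [120.0, 45.0, 98.0]   # policy returns under 3 sampled dynamics models
print(boltzmann_model_weights(returns, beta=0.05))  # weight concentrates on model 2
```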

Paper 5: Physics Simulators as Self-Supervision for LLM Physical Reasoning
cs.LG cs.AI cs.CV cs.RO
Authors: Mihir Prabhudesai, Aryan Satpathy, Yangmin Li, Zheyang Qin, Nikash Bhardwaj
Core Contributions
  • Novel approach uses physics simulators as supervision signal for training LLMs on physical reasoning, avoiding need for human-annotated solutions
  • Achieves zero-shot sim-to-real transfer: robots trained in simulation solve novel physics tasks without modification
  • Demonstrates 5-10 percentage point improvement over baseline approaches on International Physics Olympiad problems
  • Bridges gap between abstract physical reasoning and embodied problem-solving by grounding symbolic computation in simulator feedback
Large language models struggle with physics reasoning despite strong performance on text-based tasks. We propose using physics simulators as a source of self-supervision for training LLMs. An RL agent explores physics simulators to find solutions to problem scenarios, generating trajectories that demonstrate physical principles. These trajectories serve as demonstrations for language model fine-tuning. Our approach achieves competitive performance on International Physics Olympiad problems and enables zero-shot transfer to real robotic systems.

Paper 11: PDM (Proprioceptive Distribution Matching)
cs.RO
Authors: Jeremy Dao, Alan Fern
Core Contributions
  • Proposes proprioceptive distribution matching for simulator adaptation—matching only internal state distributions, not sensory observations
  • Achieves sim-to-real transfer without requiring time alignment or external sensing (cameras, IMUs), using only onboard proprioceptive data
  • Requires less than 5 minutes of hardware data for effective adaptation, dramatically reducing real-world sample complexity
  • Fundamental insight: proprioceptive consistency is sufficient for successful transfer; visual appearance and timing need not match
Sim-to-real transfer for legged robots typically requires either extensive real-world data or domain randomization over a large space of unknown parameters. We propose proprioceptive distribution matching (PDM), which adapts the simulator by matching only the distribution of proprioceptive (internal state) observations between sim and real, rather than matching sensory observations. PDM does not require time-synchronized trajectories, external sensors, or explicit parameter identification. We demonstrate successful sim-to-real transfer for quadruped locomotion using less than 5 minutes of hardware data.
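
A minimal sketch of the distribution-matching idea, using an RBF-kernel maximum mean discrepancy (MMD) between sim and real proprioceptive batches. The choice of MMD and how simulator parameters are then updated are our assumptions, not necessarily PDM's objective.

```python
# Distribution matching on proprioceptive states via RBF-kernel MMD (assumed loss).
import torch

def mmd_rbf(x, y, sigma=1.0):
    # x: sim proprioceptive states [n, d]; y: real states [m, d]
    def k(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

sim = torch.randn(256, 12)           # e.g., joint positions + velocities from sim
real = torch.randn(128, 12) + 0.3    # a few minutes of hardware data
loss = mmd_rbf(sim, real)            # minimized w.r.t. simulator parameters
print(loss.item())
```

Note that no time alignment is needed: the loss compares state distributions, not trajectories, which is the core insight behind requiring so little hardware data.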

Paper 19: WM-DAgger
cs.RO
Authors: Anlan Yu et al.
Core Contributions
  • Leverages world models to generate out-of-distribution recovery data, addressing fundamental challenge in imitation learning: recovering from distribution shift
  • Achieves 93.3% task success with only 5 demonstrations, reducing labeling burden compared to standard DAgger which requires expert re-labeling online
  • World models enable efficient synthesis of failure cases without expert interaction, making imitation learning practical for expensive-to-label domains
  • Combines model-based reasoning with behavior cloning, showing complementary benefits of learned and demonstrated trajectories
Imitation learning can solve robotics tasks with few demonstrations, but requires expert feedback when the policy encounters out-of-distribution states—a critical bottleneck for practical deployment. We propose WM-DAgger, which uses learned world models to predict failure modes and generate recovery trajectories, reducing dependence on expert annotations. World model predictions identify states where the current policy will likely fail; we then synthesize demonstrations for these states without requiring expert interaction. Our approach achieves high success rates with minimal demonstrations on complex manipulation tasks.
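
The described loop is easy to sketch with toy stubs (everything below is a placeholder, not the paper's implementation): roll the policy inside the learned world model, flag states with high predicted failure probability, and synthesize recovery data there instead of querying an expert.

```python
# Structural sketch of a WM-DAgger-style round with toy components.
import numpy as np

class ToyWorldModel:
    def step(self, s, a):      return s + 0.1 * a + 0.01 * np.random.randn(*s.shape)
    def failure_prob(self, s): return float(np.linalg.norm(s) > 2.0)  # crude OOD proxy
    def recovery(self, s):     return [(s, -0.5 * s)]  # steer back toward the data

def wm_dagger_round(policy, wm, horizon=50, risk=0.5):
    data, s = [], np.zeros(3)
    for _ in range(horizon):
        a = policy(s)
        s = wm.step(s, a)
        if wm.failure_prob(s) > risk:
            data.extend(wm.recovery(s))   # synthetic recovery pairs, no expert
    return data                            # appended to the behavior-cloning dataset

extra = wm_dagger_round(lambda s: np.random.randn(3), ToyWorldModel())
print(len(extra), "synthesized recovery transitions")
```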

Paper 22: ComSim
cs.RO cs.CV
Authors: Yiran Qin et al.
Core Contributions
  • Proposes hybrid classical+neural simulation for compositional data generation: classical physics for dynamics, neural networks for rendering
  • Directly generates action-video pairs without requiring intermediate 3D supervision, enabling scalable synthetic data production
  • Demonstrates significant reduction in sim-to-real gap compared to standard rendering-only simulators by closing the sensorimotor loop
  • Compositional approach allows reuse of components across different simulation scenarios, enabling efficient scaling to diverse tasks
Generating large-scale robot training data in simulation is essential for learning-based approaches, but sim-to-real transfer remains challenging. We introduce ComSim, a compositional simulation framework that combines classical physics engines with learned neural components to generate realistic action-conditioned videos. The hybrid approach leverages classical simulators for accurate dynamics while using neural networks to model visual complexity. Compositional design allows reusing and combining components across tasks. We demonstrate that ComSim generates data that transfers effectively to real robotic systems with minimal fine-tuning.

Paper 25: ScoRe-Flow
cs.RO
Authors: Xiaotian Qiu et al.
Core Contributions
  • Introduces score-based RL fine-tuning for flow matching policies, enabling distributional control beyond single trajectory optimization
  • Achieves 2.4x faster convergence compared to standard RL fine-tuning by leveraging probabilistic structure of flow models
  • Maintains diversity of solutions while optimizing for task objectives, important for exploration and robustness
  • Bridges generative modeling and reinforcement learning: combines expressiveness of flow models with reward optimization
Flow matching models can generate diverse trajectories, but applying them to reward optimization requires careful design. We propose ScoRe-Flow, which performs score-based RL fine-tuning on flow matching policies to optimize task-specific rewards while maintaining distributional properties. By working in score space, we enable efficient gradient-based optimization of the flow model without compromising generative diversity. Experiments show significant speedup in convergence and improved performance on complex manipulation tasks.
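
As a rough illustration of reward-driven fine-tuning of a flow-matching policy, the sketch below reweights the standard flow-matching regression loss by trajectory reward. This is a generic reward-weighted variant for intuition only; the paper's contribution is performing the fine-tuning in score space, which this sketch does not implement.

```python
# Reward-weighted flow-matching fine-tuning (simplified illustration).
import torch

def weighted_flow_matching_loss(v_net, trajs, rewards, beta=1.0):
    # trajs: [B, D] flattened trajectories sampled from the current flow policy
    B, D = trajs.shape
    w = torch.softmax(beta * rewards, dim=0)   # higher reward => higher weight
    t = torch.rand(B, 1)
    x0 = torch.randn_like(trajs)               # noise endpoints
    xt = (1 - t) * x0 + t * trajs              # linear interpolant
    target = trajs - x0                        # conditional velocity target
    per_sample = ((v_net(xt, t) - target) ** 2).mean(dim=1)
    return (w * per_sample).sum()

v = lambda x, t: torch.zeros_like(x)  # stand-in velocity network
print(weighted_flow_matching_loss(v, torch.randn(8, 16), torch.randn(8)).item())
```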

Manipulation & Grasping

Paper 13: ViserDex
cs.RO cs.CV
Authors: Arjun Bhardwaj et al.
Core Contributions
  • Applies 3D Gaussian Splatting to monocular RGB observations for high-fidelity in-hand object understanding, replacing expensive multi-view setups
  • Domain randomization in Gaussian space enables effective transfer from simulation to real dexterous manipulation without domain adaptation
  • Achieves robust reorientation control using only single onboard camera, demonstrating practical deployment for five-fingered hands
  • Demonstrates that differentiable rendering enables better visual sim-to-real transfer than traditional pixel-space matching
Dexterous in-hand manipulation requires precise control based on visual feedback, but sim-to-real transfer for monocular observations is challenging. We introduce ViserDex, which uses 3D Gaussian Splatting to reconstruct hand-object scenes from single RGB images. Gaussian splatting provides differentiable rendering for domain randomization, enabling robust transfer from simulation. Our approach combines differentiable rendering with reinforcement learning to learn dexterous manipulation policies from image observations. Experiments on real five-fingered hands demonstrate successful reorientation of diverse objects.

Paper 15: Diffusion-Based RL for 3D Bin Packing
cs.RO
Authors: Jie Han et al.
Core Contributions
  • Proposes diffusion-based actor network for bin packing, treating grasp pose generation as iterative refinement problem
  • Leverages generative model's ability to capture multimodal grasp distributions, improving exploration in action space
  • Outperforms standard DRL baselines in number of successfully packed items, showing importance of diverse action sampling
  • Integrates uncertainty quantification through diffusion, enabling risk-aware packing decisions
3D bin packing is a challenging manipulation task requiring both geometric reasoning and sequential decision-making. We propose a diffusion-based reinforcement learning approach where the policy generates grasp poses through iterative refinement. The diffusion model captures the distribution of valid grasp poses, providing diverse action samples for exploration. Combined with RL optimization, this approach achieves better packing efficiency than traditional methods. We demonstrate the approach on both simulated and real robotic systems.

Paper 21: AffordSim
cs.RO cs.AI
Authors: Mingyang Li et al.
Core Contributions
  • Introduces AffordSim for open-vocabulary 3D affordance learning, moving beyond class-specific grasp points to semantic object properties
  • Provides a benchmark of 50 tasks across 7 object categories, revealing significant performance gaps on affordance-demanding tasks and pointing to an important research frontier
  • Scalable simulator-based data generation enables training affordance models on diverse objects without hand-annotation
  • Affordance representations enable zero-shot transfer to novel objects by grounding task requirements in spatial properties
Robotic manipulation requires understanding object affordances—what actions are possible on different parts of objects. We present AffordSim, a data generation framework and benchmark for learning open-vocabulary 3D affordances. Unlike traditional grasp detection, affordances capture semantic properties like "graspable," "screwable," and "deformable." AffordSim provides 50 manipulation tasks across 7 object categories. Our analysis reveals significant performance gaps on affordance-demanding tasks, highlighting an important research direction for embodied AI systems.

Paper 24: CLASP
cs.RO
Authors: Yiran Ling et al.
Core Contributions
  • Proposes dual-pathway perception combining language-guided detection and spatial property reasoning for open-vocabulary grasping
  • Asynchronous closed-loop architecture enables continuous perception refinement as robot executes grasps, recovering from failures
  • Achieves 87% success rate on real robot grasping without requiring task-specific training or object models
  • Demonstrates practical system that generalizes to arbitrary objects specified by natural language description
Grasping novel objects requires flexible perception systems that can understand arbitrary object categories. We propose CLASP, which combines language-guided object detection with spatial reasoning for open-vocabulary grasping. CLASP uses a dual-pathway architecture: one pathway detects objects specified by natural language descriptions, while another pathway reasons about grasp feasibility based on spatial properties. An asynchronous closed-loop controller continuously refines perceptions and adjusts grasp strategies during execution. Experiments show robust grasping of diverse objects without task-specific training.

Planning & Control

Paper 1: Ternary Logic Encodings of Temporal Behavior Trees
cs.RO eess.SY
Authors: Ryan Matheu, John S. Baras, Calin Belta
Core Contributions
  • Reformulates Temporal Behavior Trees using ternary-valued STL (Signal Temporal Logic) to enable formal verification and synthesis
  • Develops mixed-integer linear program encodings that guarantee correct-by-construction control strategies with formal safety guarantees
  • Bridges hierarchical planning (behavior trees) and formal verification (temporal logic), enabling automated synthesis with certification
  • Applicable to systems where correctness is critical: autonomous vehicles, medical robotics, industrial automation
Behavior Trees are widely used for hierarchical control in robotics, but lack formal guarantees. We present a novel approach to encode Temporal Behavior Trees using ternary-valued Signal Temporal Logic (STL), enabling formal verification and automated synthesis. The encoding reformulates tree semantics as ternary logic operations (true, false, unknown) rather than binary. We then develop mixed-integer linear program (MILP) encodings to synthesize control strategies that are guaranteed to satisfy temporal requirements. This approach enables correct-by-construction controllers suitable for safety-critical applications.
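
For intuition, ternary semantics of this kind typically follow (strong) Kleene logic over {⊤, ⊥, U}, where U propagates exactly when the binary value is undetermined. The MILP trick shown below is one common encoding of a three-valued variable with two binaries; the paper's exact construction may differ.

```latex
% Strong Kleene three-valued connectives, using the order \bot < U < \top:
\neg \top = \bot, \quad \neg \bot = \top, \quad \neg U = U,
\qquad
a \wedge b = \min(a, b), \qquad a \vee b = \max(a, b)

% A common MILP encoding of a ternary value v \in \{0, \tfrac{1}{2}, 1\}
% uses two binaries:
v = \tfrac{1}{2}(b_1 + b_2), \qquad b_1 \ge b_2, \qquad b_1, b_2 \in \{0, 1\}
```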

Paper 3: UGE-TO (Uncertainty Guided Exploratory Trajectory Optimization)
cs.RO
Authors: O. Goktug Poyrazoglu, Yukang Cao, Rahul Moorthy, Volkan Isler
Core Contributions
  • Proposes UGE-TO that generates well-separated trajectory samples via uncertainty ellipsoids, improving exploration efficiency in MPC
  • Uses Hellinger distance between trajectory distributions to quantify diversity, ensuring samples cover different regions of solution space
  • Achieves 72.1% faster convergence and 66% faster execution time with 6.7% higher success rate on complex planning tasks
  • Uncertainty-guided approach is more sample-efficient than uniform sampling or diversity-only methods
Sampling-based Model Predictive Control requires generating diverse trajectory candidates for optimization. Standard approaches sample uniformly or use prior information, but can miss promising regions. We propose Uncertainty Guided Exploratory Trajectory Optimization (UGE-TO), which generates trajectory samples by considering both model uncertainty and trajectory diversity. Uncertainty ellipsoids from the dynamics model guide sampling toward informative regions, while Hellinger distance ensures samples remain well-separated. This approach significantly accelerates convergence and improves success rates on complex manipulation and navigation tasks.
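
The Hellinger distance referenced above is bounded in [0, 1] and has a closed form for Gaussian distributions, which is what makes it cheap to use as a separation score during sampling. The Gaussian trajectory-distribution assumption below is ours for illustration.

```latex
% Squared Hellinger distance, and its closed form for Gaussians (standard result):
H^2(P, Q) = 1 - \int \sqrt{p(x)\,q(x)}\,dx,
\qquad
H^2\!\big(\mathcal{N}(\mu_1,\Sigma_1), \mathcal{N}(\mu_2,\Sigma_2)\big)
= 1 - \frac{\det(\Sigma_1)^{1/4}\,\det(\Sigma_2)^{1/4}}
           {\det\!\big(\tfrac{\Sigma_1+\Sigma_2}{2}\big)^{1/2}}
  \exp\!\Big(-\tfrac{1}{8}\,(\mu_1-\mu_2)^{\top}
  \big(\tfrac{\Sigma_1+\Sigma_2}{2}\big)^{-1}(\mu_1-\mu_2)\Big)
```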

Paper 23: Marble (Complementarity-by-Construction Solver)
cs.RO
Authors: Arun L. Bishop et al.
Core Contributions
  • Develops Lie-group-based framework for solving quadratic programs with complementarity constraints, fundamental for contact-rich manipulation
  • Guarantees complementarity satisfaction by construction rather than post-hoc checking, ensuring physically consistent solutions
  • Open-source Marble solver in C++ enables practical adoption in trajectory optimization and control pipelines
  • Unified mathematical framework handles diverse contact scenarios: friction, contact breaks, and inequality constraints
Contact-rich manipulation requires solving optimization problems with complementarity constraints that ensure physical consistency. We present a novel approach using Lie group geometry to structure these optimization problems. By leveraging the group structure, we develop algorithms that satisfy complementarity constraints by construction, avoiding the numerical issues of penalty methods. We provide an open-source solver (Marble) in C++ for practical trajectory optimization and control synthesis in contact-rich scenarios. The approach handles friction cones, contact breaks, and diverse constraint types.
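
For readers new to complementarity: the constraints in question say that contact separation and normal force can never be simultaneously positive. A solver that satisfies these "by construction" never visits iterates that violate them, unlike penalty methods.

```latex
% Contact complementarity for a gap function \phi(q) and normal force \lambda_n:
0 \le \phi(q) \;\perp\; \lambda_n \ge 0
\quad\Longleftrightarrow\quad
\phi(q) \ge 0, \quad \lambda_n \ge 0, \quad \phi(q)\,\lambda_n = 0
```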

Paper 26: Gait Optimization for a Compliant Worm Robot
cs.RO eess.SY
Authors: Xinyu Zhou et al.
Core Contributions
  • Develops hybrid dynamic model for compliant worm robot navigating constrained environments (corrugated pipes, narrow spaces)
  • Multi-objective gait optimization balances speed, stability, and energy efficiency, addressing trade-offs in novel locomotion
  • Compliant mechanisms enable adaptability to irregular terrain, important for inspection and search-and-rescue applications
  • Physics-based optimization provides interpretable gaits suitable for real-world deployment
Worm-like robots offer unique capabilities for navigation in confined spaces, but their compliant dynamics are complex. We develop a hybrid dynamic model capturing both rigid-body and elastic deformation behaviors. Using this model, we formulate multi-objective gait optimization to find movement patterns that are fast, stable, and energy-efficient. Experiments with a soft-robotic worm prototype demonstrate that optimized gaits significantly outperform baseline approaches in navigating pipes and irregular terrain. This work provides tools for designing effective locomotion strategies for compliant robots.

Paper 27: Multi-ORFT
cs.RO cs.AI
Authors: Haojie Bai et al.
Core Contributions
  • Combines scene-conditioned diffusion models with multi-agent reinforcement learning for cooperative autonomous driving
  • VG-GRPO post-training provides stable fine-tuning without mode collapse, crucial for diverse behavior in traffic scenarios
  • Reduces collision rate from 2.04% to 1.89% through multi-agent coordination, demonstrating safety improvements from cooperation
  • Addresses challenge of maintaining diversity in generative models while optimizing for safety objectives
Autonomous driving requires generating diverse behaviors while maintaining safety guarantees. We propose Multi-ORFT, which applies online reinforcement fine-tuning to multi-agent diffusion planning. A scene-conditioned diffusion model generates diverse trajectories, which are then refined using group reward policy optimization (VG-GRPO). This approach maintains the diversity of diffusion models while optimizing for collision-free trajectories. We demonstrate improvements in both safety and realism on complex driving scenarios with multiple vehicles.

Navigation & Perception

Paper 6: EagleVision
cs.RO cs.CV
Authors: Zakhar Yagudin et al.
Core Contributions
  • Introduces LiDAR-based benchmark for 3D object detection and trajectory prediction in high-speed racing context
  • Emphasizes cross-domain transfer: models trained on one racing venue must generalize to different tracks, vehicle types, and weather
  • Simultaneously addresses perception tasks (detection) and prediction (future trajectory), critical for real-time decision-making
  • High-speed racing provides particularly challenging testbed with brief decision windows and safety-critical consequences
High-speed autonomous racing demands robust perception under challenging conditions: limited reaction time, dynamic environments, and safety criticality. We present EagleVision, a multi-task benchmark combining 3D object detection and trajectory prediction using LiDAR data. The benchmark emphasizes cross-domain generalization—models must transfer across different racing venues, weather conditions, and vehicle configurations. We analyze domain gaps and propose methods for cross-domain adaptation. EagleVision provides the first large-scale benchmark for perception in high-speed autonomous racing.

Paper 16: 3DRO (3D Radar Odometry)
cs.RO
Authors: Cedric Le Gentil et al.
Core Contributions
  • Extends Direct Radar Odometry to full SE(3) estimation by integrating gyroscope information for rotation estimation
  • Achieves LiDAR-level accuracy using only 2D imaging radar and gyroscope—much cheaper sensors than 3D LiDAR
  • Validated on 643km Boreas-RT dataset with reliable performance across weather conditions (rain, snow, sun glare)
  • Radar odometry provides weather robustness advantage over LiDAR, important for autonomous vehicles in adverse conditions
Radar offers a cost-effective alternative to LiDAR for autonomous vehicles, with superior performance in adverse weather. Previous radar odometry methods produced 2D estimates; full 3D localization required additional sensors. We present 3DRO, which combines 2D radar imaging with inertial measurement (gyroscope) to estimate full SE(3) poses. Our direct optimization approach achieves accuracy comparable to LiDAR-based methods while maintaining radar's weather robustness. Validation on the Boreas-RT dataset (643km) demonstrates reliable performance across rain, snow, and changing lighting.

Paper 29: ReefMapGS
cs.RO cs.CV
Authors: Daniel Yang et al.
Core Contributions
  • Combines multimodal SLAM (fusion of sonar, visual, and inertial data) with incremental 3D Gaussian Splatting for large-scale underwater reconstruction
  • COLMAP-free approach enables real-time processing without offline structure-from-motion, critical for long-term deployments
  • Validated on 700m AUV trajectories in challenging underwater environments with turbidity and dynamic lighting
  • Enables high-fidelity 3D maps for underwater robotics applications: coral monitoring, archaeological surveying, infrastructure inspection
Large-scale underwater mapping with autonomous vehicles requires robust SLAM and efficient 3D reconstruction. We present ReefMapGS, which integrates multimodal SLAM (sonar, camera, inertial) with incremental 3D Gaussian Splatting. Unlike prior methods requiring offline structure-from-motion (e.g., COLMAP), ReefMapGS processes data in real-time, enabling long-term deployments. Gaussian splatting provides both efficient rendering and compact memory footprint suitable for onboard computation. Experiments on 700-meter AUV trajectories in coral reef environments demonstrate successful mapping despite challenging underwater conditions.

Human-Robot Interaction & Embodied Cognition

Paper 4: Safe Human-to-Humanoid Motion Imitation
cs.RO eess.SY
Authors: Wenqi Cai, John Abanes, Nikolaos Evangeliou, Anthony Tzes
Core Contributions
  • Proposes vision-based framework combining human pose estimation with control barrier functions for safety-aware imitation
  • CBF-QP layer enforces collision avoidance constraints during policy execution, ensuring humanoid remains safe even if human demonstrates collision
  • Bridges perception (human pose from vision) and control (trajectory constraints), enabling robust imitation without explicit dynamics models
  • Particularly important for humanoid robots working near humans, where safety constraints cannot be compromised
Teaching humanoid robots through human demonstration requires ensuring the robot remains safe even when imitating potentially unsafe human motions. We propose a vision-based imitation learning framework that combines human pose tracking with control barrier functions. A CBF-QP controller filters the imitation policy outputs to enforce collision avoidance constraints in real-time. This approach enables humanoids to learn complex motions while maintaining safety guarantees. We demonstrate the method on manipulation tasks involving human-robot interaction in shared workspaces.
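
The CBF-QP filter has a standard shape: stay as close as possible to the imitation command while enforcing the barrier condition dh/dt ≥ -α·h. Below is a minimal single-barrier sketch on toy single-integrator dynamics; the humanoid version is higher-dimensional, and the cvxpy formulation here is our illustration, not the paper's code.

```python
# Minimal CBF-QP safety filter: minimally modify the imitation command u_imit
# so the barrier h(x) >= 0 is preserved (single barrier, toy dynamics).
import cvxpy as cp
import numpy as np

def cbf_qp_filter(u_imit, h, grad_h, f, g, alpha=1.0):
    # min ||u - u_imit||^2  s.t.  grad_h @ (f + g u) >= -alpha * h
    u = cp.Variable(u_imit.shape[0])
    constraints = [grad_h @ (f + g @ u) >= -alpha * h]
    cp.Problem(cp.Minimize(cp.sum_squares(u - u_imit)), constraints).solve()
    return u.value

# Toy 2D single integrator keeping h(x) = ||x||^2 - 1 >= 0 (stay outside unit disk)
x = np.array([1.5, 0.0])
u_safe = cbf_qp_filter(u_imit=np.array([-2.0, 0.0]),   # human motion pushes inward
                       h=x @ x - 1.0, grad_h=2 * x,
                       f=np.zeros(2), g=np.eye(2))
print(u_safe)   # inward velocity is clipped to what the barrier permits
```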

Paper 8: Dyadic Partnership in Medical Robotics
cs.RO
Authors: Nassir Navab, Zhongliang Jiang
Core Contributions
  • Proposes Dyadic Partnership model treating robots and clinicians as equal collaborative agents with complementary capabilities
  • AI-driven collaboration enables seamless task switching and shared decision-making rather than rigid automation or tele-operation
  • Particularly important for medical applications where human expertise and judgment remain critical despite automation potential
  • Framework addresses fundamental challenge: many medical tasks are inherently collaborative and cannot be fully automated
Medical robotics has pursued two paths: full automation or tele-operation. We propose a third paradigm: Dyadic Partnership, where robots and clinicians collaborate as equal partners. Rather than autonomous systems making independent decisions, dyadic systems enable continuous human-AI collaboration with seamless role switching. Each agent—robot and clinician—contributes complementary expertise: robots provide precision, consistency, and access to imaging data; humans provide judgment, adaptability, and responsibility. We present framework design principles and applications to surgical assistance.

Paper 9: Minimal Embodiment Enables Efficient Learning
cs.RO cs.AI
Authors: Zhegong Shangguan, Alessandro Di Nuovo, Angelo Cangelosi
Core Contributions
  • Demonstrates that minimal physical embodiment (robot arm with gripper) enables rapid learning of number concepts—core abstract reasoning
  • Achieves 96.8% accuracy with only 10% of typical supervised training data, showing embodiment dramatically improves sample efficiency
  • Robot develops biologically-plausible number representations (approximate number sense) through interaction with objects
  • Findings suggest embodied constraints actually facilitate abstract reasoning, challenging view that embodiment is limitation
How do abstract concepts like "number" emerge from physical interaction with the world? We investigate how embodied robots learn counting and cardinality through manipulation. A Franka Panda robot interacts with objects, counting groups and comparing quantities. We find that minimal embodiment—a simple gripper—enables efficient learning of numerical concepts with small amounts of training data. The robot develops approximate number sense representations similar to biological systems. Our results suggest embodied learning is not a limitation but actually accelerates acquisition of abstract reasoning.

Paper 10: Lightweight Transformer for Iconic Gesture Prediction
cs.RO cs.AI
Authors: Edwin C. Montiel-Vazquez et al.
Core Contributions
  • Proposes lightweight transformer for predicting iconic gestures (robot hand shapes) conditioned on text and emotion
  • Outperforms GPT-4o on gesture prediction task, showing specialized models can exceed general-purpose LLMs on embodied tasks
  • Emotion conditioning enables natural co-speech behavior where gesture expressiveness matches conversational tone
  • Computational efficiency critical for real-time robot interaction, distinguishing approach from heavy foundation models
Natural human-robot interaction requires robots to produce appropriate gestures coordinated with speech and emotion. We propose a lightweight transformer architecture for predicting iconic gestures (hand shapes indicating objects or actions) from text and emotion labels. Despite its simplicity, our model outperforms GPT-4o on this task, suggesting domain-specialized models remain valuable for embodied AI. The efficient design enables real-time inference on robot platforms. We demonstrate gesture prediction for a humanoid robot engaged in multimodal communication.

Paper 14: H²-EMV (Hierarchical Episodic Memory)
cs.RO cs.AI
Authors: Leonard Bärmann et al.
Core Contributions
  • Introduces H²-EMV: hierarchical episodic memory with selective forgetting for long-term robot operation
  • Achieves 45% reduction in memory footprint and 35% reduction in computational cost while improving accuracy by 70% on repeated queries
  • Demonstrates that forgetting policies (consolidating old experiences, removing redundant memories) are as important as encoding for lifelong learning
  • Hierarchical structure enables efficient organization of experiences at multiple timescales
Long-term robot deployment accumulates experience, but storing all memories becomes computationally intractable. We introduce H²-EMV (Hierarchical Episodic Memory with selective forgetting), which intelligently consolidates and prunes memories to maintain performance while reducing storage and computation. The approach uses hierarchical organization at multiple timescales (recent, intermediate, long-term) with selective consolidation of valuable memories. Experiments show that strategic forgetting—removing redundant or low-value experiences—improves both efficiency and accuracy on downstream tasks. This paradigm is essential for embodied agents operating for months or years.
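
A toy sketch of selective forgetting: each episode carries a utility score mixing recency and access frequency, and consolidation keeps only the most useful entries. The scoring rule, weights, and keep ratio below are illustrative assumptions, not H²-EMV's actual policy, which additionally organizes memories hierarchically across timescales.

```python
# Toy episodic memory with utility-scored pruning ("selective forgetting").
import time

class EpisodicMemory:
    def __init__(self, capacity=1000):
        self.items, self.capacity = [], capacity

    def add(self, episode):
        self.items.append({"ep": episode, "t": time.time(), "hits": 0})
        if len(self.items) > self.capacity:
            self.consolidate()

    def recall(self, idx):
        # A real system would do similarity search; here we just touch an item.
        self.items[idx]["hits"] += 1
        return self.items[idx]["ep"]

    def utility(self, item, now):
        recency = 1.0 / (1.0 + now - item["t"])
        return 0.5 * recency + 0.5 * item["hits"]

    def consolidate(self):
        # Keep the most useful half; forgetting frees memory and speeds queries.
        now = time.time()
        self.items.sort(key=lambda it: self.utility(it, now), reverse=True)
        self.items = self.items[: self.capacity // 2]
```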

Paper 30: M2HRI
cs.RO
Authors: Shaid Hasan et al.
Core Contributions
  • Proposes multi-robot system where each agent has distinct LLM personality and persistent long-term memory of user interactions
  • Framework enables personalized experience: robots remember user preferences, adapt communication style, and coordinate behavior
  • User study with n=105 participants validates that multi-robot teams with personality and memory create more engaging interactions
  • Demonstrates practical scaling of human-robot interaction concepts to multiple robots with coordinated personalization
Personalized human-robot interaction requires robots to understand individual user preferences and maintain consistent long-term relationships. We propose M2HRI, a multimodal multi-agent framework where each robot has distinct personality (driven by LLMs) and persistent memory of user interactions. Robots coordinate their behaviors and share personalization information. A user study with 105 participants demonstrates that personality and persistent memory significantly enhance engagement and user satisfaction. The framework enables robot teams to provide personalized, consistent experiences across extended interactions.

Multi-Robot Systems

Paper 7: Viscoelastic Passive-Dynamic Walker
cs.RO
Authors: Fumihiko Asano et al.
Core Contributions
  • Introduces novel passive-dynamic walker design combining cross-shaped frames with viscoelastic elements
  • Viscoelasticity enables rhythmic oscillation and energy return during walking, improving efficiency without active control
  • Demonstrates stable walking gaits emerge from interaction between mechanics (frame geometry) and materials (elasticity)
  • Simplified design suggests biologically-inspired passive approaches may achieve locomotion more efficiently than fully actuated systems
Passive dynamic walking—where robots walk downhill using gravity without active control—is a proven path to efficient locomotion. We extend this paradigm by introducing viscoelastic elements into rimless wheel designs. Cross-shaped frames combined with strategically-placed springs enable stable gaits with minimal actuation. The viscoelasticity stores and releases energy during walking, improving efficiency. We provide mathematical analysis of these systems and experimentally validate walking stability. This work suggests that clever mechanical design can achieve efficient locomotion without heavy computation or actuation.
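
For context, the classic rimless-wheel model that this design builds on alternates a pendulum-like stance phase with an energy-dissipating impact at each spoke touchdown; the viscoelastic elements modify these dynamics to recover some of that energy. The baseline equations below are the standard textbook model, not the paper's extended one.

```latex
% Rimless wheel: inverted-pendulum stance dynamics and the impact map at
% each spoke touchdown (2\alpha is the inter-spoke angle, \ell the leg length):
\ddot{\theta} = \frac{g}{\ell}\sin\theta,
\qquad
\dot{\theta}^{+} = \cos(2\alpha)\,\dot{\theta}^{-}
```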

Paper 17: IBR (Dynamic Multi-Robot Task Allocation)
cs.RO eess.SY cs.GT
Authors: Maria G. Mendoza et al.
Core Contributions
  • Proposes IBR (Iterative Best Response) decentralized policy for multi-robot task allocation without centralized coordination
  • Handles realistic constraints: communication delays, partial task information, dynamic environment changes
  • Demonstrates scalability to large systems: successfully allocates tasks for 100+ drones in package delivery scenarios
  • Game-theoretic approach provides convergence guarantees despite communication limitations
Multi-robot systems require efficient task allocation, especially when communication bandwidth is limited and task information is uncertain. We propose IBR, an iterative best response algorithm where each robot independently optimizes its task assignments based on local information. Through game-theoretic analysis, we prove convergence even with communication constraints. The approach handles dynamic environments where new tasks arrive continuously. Experiments with up to 100 drones on package delivery tasks demonstrate scalability and efficiency compared to centralized allocation methods.
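
A compact sketch of iterative best response for task allocation: robots take turns re-picking the task that minimizes their own cost given everyone else's current picks, until no one wants to switch. The congestion penalty and one-task-per-robot setup are illustrative assumptions; the paper additionally handles communication delays and dynamically arriving tasks.

```python
# Iterative best response for decentralized task allocation (toy version).
import numpy as np

def ibr_allocate(cost, rounds=20):
    # cost[i, j]: cost for robot i to perform task j; one task per robot here
    n_robots, n_tasks = cost.shape
    assign = np.random.randint(n_tasks, size=n_robots)
    for _ in range(rounds):
        changed = False
        for i in range(n_robots):
            taken = np.bincount(np.delete(assign, i), minlength=n_tasks)
            # Best response using only local info plus a congestion penalty
            best = int(np.argmin(cost[i] + 10.0 * taken))
            if best != assign[i]:
                assign[i], changed = best, True
        if not changed:
            break   # no robot wants to deviate: an equilibrium allocation
    return assign

print(ibr_allocate(np.random.rand(5, 8)))
```

With a congestion-style cost like this, the game is a potential game, which is the usual route to the convergence guarantees the paper's analysis provides under weaker communication assumptions.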