🤖 Robotics arXiv Digest

📅 April 14, 2026
📚 30 Papers
🏷️ 8 Research Areas
✨ Generated by Claude

📊 Research Landscape

The April 2026 robotics arXiv batch reveals a field at an inflection point, with foundation models and learning-based approaches reaching production maturity while fundamental challenges in whole-body coordination, real-world deployment, and human-robot collaboration remain active frontiers. The dominance of learning-based methods is evident across manipulation (Papers 1, 14, 15, 21, 29), navigation (Papers 11, 13, 28), and autonomous systems (Papers 2, 7, 18, 20), yet papers consistently highlight the critical gap between simulation and deployment, whether through touch-centered multimodal learning (Paper 14), feasibility-aware trajectory generation (Paper 7), or resilient sensor fusion (Papers 20, 30).

A striking theme is the shift from monolithic end-to-end models toward interpretable, modular architectures that combine learning with classical control. Papers 1 (PAINT), 15 (WHOLE-MoMa), and 16 (hybrid plan refinement) exemplify this: they use hierarchical decomposition to separate intent inference or planning from low-level execution. Similarly, SLAM is undergoing a renaissance with 3D Gaussian splatting (Papers 8, 19, 25), offering orders-of-magnitude speed improvements while enabling dynamic scene handling, a critical capability for real-world deployment mentioned across the navigation papers (11, 13, 28).

Cross-cutting innovations include (1) scalable data collection via VR and sim-to-real pipelines (Papers 6, 14, 24), reducing the sample complexity of real-world learning; (2) neural scene representations as a unifying abstraction for navigation, manipulation, and simulation (Papers 8, 21, 25); and (3) integration of physical priors and safety constraints into differentiable pipelines (Papers 7, 9, 24). The traffic simulation survey (Paper 2) and autonomous driving cohort (Papers 2, 7, 18, 20, 22) signal maturation in this domain, while papers on underwater exploration (28), dynamic soaring (22), and nanoparticle synthesis (23) hint at expanding application frontiers beyond traditional mobile manipulation and driving.

🎯 Research Areas

VLA & Foundation Models

Vision-language models and embodied action prediction for robotic manipulation

3 papers

Autonomous Driving & Traffic

End-to-end learning, trajectory planning, and behavior simulation for autonomous vehicles

5 papers

Mobile Manipulation & Whole-Body Control

Coordination algorithms and learning methods for mobile manipulator arms

4 papers

SLAM & 3D Reconstruction

Simultaneous localization, mapping, and neural scene representations

5 papers

Robot Learning & Sim-to-Real

Reinforcement learning, transfer learning, and simulation-based training

5 papers

Navigation & Multi-Robot Systems

Goal-directed navigation, collaborative systems, and active exploration

3 papers

Hardware & Mechanism Design

Mechanical design, actuators, and wearable sensing systems

4 papers

Human-Robot Interaction

Error recovery, safety, and collaborative design principles

1 paper

📑 Papers

VLA & Foundation Models

Yaru Niu, H. E. Tseng et al.
Core Contributions
  • Introduces Humanoid Transformer with Touch Dreaming (HTD), a multimodal encoder-decoder Transformer that treats tactile feedback as a core modality alongside vision and proprioception for dexterous manipulation.
  • Combines RL-based whole-body controller with VR-based teleoperation for efficient real-world demonstration collection in humanoid systems.
  • Achieves 90.9% relative improvement in average success rate over the stronger baseline across five contact-rich tasks (Insert-T, Book Organization, Towel Folding, Cat Litter Scooping, Tea Serving).
  • Shows in ablations that latent-space tactile prediction yields a 30% relative gain in success rate over raw tactile prediction, underscoring the importance of learned tactile representations.
Humanoid robots promise general-purpose assistance, yet real-world humanoid loco-manipulation remains challenging because it requires whole-body stability, dexterous hands, and contact-aware perception under frequent contact changes. In this work, we study dexterous, contact-rich humanoid loco-manipulation. We first develop an RL-based whole-body controller that provides stable lower-body and torso execution during complex manipulation. Built on this controller, we develop a whole-body humanoid data collection system that combines VR-based teleoperation with human-to-humanoid motion mapping, enabling efficient collection of real-world demonstrations. We then propose Humanoid Transformer with Touch Dreaming (HTD), a multimodal encoder-decoder Transformer that models touch as a core modality alongside multi-view vision and proprioception. HTD is trained in a single stage with behavioral cloning augmented by touch dreaming: in addition to predicting action chunks, the policy predicts future hand-joint forces and future tactile latents, encouraging the shared Transformer trunk to learn contact-aware representations for dexterous interaction. Across five contact-rich tasks, Insert-T, Book Organization, Towel Folding, Cat Litter Scooping, and Tea Serving, HTD achieves a 90.9% relative improvement in average success rate over the stronger baseline. Ablation results further show that latent-space tactile prediction is more effective than raw tactile prediction, yielding a 30% relative gain in success rate. These results demonstrate that combining robust whole-body execution, scalable humanoid data collection, and predictive touch-centered learning enables versatile, high-dexterity humanoid manipulation in the real world.
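The "touch dreaming" objective described above can be pictured as a behavioral-cloning loss with two auxiliary prediction terms. The sketch below is illustrative only: the abstract does not state the loss form or weighting, so the use of mean squared error, the weights `w_force` and `w_tactile`, and all function and key names are assumptions.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def touch_dreaming_loss(pred, target, w_force=0.5, w_tactile=0.5):
    """Action-chunk imitation loss plus two auxiliary 'dreaming' terms.

    pred and target are dicts with keys 'actions', 'forces' and
    'tactile_latents'. MSE and the weights are assumptions made for
    illustration, not values taken from the paper.
    """
    action_term = mse(pred["actions"], target["actions"])
    force_term = mse(pred["forces"], target["forces"])
    tactile_term = mse(pred["tactile_latents"], target["tactile_latents"])
    return action_term + w_force * force_term + w_tactile * tactile_term
```

Because all three terms are computed from the same prediction, gradients from the force and tactile terms would flow into a shared trunk, which is the mechanism the abstract credits for contact-aware representations.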
Zijian Song, Liang Lin et al.
Core Contributions
  • Proposes VGA (Vision-to-Geometry-Action) model that replaces language/video backbones with a 3D world model backbone, reframing manipulation as direct vision-to-geometry mapping.
  • Achieves zero-shot viewpoint generalization without explicit 3D supervision, demonstrating robust spatial reasoning across diverse camera perspectives.
  • Outperforms π₀.₅ baseline while maintaining interpretability through explicit 3D geometry predictions.
  • Provides novel perspective on embodied reasoning: 3D world models as foundational abstraction for action prediction, distinct from language-first approaches.
Manipulation tasks require understanding the spatial structure of objects and scenes. This work proposes Vision-to-Geometry-Action (VGA), a model that interprets robotic manipulation as vision-to-geometry mapping followed by geometry-to-action prediction. Rather than relying on language as an intermediate representation, VGA learns a 3D world model backbone that directly predicts object geometry and spatial relationships from images. The geometry predictions serve as input to a learned or classical action policy. Experiments demonstrate that the VGA approach enables zero-shot viewpoint generalization and outperforms π₀.₅-style vision-language models on manipulation tasks. The results suggest that explicit 3D geometric reasoning is a more effective foundation for manipulation than language-based abstractions.
Junming Wang, Pengxiang Zhai et al.
Core Contributions
  • Develops VR teleoperation interface for robot-free dexterous manipulation data collection, achieving 85% data validity without physical robot deployment.
  • Demonstrates 10:1 robot-free to real-world data ratio, enabling efficient scaling of training data at minimal cost.
  • Compiles 2,000-hour dataset of high-quality dexterous manipulation demonstrations across diverse tasks and hand morphologies.
  • Shows that VR-collected data generalizes effectively to real robotic systems, reducing sim-to-real gap for manipulation tasks.
Dexterous robotic manipulation requires large-scale, high-quality datasets to train learning-based control policies. This work presents XRZero-G0, a comprehensive approach to data collection for dexterous manipulation using VR teleoperation interfaces. Key contributions include: (1) a VR interface enabling intuitive, high-fidelity human control of robotic hands; (2) validation that VR-collected data achieves 85% validity when applied to real robots; (3) demonstration of a 10:1 ratio between virtual and real-world data usage; (4) a 2,000-hour dataset spanning multiple manipulation tasks and hand morphologies. The work enables efficient scaling of manipulation learning by decoupling demonstration collection from expensive real-world robot operation.

Autonomous Driving & Traffic

Saeed Rahmani, Shiva Rasouli, Daphne Cornelisse, Eugene Vinitsky, Bart van Arem
Core Contributions
  • Provides comprehensive survey of AI methods for mixed autonomy traffic simulation, bridging traffic engineering and computer science communities.
  • Introduces unified taxonomy organizing methods into three families: agent-level behavior models, environment-level simulation methods, and cognitive/physics-informed approaches.
  • Analyzes critical gaps in existing simulation platforms—highlighting limited realism in human driver modeling and insufficient integration of learned interaction models.
  • Reviews evaluation protocols, metrics, datasets, and tools; identifies key research directions for advancing safe and representative AV testing in mixed traffic scenarios.
Autonomous vehicles (AVs) are now operating on public roads, which makes their testing and validation more critical than ever. Simulation offers a safe and controlled environment for evaluating AV performance in varied conditions. However, existing simulation tools mainly focus on graphical realism and rely on simple rule-based models and therefore fail to accurately represent the complexity of driving behaviors and interactions. Artificial intelligence (AI) has shown strong potential to address these limitations; however, despite the rapid progress across AI methodologies, a comprehensive survey of their application to mixed autonomy traffic simulation remains lacking. Existing surveys either focus on simulation tools without examining the AI methods behind them, or cover ego-centric decision-making without addressing the broader challenge of modeling surrounding traffic. Moreover, they do not offer a unified taxonomy of AI methods covering individual behavior modeling to full scene simulation. To address these gaps, this survey provides a structured review and synthesis of AI methods for modeling AV and human driving behavior in mixed autonomy traffic simulation. We introduce a taxonomy that organizes methods into three families: agent-level behavior models, environment-level simulation methods, and cognitive and physics-informed methods. The survey analyzes how existing simulation platforms fall short of the needs of mixed autonomy research and outlines directions to narrow this gap. It also provides a chronological overview of AI methods and reviews evaluation protocols and metrics, simulation tools, and datasets. By covering both traffic engineering and computer science perspectives, we aim to bridge the gap between these two communities.
Baoyun Wang, B. Leng et al.
Core Contributions
  • Proposes trajectory-centric diffusion model with built-in feasibility constraints, avoiding post-hoc trajectory filtering and improving real-world deployment reliability.
  • Integrates curvature constraints and kinematic feasibility directly into diffusion planning process, ensuring predicted trajectories are executable by real vehicles.
  • Applies GRPO (Group Relative Policy Optimization) post-training to further refine trajectory quality and driving safety.
  • Demonstrates improved generalization to diverse driving scenarios while maintaining computational efficiency suitable for autonomous vehicle deployment.
Trajectory planning is central to autonomous driving. Diffusion models offer flexible, expressive trajectory generation but often produce kinematically infeasible paths requiring post-hoc correction. This work proposes FeaXDrive, a diffusion-based planning approach that integrates feasibility constraints directly into the generative process. The method enforces curvature limits and vehicle kinematic constraints during trajectory sampling, eliminating trajectories that violate physical feasibility. Additionally, GRPO (Group Relative Policy Optimization) post-training refines trajectories for safety and comfort. Experiments demonstrate that feasibility-aware diffusion planning achieves superior performance on autonomous driving benchmarks while maintaining the generality and expressiveness of diffusion models.
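To make the curvature constraint concrete, the sketch below checks whether a 2D waypoint sequence stays within a curvature limit. It uses the Menger curvature of consecutive point triples, a standard discrete estimate. It shows only what such a constraint measures, not how FeaXDrive embeds it in diffusion sampling. The function names and the `max_curvature` parameter are hypothetical.

```python
import math


def menger_curvature(p0, p1, p2):
    """Curvature (1/radius) of the circle through three 2D points."""
    a = math.dist(p0, p1)
    b = math.dist(p1, p2)
    c = math.dist(p0, p2)
    if a == 0 or b == 0:
        return 0.0  # duplicate consecutive points carry no turn information
    if c == 0:
        return math.inf  # path doubles back onto its own start point
    # |cross| is twice the triangle area; curvature = 4 * area / (a * b * c)
    cross = (p1[0] - p0[0]) * (p2[1] - p0[1]) - (p1[1] - p0[1]) * (p2[0] - p0[0])
    return 2.0 * abs(cross) / (a * b * c)


def is_curvature_feasible(waypoints, max_curvature):
    """True if every consecutive triple respects the curvature limit."""
    return all(
        menger_curvature(waypoints[i], waypoints[i + 1], waypoints[i + 2]) <= max_curvature
        for i in range(len(waypoints) - 2)
    )
```

For a vehicle with minimum turning radius R, `max_curvature` would be 1/R. A post-hoc filter would call `is_curvature_feasible` on each sampled trajectory, which is the step the paper aims to avoid by building the constraint into generation.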
Zhihua Hua, Zhongxue Gan et al.
Core Contributions
  • Introduces SNG (Spatial Navigation Guidance) framework showing that navigation understanding significantly improves end-to-end driving without auxiliary perception losses.
  • Proposes SNG-VLA variant integrating vision-language models with spatial guidance, achieving state-of-the-art performance on driving benchmarks.
  • Demonstrates that spatial reasoning acts as a strong auxiliary task for learning meaningful representations without explicit perception supervision.
  • Simplifies training pipeline by eliminating separate object detection and segmentation losses while improving overall driving performance.
End-to-end autonomous driving aims to map sensor inputs directly to vehicle control actions. However, intermediate perception and planning representations often improve robustness. This work shows that navigation understanding—reasoning about spatial structure and routes—provides a powerful auxiliary signal for end-to-end driving without requiring explicit object detection. We introduce the SNG (Spatial Navigation Guidance) framework, which learns navigation features as an intermediate representation. SNG-VLA, a variant combining vision-language models with spatial guidance, achieves state-of-the-art results while maintaining simplicity. The work demonstrates that navigation reasoning is surprisingly effective for driving tasks and challenges the necessity of traditional perception pipelines in learning-based control.
Chieh Tsai, Salim Hariri et al.
Core Contributions
  • Proposes RACF framework with cross-sensor gating mechanism for distance correction, improving robustness to sensor corruption and failure.
  • Achieves 35% RMSE reduction in object distance estimation under sensor degradation, critical for safety-critical autonomous driving.
  • Demonstrates resilience to adversarial sensor inputs through adaptive fusion of multiple sensor modalities (camera, LiDAR, radar).
  • Provides practical solution to real-world sensor reliability challenges without requiring complete sensor replacement or expensive redundancy.
Autonomous vehicles rely on accurate distance estimation from multiple sensors. Sensor corruption, degradation, or adversarial interference can compromise safety. This work presents RACF (Resilient Autonomous Car Framework), which uses cross-sensor gating to dynamically weight and fuse distance estimates from different sensors. The approach learns when to trust each sensor based on evidence from others, effectively mitigating single-sensor failures. Experiments show 35% RMSE reduction under various corruption scenarios. The framework provides a practical solution to sensor reliability challenges in real-world autonomous driving systems.
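The idea of trusting each sensor according to evidence from the others can be sketched with a simple agreement-based gate. This is a hand-written stand-in rather than RACF's learned gating: scoring each reading by its deviation from the median, the softmax weighting, and the `temperature` parameter are all assumptions for illustration.

```python
import math
from statistics import median


def gated_fusion(distances, temperature=1.0):
    """Fuse per-sensor distance estimates, down-weighting outliers.

    Each sensor is scored by how closely it agrees with the median of
    all readings; a softmax over the scores gives the fusion weights.
    Returns (fused_distance, weights).
    """
    med = median(distances)
    scores = [-abs(d - med) / temperature for d in distances]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    fused = sum(w * d for w, d in zip(weights, distances))
    return fused, weights
```

With readings of 10.0, 10.2 and 30.0 metres, the third sensor receives a near-zero weight and the fused estimate stays close to 10.1, whereas a plain average would report about 16.7.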
Lunbing Chen, Jinpeng Huang et al.
Core Contributions
  • Introduces step-level state-feedback control for dynamic soaring in aerial vehicles without explicit trajectory planning.
  • Uses deep reinforcement learning to learn energy-optimal flight strategies in shear flows, enabling sustained autonomous flight with minimal energy.
  • Demonstrates that learned step-level policies outperform traditional trajectory-based planning by adapting to real-time flow variations.
  • Opens new application domain for RL in aerial robotics: energy harvesting from environmental wind gradients for extended autonomous operation.
Dynamic soaring—extracting energy from wind shear to sustain or extend flight—is a key capability for long-duration autonomous aerial vehicles. Traditional approaches rely on explicit trajectory planning and wind estimation. This work proposes a learning-based approach using deep reinforcement learning to discover step-level feedback control policies for dynamic soaring. The learned policies adapt to real-time wind conditions without pre-computed trajectories. Experiments demonstrate energy-efficient flight in simulated and real wind environments, significantly extending flight duration compared to conventional approaches. The work shows that learned step-level control is more adaptive and efficient than traditional planning-based methods for dynamic soaring.

Mobile Manipulation & Whole-Body Control

Zhihao Cao, Tianxu An, Chenhao Li, Stelian Coros, Marco Hutter
Core Contributions
  • Proposes hierarchical learning framework that decouples intent estimation from terrain-robust locomotion, enabling partner-agnostic collaborative transport without force-torque sensors.
  • Uses proprioceptive feedback and teacher-student training to infer partner interaction wrench in real-time, eliminating need for external force sensing.
  • Demonstrates compliant cooperative transport across diverse terrains, payloads, and partners in both simulation and real-world experiments.
  • Shows natural scaling to decentralized multi-robot transport and embodiment transfer by swapping locomotion backbone—key for robot-agnostic collaboration.
Collaborative transport requires robots to infer partner intent through physical interaction while maintaining stable loco-manipulation. This becomes particularly challenging in complex environments, where interaction signals are difficult to capture and model. We present PAINT, a lightweight yet efficient hierarchical learning framework for partner-agnostic intent-aware collaborative legged transport that infers partner intent directly from proprioceptive feedback. PAINT decouples intent understanding from terrain-robust locomotion: A high-level policy infers the partner interaction wrench using an intent estimator and a teacher-student training scheme, while a low-level locomotion backbone ensures robust execution. This enables lightweight deployment without external force-torque sensing or payload tracking. Extensive simulation and real-world experiments demonstrate compliant cooperative transport across diverse terrains, payloads, and partners. Furthermore, we show that PAINT naturally scales to decentralized multi-robot transport and transfers across robot embodiments by swapping the underlying locomotion backbone. Our results suggest that proprioceptive signals in payload-coupled interaction provide a scalable interface for partner-agnostic intent-aware collaborative transport.
Yida Niu, Ziyuan Jiao et al.
Core Contributions
  • Introduces AutoMoMa, a GPU-accelerated trajectory generation system using Analytical Kinematic Redundancy (AKR) for whole-body mobile manipulation.
  • Achieves 5,000 trajectories per GPU-hour and 80x speedup compared to baselines through parallelized computation and AKR modeling.
  • Generates over 500,000 high-quality trajectories for training manipulation policies, demonstrating scalable data synthesis for learning.
  • Enables practical whole-body coordination on mobile platforms by making trajectory generation a non-bottleneck component of the learning pipeline.
Whole-body mobile manipulation requires coordinating many degrees of freedom across base, arm, and gripper. Trajectory generation is computationally expensive and often a bottleneck in learning pipelines. This work presents AutoMoMa, an analytically-informed GPU-accelerated trajectory generation system for mobile manipulators. AutoMoMa uses Analytical Kinematic Redundancy (AKR) modeling to decompose the high-dimensional planning problem efficiently. The system generates 5,000 trajectories per GPU-hour, achieving 80x speedup over baselines. With this acceleration, we synthesize 500,000+ high-quality trajectories for training manipulation policies. The work demonstrates that removing the trajectory generation bottleneck enables scalable data synthesis for whole-body mobile manipulation learning.
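The reported figures imply a rough compute budget, worked through below. This is back-of-the-envelope arithmetic, assuming throughput stays constant and that the 80x speedup is a throughput ratio; neither assumption is confirmed by the summary, and the function name is hypothetical.

```python
def gpu_hours_needed(num_trajectories, trajectories_per_gpu_hour):
    """GPU-hours required at a constant generation throughput."""
    if trajectories_per_gpu_hour <= 0:
        raise ValueError("throughput must be positive")
    return num_trajectories / trajectories_per_gpu_hour


# Reported: 5,000 trajectories per GPU-hour and a dataset of 500,000.
accelerated = gpu_hours_needed(500_000, 5_000)      # 100.0 GPU-hours
# Assumed: the baseline runs at 1/80 of that throughput.
baseline = gpu_hours_needed(500_000, 5_000 / 80)    # 8000.0 GPU-hours
```

Under these assumptions the full dataset costs on the order of 100 GPU-hours instead of several thousand.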
Snehal Jauhri, Vignesh Prasad, Georgia Chalvatzaki
Core Contributions
  • Proposes WHOLE-MoMa: offline RL method that uses sub-optimal whole-body controller (WBC) outputs as prior, improving sample efficiency without real-world interaction.
  • Achieves 80% success on bimanual drawer manipulation and 68% on cupboard tasks without any real-world training data—learning purely from simulation.
  • Demonstrates that sub-optimal classical controllers provide valuable inductive bias for learning, enabling sim-to-real transfer of complex manipulation skills.
  • Shows practical pathway for deploying whole-body mobile manipulation without expensive real-world data collection or online learning.
Whole-body mobile manipulation is complex, requiring coordination of base mobility and arm control. Learning from real-world interaction is sample-inefficient. This work proposes WHOLE-MoMa, an offline reinforcement learning method that leverages sub-optimal whole-body controllers (WBC) as behavioral priors. By learning from WBC rollouts in simulation, the method achieves strong performance on real hardware without real-world training data. The key insight is that sub-optimal controller outputs contain valuable task structure that guides learning toward feasible behaviors. Experiments demonstrate 80% success on bimanual drawer tasks and 68% on cupboard manipulation without any real-world training. The approach shows the practical value of classical controllers as learning priors.
Heng Tao et al.
Core Contributions
  • Introduces two-stage RL approach combining CVAE (Conditional Variational Autoencoder) for diverse grasp generation with whole-body control learning.
  • Integrates tactile feedback as core signal for grasp success prediction, enabling rapid closed-loop adjustment of grasping strategy.
  • Demonstrates fast dexterous grasping with mobile manipulators through learned whole-body policies that coordinate base motion and arm control.
  • Shows that combining generative models (CVAE) with tactile sensing enables robust, adaptive grasping under real-world uncertainty.
Fast dexterous grasping with mobile manipulators requires coordinating base mobility with precise arm control and hand manipulation. This work proposes FastGrasp, a learning-based approach combining conditional variational autoencoders (CVAE) for grasp generation with reinforcement learning for whole-body control. The method leverages tactile feedback to rapidly assess grasp success and adjust strategy in real time. The two-stage approach first learns diverse grasp primitives via CVAE, then learns a whole-body policy that selects and executes grasps efficiently. Experiments demonstrate fast, reliable grasping on diverse objects with mobile manipulator systems.

SLAM & 3D Reconstruction

Benxu Tang, Fu Zhang et al.
Core Contributions
  • Proposes direct boundary-based occupancy grid mapping using truncated ray casting on boundary exterior, eliminating need for auxiliary local 3D grids.
  • Significantly reduces computational overhead compared to voxel-based mapping by operating directly on boundary layers.
  • Enables efficient large-scale 3D mapping for real-time robotic navigation and planning applications.
  • Demonstrates practical efficiency gains while maintaining or improving map quality compared to traditional voxel grid methods.
Occupancy grid mapping is fundamental for robotic navigation. Traditional voxel-based approaches require large 3D grids and significant computation. This work proposes D-BDM (Direct Boundary-Based Mapping), which constructs occupancy grids by operating directly on boundary surfaces rather than full volumes. Using truncated ray casting on boundary exteriors, the method eliminates auxiliary local 3D grid structures. D-BDM achieves comparable or better mapping quality while reducing computational and memory overhead, enabling efficient large-scale mapping for real-time robotic systems. The boundary-based perspective offers a more efficient abstraction for occupancy representation.
Ziyuan Xia, Hujun Bao et al.
Core Contributions
  • Extends Habitat simulation platform with 3D Gaussian Splatting (3DGS) for high-fidelity scene rendering with dynamic objects and avatars.
  • Enables more realistic navigation simulation by supporting dynamic agents and animated humanoid avatars rendered via 3DGS.
  • Demonstrates stronger cross-domain generalization compared to photorealistic rendering, suggesting 3DGS provides useful inductive biases for embodied AI.
  • Provides open-source simulator enabling researchers to train navigation policies with improved visual realism and dynamics.
Simulation is critical for training and evaluating embodied AI agents. High-fidelity visual rendering improves realism but can suffer from domain gap. This work introduces Habitat-GS, extending the Habitat simulator with 3D Gaussian Splatting (3DGS) for scene rendering. The approach supports dynamic objects and agents rendered as Gaussians, enabling more realistic simulation of human-robot environments. Experiments show that agents trained in Habitat-GS demonstrate stronger cross-domain generalization to real environments compared to photorealistic rendering. The work demonstrates that 3DGS offers both efficiency and generalization benefits for embodied AI simulation.
Dongen Li, Marcelo H. Ang et al.
Core Contributions
  • Proposes RMGS-SLAM: real-time SLAM system fusing LiDAR, inertial, and visual measurements via 3D Gaussian Splatting scene representation.
  • Uses Gaussian GICP for loop closure detection, enabling large-scale outdoor mapping without accumulating drift.
  • Demonstrates real-time performance on standard benchmarks with improved mapping quality and robustness compared to point-cloud SLAM methods.
  • Shows 3DGS as practical scene representation for production SLAM systems, combining efficiency with visual quality.
Real-time SLAM is fundamental for autonomous robot navigation. Traditional point-cloud SLAM struggles with dense scenes and limited visual quality. This work proposes RMGS-SLAM, a multi-sensor SLAM system using 3D Gaussian Splatting as the scene representation. The method fuses LiDAR, inertial, and visual measurements in a unified 3DGS framework. Loop closure is achieved via Gaussian GICP, maintaining accuracy in large-scale environments. Real-time experiments demonstrate improved mapping quality and computational efficiency compared to point-cloud methods. The work shows 3DGS as a practical, efficient scene representation for production robotic SLAM systems.
Yi Liu, Peiyu Zhuang et al.
Core Contributions
  • Introduces generalizable motion model for separating dynamic scene elements from static environment in monocular 3DGS SLAM.
  • Uses FIFO queue with sequential attention mechanism to identify and suppress moving objects during mapping.
  • Enables accurate SLAM in crowded, dynamic environments where traditional static-world assumptions break down.
  • Demonstrates practical navigation capability in real-world scenarios with pedestrians and moving obstacles.
SLAM in dynamic environments remains challenging due to moving people and objects. Traditional approaches assume static worlds. This work proposes GGD-SLAM, a monocular 3DGS SLAM system with a generalizable motion model that separates static and dynamic scene elements. The method uses a FIFO queue with sequential attention to track and suppress moving objects. A learned motion model predicts dynamic object trajectories, allowing the system to focus mapping on static structures. Experiments in crowded indoor and outdoor environments show accurate localization and mapping despite significant dynamic activity. The work addresses a critical practical challenge for real-world robotic navigation.
Shang-En Tsai, Wei-Cheng Sun
Core Contributions
  • Proposes Depth Reliability Mapping (DRM) that assigns per-pixel reliability scores to depth measurements, enabling selective fusion.
  • Reduces phantom obstacles created by glare and specular reflections, improving costmap quality for navigation planning.
  • Provides practical solution to sensor noise in real-world outdoor navigation where glare is common challenge.
  • Shows that reliability-weighted fusion outperforms simple averaging approaches in handling sensor artifacts.
Depth sensors often produce artifacts under challenging lighting conditions, with glare and specular reflections creating phantom obstacles. This work proposes Depth Reliability Mapping (DRM), which assigns per-pixel reliability scores to depth measurements based on lighting and surface properties. These reliability scores guide weighted fusion of multiple depth sources, suppressing unreliable measurements. The method reduces phantom obstacles by an average of 45% compared to standard depth fusion. For outdoor navigation, where glare is prevalent, the approach significantly improves navigation costmap quality and path planning safety.
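Reliability-weighted fusion for a single pixel can be sketched as a weighted mean. How DRM actually computes its reliability scores is not described here, so the scores are treated as given inputs in [0, 1]; the function name and the `min_total` cutoff are hypothetical.

```python
def fuse_depth(depths, reliabilities, min_total=1e-6):
    """Reliability-weighted mean of several depth readings for one pixel.

    Returns None when no source is trusted, so that a planner can treat
    the pixel as unknown rather than as an obstacle.
    """
    if len(depths) != len(reliabilities):
        raise ValueError("one reliability score is needed per depth reading")
    total = sum(reliabilities)
    if total < min_total:
        return None
    return sum(d * r for d, r in zip(depths, reliabilities)) / total
```

For a reading of 2.0 m with reliability 0.9 and a glare-corrupted reading of 0.3 m with reliability 0.1, the fused depth is about 1.83 m, while a simple average gives 1.15 m and would place a phantom obstacle much closer to the robot.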

Robot Learning & Sim-to-Real

K. Ege de Bruin, K. Glette et al.
Core Contributions
  • Demonstrates social learning framework where morphologically different robots learn from each other, enabling knowledge transfer across embodiments.
  • Shows that social learning significantly outperforms individual learning from scratch, accelerating skill acquisition.
  • Introduces methods for robots with different morphologies to share learned representations despite embodiment differences.
  • Opens pathway for collective learning in multi-robot systems with diverse hardware designs.
Learning from scratch is sample-inefficient for embodied agents. Social learning—where agents learn from peers—offers potential for accelerated skill acquisition. This work investigates social learning strategies for evolved virtual soft robots with diverse morphologies. Key findings: (1) morphologically different robots can learn from each other despite embodiment differences; (2) social learning outperforms individual learning across multiple tasks; (3) transfer mechanisms work effectively across diverse robot designs. The work demonstrates practical value of social learning for multi-robot systems with heterogeneous hardware and opens research directions in collective embodied intelligence.
Hyeonbeen Lee, T. Yeu et al.
Core Contributions
  • Proposes FDN (Frequency Decomposition Network) using spectral decomposition with probabilistic high-frequency head for wrench forecasting without force-torque sensors.
  • Enables sensorless force prediction in vibration-rich hydraulic systems by learning frequency-specific patterns in robot dynamics.
  • Demonstrates transfer learning from large-scale robot dataset, improving generalization to new manipulator configurations.
  • Reduces deployment cost by eliminating expensive force-torque sensors while maintaining estimation accuracy.
Force-torque sensing is essential for contact-aware manipulation but sensors are expensive and sometimes unavailable. This work proposes FDN (Frequency Decomposition Network) for sensorless wrench forecasting on hydraulic manipulators. The key insight is decomposing the prediction problem by frequency—learning low-frequency structural behavior separately from high-frequency vibrations via a probabilistic head. The method transfers well from large-scale robot datasets to new manipulator configurations. Experiments on hydraulic arms demonstrate accurate force prediction from proprioceptive signals alone, eliminating the need for force-torque sensors while maintaining estimation quality.
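The frequency split can be illustrated with a much simpler stand-in than a learned spectral decomposition: a centered moving average serves as the low-frequency component and the residual as the high-frequency component. The moving-average filter, the `window` parameter, and the function name are assumptions and are not taken from FDN.

```python
def split_frequencies(signal, window=5):
    """Split a 1D signal into low- and high-frequency parts.

    low  : centered moving average (window shrinks at the edges)
    high : residual, so low[i] + high[i] reproduces signal[i]
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    low = []
    for i in range(len(signal)):
        start, stop = max(0, i - half), min(len(signal), i + half + 1)
        low.append(sum(signal[start:stop]) / (stop - start))
    high = [s - l for s, l in zip(signal, low)]
    return low, high
```

In a forecasting setup along the lines the abstract describes, the smooth `low` part would go to a deterministic predictor, while the vibration-dominated `high` part would go to a head that outputs a distribution rather than a point estimate.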
Qiang Le et al.
Core Contributions
  • Provides systematic comparison of DDPG (reinforcement learning) versus pseudo-spectral methods (classical optimal control) for path planning.
  • Shows DDPG finds feasible solution sets faster, critical for real-time robotic applications requiring quick planning.
  • Reveals complementary strengths: RL excels at rapid feasibility discovery; optimal control provides trajectory quality.
  • Informs algorithm selection for real-time planning scenarios where computational budget is limited.
Path planning algorithms must balance solution quality with computational efficiency. Reinforcement learning (RL) and optimal control offer different tradeoffs. This work provides a systematic comparison of DDPG (Deep Deterministic Policy Gradient, an off-policy actor-critic RL method) with pseudo-spectral optimal control methods. Key findings: (1) DDPG discovers feasible solution sets faster, which is crucial for real-time planning; (2) pseudo-spectral methods produce higher-quality solutions given sufficient time; (3) the methods have complementary strengths. The analysis informs algorithm selection for real-time robotic planning scenarios where computational constraints limit trajectory optimization time.
Lidor Erez, Shahaf S. Shperberg et al.
Core Contributions
  • Proposes RL-based refinement pipeline that converts first-order kinematic plans to second-order dynamically feasible trajectories.
  • Bridges classical symbolic planning (operating on kinematic constraints) with dynamic execution requirements of real robots.
  • Shows that learned refinement policies generalize to unseen planning problems, enabling scalable deployment across diverse task specifications.
  • Supplies a key pipeline component for converting high-level plans into executable trajectories on hardware-constrained platforms.
Symbolic planners generate kinematic plans that ignore robot dynamics and torque limits, so executing these plans directly fails on real hardware. This work proposes a learned refinement policy that uses RL to adjust first-order kinematic plans until they satisfy second-order dynamics constraints and torque limits. The learned policies generalize to new planning problems, which supports scalable deployment. The work addresses the gap between symbolic planning and hardware execution in practical robot control pipelines.
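To make the first-order versus second-order gap concrete, the sketch below checks a sampled kinematic plan against an acceleration limit and repairs it by uniform time stretching. This is a classical, non-learned stand-in and not the paper's RL refinement policy; the waypoints, the limit `a_max`, and both function names are hypothetical.

```python
import math

def max_abs_acceleration(positions, dt):
    """Largest finite-difference acceleration magnitude along a sampled 1-D path."""
    return max(
        abs(positions[i + 1] - 2.0 * positions[i] + positions[i - 1]) / dt**2
        for i in range(1, len(positions) - 1)
    )

def stretch_time_step(positions, dt, a_max):
    """Return a time step >= dt at which the same waypoints respect |a| <= a_max.

    Slowing playback by a factor s divides accelerations by s**2, so
    s = sqrt(peak / a_max) is the smallest uniform stretch that is feasible.
    """
    peak = max_abs_acceleration(positions, dt)
    if peak <= a_max:
        return dt
    return dt * math.sqrt(peak / a_max)

plan = [0.0, 0.1, 0.4, 0.9, 1.0]  # hypothetical kinematic waypoints (m)
dt = 0.1                          # nominal waypoint spacing (s)
new_dt = stretch_time_step(plan, dt, a_max=5.0)
```

Uniform stretching is conservative because it slows the whole plan to fix its worst segment. A refinement policy of the kind described above could instead adjust the plan locally.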
Fangyu Sun, Yu Hu et al.
Core Contributions
  • Presents end-to-end pipeline for quadrotor control: differentiable physics simulation, RL policy learning, and sim-to-real transfer.
  • Demonstrates six different end-to-end control tasks (tracking, navigation, obstacle avoidance, etc.) on real quadrotors with learned policies.
  • Shows complete integration from training environment to hardware deployment, reducing barriers to practical end-to-end aerial robotics.
  • Demonstrates the viability of learned control on real flying systems, addressing skepticism about learning-based aerial autonomy.
End-to-end learning promises to simplify robot control by mapping sensor inputs directly to actions. However, deploying learned controllers on real aerial systems remains challenging. This work presents E2E-Fly, a complete system for training and deploying end-to-end quadrotor controllers. The pipeline includes: (1) a differentiable physics simulator for efficient learning; (2) RL-based policy learning; (3) sim-to-real transfer strategies. The system demonstrates six different flight tasks on real quadrotors. The work shows that end-to-end learning is practical and competitive for autonomous flight, enabling learned control to match or exceed classical approaches while simplifying design and tuning.

Navigation & Multi-Robot Systems

Jiahua Pei et al.
Core Contributions
  • Proposes OVAL: lifelong object goal navigation system with open-vocabulary semantic understanding, enabling navigation to novel object categories.
  • Uses memory descriptors that accumulate exploration experience, allowing the robot to reason about where unseen objects are likely found.
  • Introduces multi-value frontier scoring mechanism that balances exploration efficiency with information utility.
  • Demonstrates generalization to novel environments and object categories without retraining, key for practical deployment.
Object goal navigation—finding and reaching target objects—requires generalizing to novel environments and object categories. Traditional approaches struggle when targets are outside training categories. This work introduces OVAL (Open-Vocabulary Augmented Memory), a lifelong learning system for open-vocabulary object goal navigation. The method uses semantic memory descriptors that accumulate across episodes, encoding learned correlations between scenes and object locations. A multi-value frontier scoring mechanism balances information gain with exploration efficiency. Experiments show strong generalization to novel objects and environments without retraining, enabling practical long-term deployment.
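A multi-value frontier score of the kind described above might look like the following sketch. The three value terms, their weights, and the linear combination are all assumptions made for illustration; the digest does not state how OVAL actually combines its values.

```python
from dataclasses import dataclass

@dataclass
class Frontier:
    name: str
    semantic_value: float  # hypothetical: memory-based likelihood the target is nearby, in [0, 1]
    info_gain: float       # hypothetical: expected newly observed area, in [0, 1]
    travel_cost: float     # hypothetical: normalized path length to the frontier, in [0, 1]

def score(f, w_sem=0.5, w_info=0.3, w_cost=0.2):
    """Weighted sum that rewards semantic relevance and information, and penalizes travel."""
    return w_sem * f.semantic_value + w_info * f.info_gain - w_cost * f.travel_cost

def pick_frontier(frontiers):
    """Choose the frontier with the highest combined score."""
    return max(frontiers, key=score)

frontiers = [
    Frontier("hallway", semantic_value=0.2, info_gain=0.9, travel_cost=0.3),
    Frontier("kitchen_door", semantic_value=0.8, info_gain=0.4, travel_cost=0.6),
    Frontier("closet", semantic_value=0.1, info_gain=0.2, travel_cost=0.1),
]
best = pick_frontier(frontiers)
```

With these weights the semantically promising `kitchen_door` wins even though `hallway` offers more raw information gain, which is the kind of tradeoff such a score is meant to expose.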
Sunyao Zhou, Chenjia Bai et al.
Core Contributions
  • Introduces event-triggered dialogue for multi-robot vision-language navigation, enabling robots to request clarification when uncertain.
  • Reports a 69.2% improvement on the BSR metric through dialogue-enhanced coordination compared to silent navigation.
  • Demonstrates practical multi-robot collaboration where robots explicitly communicate to resolve ambiguities in natural language instructions.
  • Opens research direction in human-robot teams where natural dialogue improves task completion in complex, long-horizon scenarios.
Vision-language navigation (VLN) enables robots to follow natural language instructions. Multi-robot collaboration over long horizons can benefit from dialogue to resolve ambiguities. This work presents DeCoNav, a system where multiple robots navigate collaboratively using event-triggered dialogue with human instructors. When robots encounter ambiguous instructions or unexpected situations, they ask for clarification rather than acting on uncertain predictions. The dialogue mechanism improves the BSR metric by 69.2% compared to silent navigation. The work demonstrates the practical value of natural language dialogue in multi-robot collaborative tasks, opening research directions in human-robot team communication.
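An event trigger for clarification requests can be sketched as a confidence check on the next action. The margin rule, the threshold value, and the function names below are hypothetical; the digest does not describe DeCoNav's actual trigger condition.

```python
def should_ask(action_probs, margin_threshold=0.15):
    """Trigger a clarification request when the top two actions are nearly tied.

    Assumption for illustration: ambiguity is measured as the probability gap
    between the best and second-best action.
    """
    ranked = sorted(action_probs.values(), reverse=True)
    if len(ranked) < 2:
        return False
    return (ranked[0] - ranked[1]) < margin_threshold

def step(action_probs):
    """Either ask the instructor for clarification or act on the most likely action."""
    if should_ask(action_probs):
        return ("ask", None)
    best = max(action_probs, key=action_probs.get)
    return ("act", best)

ambiguous = step({"left": 0.48, "right": 0.45, "forward": 0.07})
confident = step({"left": 0.80, "right": 0.15, "forward": 0.05})
```

Because the dialogue only fires on the event, a confident robot proceeds silently and the communication cost stays proportional to how often instructions are actually ambiguous.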
Yuhan Jin, L. A. Barbosa et al.
Core Contributions
  • Proposes DINO-Explorer using DINOv3-based semantic surprise signal for active underwater exploration and discovery.
  • Implements ego-motion compensation that suppresses 45.5% of false-positive surprise signals caused by robot's own motion.
  • Enables autonomous underwater vehicles to identify interesting environmental features for investigation on their own.
  • Opens frontier in active marine robotics where semantic understanding drives exploration decisions.
Autonomous underwater exploration benefits from active sensing strategies that prioritize interesting regions. This work proposes DINO-Explorer, which uses semantic surprise as an exploration signal. The key innovation is ego-motion compensation: distinguishing surprise due to environmental novelty from surprise caused by the robot's own motion. Using DINOv3 for semantic understanding, the method detects regions with genuinely novel semantic content. Ego-motion compensation reduces false positives by 45.5%, enabling effective exploration prioritization. The work demonstrates active semantic understanding for robotic exploration in challenging underwater environments.
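The idea of separating environmental novelty from self-induced change can be sketched as follows. Everything here is an assumption for illustration: surprise is taken as the cosine distance between consecutive feature vectors, and the ego-motion contribution is modeled as a linear function of robot speed with a hypothetical `gain`. The paper's actual compensation scheme is not described in this digest.

```python
import math

def cosine_distance(a, b):
    """1 minus cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def compensated_surprise(prev_feat, curr_feat, ego_speed, gain=0.5):
    """Raw feature change minus the change expected from the robot's own motion.

    `gain * ego_speed` is a hypothetical stand-in for a motion model; the
    result is clamped at zero so fast motion alone never registers as novelty.
    """
    raw = cosine_distance(prev_feat, curr_feat)
    expected_from_motion = gain * ego_speed
    return max(0.0, raw - expected_from_motion)

static_scene = compensated_surprise([1.0, 0.0], [1.0, 0.0], ego_speed=0.4)
novel_scene = compensated_surprise([1.0, 0.0], [0.0, 1.0], ego_speed=0.4)
```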

Hardware & Mechanism Design

Jannis Gabler, O. Lambercy et al.
Core Contributions
  • Develops two-IMU wearable system for real-time detection of compensatory trunk movements (CTM) post-stroke using XGBoost classifier.
  • Achieves strong discriminative performance: macro-F1=0.80, MCC=0.73, ROC-AUC>0.93 with minimal sensing hardware.
  • Identifies wrist and trunk kinematics as sufficient anatomical sensors through systematic location-reduction analysis.
  • Enables scalable, real-time monitoring of CTM during rehabilitation therapy without bulky motion capture systems.
Compensatory trunk movements (CTMs) are commonly observed after stroke and can lead to maladaptive movement patterns, limiting targeted training of affected structures. Objective, continuous detection of CTMs during therapy and activities of daily living remains challenging due to the typically complex measurement setups required, as well as limited applicability for real-time use. This study investigates whether a two-inertial measurement unit configuration enables reliable, real-time CTM detection using machine learning. Data were collected from ten able-bodied participants performing activities of daily living under simulated impairment conditions (elbow brace restricting flexion-extension, resistance band inducing flexor-synergy-like patterns), with synchronized optical motion capture (OMC) and manually annotated video recordings serving as reference. A systematic location-reduction analysis using OMC identified wrist and trunk kinematics as a minimal yet sufficient set of anatomical sensing locations. Using an extreme gradient boosting classifier (XGBoost) evaluated with leave-one-subject-out cross-validation, the two-IMU model achieved strong discriminative performance (macro-F1 = 0.80 ± 0.07, MCC = 0.73 ± 0.08; ROC-AUC > 0.93), with performance comparable to an OMC-based model and prediction timing suitable for real-time applications. Explainability analysis revealed dominant contributions from trunk dynamics and wrist-trunk interaction features. In preliminary evaluation using recordings from four participants with neurological conditions, the model retained good discriminative capability (ROC-AUC ~ 0.78), but showed reduced and variable threshold-dependent performance, highlighting challenges in clinical generalization. These results support sparse wearable sensing as a viable pathway toward scalable, real-time monitoring of CTMs during therapy and daily living.
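For readers less familiar with the reported metrics, the sketch below computes macro-F1 and the Matthews correlation coefficient (MCC) from a confusion matrix and generates leave-one-subject-out splits. It assumes a binary CTM / no-CTM labeling for simplicity, which may not match the study's label set. The function names and the example counts are hypothetical and are not the study's data.

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Return (macro-F1, MCC) for a binary confusion matrix."""
    def f1(hits, false_alarms, misses):
        denom = 2 * hits + false_alarms + misses
        return 2 * hits / denom if denom else 0.0

    f1_pos = f1(tp, fp, fn)
    f1_neg = f1(tn, fn, fp)  # same formula with the negative class treated as the target
    macro_f1 = (f1_pos + f1_neg) / 2.0

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return macro_f1, mcc

def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test

macro_f1, mcc = binary_metrics(tp=40, fp=10, fn=10, tn=40)
folds = list(leave_one_subject_out(["a", "a", "b", "c"]))
```

Holding out whole subjects keeps any one person's movement style out of both training and test data, which is why this split is the relevant one when the goal is generalization to new patients.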
Dasharadhan Mahalingam et al.
Core Contributions
  • Demonstrates robotic manipulation for precision nanoparticle synthesis using screw geometry-based manipulation techniques.
  • Enables programming robot behaviors through demonstration, reducing need for explicit task specification in chemical processes.
  • Opens novel application domain: autonomous chemical synthesis with precision robotic control.
  • Shows potential for automating laboratory processes that traditionally require human expertise and manual control.
Nanoparticle synthesis is complex and sensitive to process parameters. This work demonstrates robotic manipulation for solution-based nanoparticle synthesis. The approach uses screw geometry-based manipulation strategies combined with learning from demonstration to enable robots to execute synthesis procedures. The robot learns to perform complex chemical operations including mixing, heating, and temporal control. The work expands robotic applications to precision materials synthesis, demonstrating that robot manipulation can improve consistency and enable automation of sophisticated chemical processes traditionally requiring manual control.
Sabyasachi Dash, Girish Krishnan et al.
Core Contributions
  • Proposes reconfigurable tendon-driven continuum manipulator (TDCM) with rotatable spacer disks enabling adaptive morphology.
  • Demonstrates shape matching in curvature-torsion space, providing interpretable and efficient workspace modeling.
  • Reduces actuation complexity while maintaining dexterity through mechanical design innovations.
  • Provides design methodology for reconfigurable continuum manipulators applicable to multiple application domains.
Tendon-driven continuum manipulators offer dexterity but require precise kinematic modeling in high-dimensional spaces. This work introduces a reconfigurable TDCM design with rotatable spacer disks that enable mechanical morphology changes. The key contribution is workspace modeling in curvature-torsion space, which significantly reduces the effective dimensionality while preserving dexterity. The rotatable spacer design enables shape matching through simplified actuation, reducing control complexity. The approach demonstrates how mechanical design can simplify control and improve interpretability of continuum manipulator behavior.
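One reason curvature-torsion space is compact is that a segment with constant curvature κ and constant torsion τ is a circular helix, so two numbers fix its shape. The sketch below evaluates points on such a segment by arc length. The constant-κ, constant-τ assumption is made for illustration and may not match the paper's model, and `helix_point` is a hypothetical name.

```python
import math

def helix_point(kappa, tau, s):
    """Point at arc length s on a curve with constant curvature kappa and torsion tau.

    The curve is a circular helix about the z-axis with radius
    kappa / (kappa**2 + tau**2) and rise per radian tau / (kappa**2 + tau**2).
    """
    w2 = kappa**2 + tau**2
    if w2 == 0.0:
        return (0.0, 0.0, s)  # zero curvature and torsion: a straight segment
    w = math.sqrt(w2)
    radius = kappa / w2
    rise = tau / w2
    angle = w * s  # the helix angle advances at rate w per unit arc length
    return (radius * math.cos(angle), radius * math.sin(angle), rise * angle)

# kappa = 1, tau = 0 is a planar unit circle: half a turn ends opposite the start.
x, y, z = helix_point(1.0, 0.0, math.pi)
```

Setting τ = 0 recovers the planar constant-curvature arc commonly used for continuum robots, while a non-zero τ lifts the segment out of the plane.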
Fumihiko Asano et al.
Core Contributions
  • Develops linearized biped model enabling instantaneous walkability determination without numerical integration.
  • Provides analytical foundations for stable gait generation in bipedal systems with knee joints.
  • Enables real-time evaluation of gait feasibility critical for dynamic balance control.
  • Offers theoretical framework applicable to bipedal robot design and control optimization.
Bipedal walking requires dynamic stability and coordination of multiple joints. Determining whether a proposed gait is feasible is computationally expensive with nonlinear models. This work develops a linearized model for planar bipeds with knees that enables instantaneous walkability determination without simulation. The linearized approach provides analytical insights into stability conditions and enables real-time feasibility checking. The method generates asymptotically stable gaits and provides fast evaluation of trajectory feasibility. The work offers theoretical tools for bipedal locomotion control and robot design optimization.

Human-Robot Interaction

Christopher D. Wallbridge, Erwin Jose Lopez Pulgarin
Core Contributions
  • Presents position paper on error recovery in human-robot collaborative systems, highlighting safety-critical design principles.
  • Uses nuclear glovebox operations as concrete case study demonstrating high-stakes error recovery requirements.
  • Identifies key design challenges: detecting errors in time, communicating failures to human operators, enabling safe recovery.
  • Provides research agenda for robust human-robot teams in safety-critical applications beyond typical manipulation tasks.
Human-robot collaborative systems must gracefully handle errors to ensure safety and maintain collaboration. Error recovery (detecting, communicating, and correcting failures) is often overlooked in favor of nominal performance. This position paper argues for systematic attention to error recovery design. Using nuclear glovebox operations as a case study, the authors identify key challenges: (1) timely error detection in complex tasks; (2) effective human-robot communication of failure state; (3) coordinated recovery strategies. The paper outlines research directions for robust collaborative systems in safety-critical domains where errors have high consequences.