arXiv Robotics Digest

Research Landscape

The April 9 digest shows a field diversifying across hardware, learning paradigms, and application domains. Foundation-model approaches to manipulation (HEX, ViVa, ActiveGlasses, BLaDA) now sit alongside more classical learning methods (SIM1's physics-aligned data engine, PriPG-RL's privileged planning), which suggests the community is moving past winner-take-all debates toward pragmatic engineering. Systematic sim-to-real work (SIM1 reporting 90% zero-shot success at a 1:15 ratio, Sumo's whole-body loco-manipulation) points to maturation: practitioners care less about novelty and more about repeatability and scale.

Autonomous driving and aerial systems remain primary testbeds. CrashSight and Fail2Drive push benchmarking rigor, with Fail2Drive reporting a 22.8% average success-rate drop under distribution shift, while RAGE-XY demonstrates real-time tire force estimation on racing platforms using RADAR+IMU fusion. Multi-robot work tackles coordination without centralized control (Karma mechanisms for multi-agent path finding), maritime work addresses explainable collision avoidance, and an aerial VLN survey charts emerging frontiers for embodied language understanding. Infrastructure contributions (the AgiPIX platform with digital twins, acoustic slip sensing via A-SLIP) reflect the value 2026 robotics places on repeatable systems.

Bio-inspired and hardware-centric contributions round out the digest: soft robot co-design (EvoGymCM's continuous material stiffness), a bird-wing mechanism that achieves flapping with a single actuator, and a chick-robot affective interface show an ecosystem still investing in morphological innovation. Together with perception and estimation advances (GEAR's Gaussian splatting for articulated objects, attitude estimation via complementary filtering on SO(3)), these papers sketch a field balancing foundation models with domain specificity; neither pure learning nor pure engineering dominates 2026.

VLA & Foundation Models for Manipulation

Vision-language approaches, policy learning, embodied LLMs

Autonomous Driving & Vehicle Intelligence

Benchmarks, planning, trajectory prediction, force control

Robot Learning & Sim-to-Real Transfer

Data generation, privileged learning, policy distillation, temporal modeling

Sensing, Estimation & Perception

State estimation, tactile sensing, SO(3) filtering, 3D reconstruction

Aerial, Multi-Robot & Maritime Systems

UAV autonomy, multi-agent coordination, world models, maritime navigation

Hardware Design & Community

Soft robotics, bio-inspired mechanisms, affective interfaces, sustainability

VLA & Foundation Models for Manipulation

HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation

h-index: 26 cs.RO

Shuanghao Bai, Meng Li, Xinyuan Lv, Jiawei Wang, Xinhua Wang

Core Contributions

State-centric architecture with Mixture-of-Experts and flow-matching action head enables cross-embodiment transfer without task-specific retraining, advancing VLA generalization beyond single morphologies
Humanoid-aligned expert specialization directly optimizes for anthropomorphic kinematic structures, improving performance versus generic expert pooling
Demonstrates scalable whole-body manipulation framework applicable to biped platforms with different hardware properties

State-centric framework for humanoid manipulation using Mixture-of-Experts and flow-matching action head for cross-embodiment whole-body control.

ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

h-index: 13 cs.RO, cs.AI

Jindi Lv, Hao Li, Jie Li, Yifei Nie, Fankun Kong

Core Contributions

Repurposes pretrained video generators (diffusion models) as value function estimators, reducing dependency on reward signal design and enabling grounding in embodiment dynamics
Novel architecture leverages visual imagination to guide RL policy optimization, bridging perceptual uncertainty and value estimation in continuous control
Demonstrates video-based value estimation improves sample efficiency versus scalar reward baselines, opening video generators as underutilized assets for embodied learning

Video generator repurposed for value estimation grounding RL in anticipated embodiment dynamics.

ActiveGlasses: Learning Manipulation with Active Vision from Ego-centric Human Demonstration

h-index: 10 cs.RO

Yanwen Zou, Chenyang Shi, Wenye Yu, Han Xue, Jun Lv

Core Contributions

Smart glasses capture ego-centric human manipulation with active gaze tracking, providing richer demonstration signal than passive vision—enables zero-shot robot policy transfer by aligning viewpoint
Active vision (head motion + gaze) conveys task-relevant spatial attention, reducing the domain gap between human demos and robot execution
First study showing smart glasses as manipulation learning interface, opening consumer AR hardware for robotics data collection at scale

Smart glasses capture human demos with active vision for zero-shot robot manipulation transfer.

RoboAgent: Chaining Basic Capabilities for Embodied Task Planning

h-index: 4 cs.RO, cs.CV

Peiran Xu, Jiaqi Zheng, Yadong Mu

Core Contributions

Capability-driven VLM pipeline decomposes embodied planning into atomic sub-capabilities (grasp, move, place) with multi-stage training, improving compositional generalization
Explicit factorization of task planning and skill execution separates concerns better than end-to-end models, improving interpretability and scalability
Demonstrates modular skill composition outperforms monolithic policies on unseen task combinations, supporting hierarchical task learning

Capability-driven VLM pipeline decomposes embodied planning into sub-capabilities with multi-stage training.

BLaDA: Bridging Language to Functional Dexterous Actions within 3DGS Fields

h-index: 3 cs.CV, cs.RO

Fan Yang, Wenrui Chen, Guorun Yan, Ruize Liao, Wanjun Jia

Core Contributions

Zero-shot language-to-dexterous-grasp translation via 3D Gaussian Splatting grounds language in scene geometry, eliminating per-task fine-tuning
Triangular functional point localization enables precise contact prediction from scene representations, advancing from coarse grasp heuristics to spatially grounded policies
Demonstrates vision language models can emit dexterous grasps directly from language+geometry fusion, expanding VLM applicability beyond navigation and manipulation planning

Zero-shot language-to-dexterous-grasp via 3D Gaussian Splatting with triangular functional point localization.

Autonomous Driving & Vehicle Intelligence

Open-Ended Instruction Realization with LLM-Enabled Multi-Planner Scheduling in Autonomous Vehicles

h-index: 22 cs.RO, cs.CV

Jiawei Liu, Xun Gong, Fen Fang, Muli Yang, Bohao Qu

Core Contributions

LLM translates open-ended passenger instructions into executable multi-modal MPC planner scripts, bridging natural language and structured planning—first system enabling end-to-end language-conditioned autonomous driving
Multi-planner scheduling intelligently routes instructions to appropriate controllers (trajectory, behavior, safety), enabling richer interaction vocabulary than single-task baselines
Demonstrates compositional instruction decomposition improves handling of complex user requests, advancing autonomous vehicle human-AI interaction

LLM translates passenger instructions into executable MPC planner scripts for autonomous driving.

CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding

h-index: 13 cs.CV, cs.AI, cs.RO

Rui Gan, Junyi Ma, Pei Li, Xingyou Yang, Kai Chen

Core Contributions

250-crash-video benchmark with 13K QA pairs from roadside camera perspective—first systematic dataset for infrastructure-centric safety assessment versus ego-vehicle centric approaches
Phase-aware annotation (pre-crash, crash, post-crash) enables temporal understanding of traffic incidents, improving VLM reasoning about causality and blame
Demonstrates significant VLM performance gap on safety-critical infrastructure tasks, establishing benchmark for future vision-language models in autonomous systems

250-crash-video benchmark with 13K QA pairs for VLM evaluation from roadside camera perspective.

Fail2Drive: Benchmarking Closed-Loop Driving Generalization

h-index: 10 cs.RO, cs.CV

Simon Gerstenecker, Andreas Geiger, Katrin Renz

Core Contributions

Paired-route benchmark with 200 routes and 17 distribution shift categories reveals 22.8% average success-rate drop—demonstrates critical gap between development and deployment robustness
Closed-loop evaluation (real-time planner in action) versus open-loop metrics better captures actual autonomous vehicle failure modes and recovery strategies
Systematic taxonomy of distribution shifts (weather, traffic density, road types) guides future robustness research, establishing new standard for AV benchmarking

Paired-route benchmark with 200 routes and 17 shift classes showing 22.8% average success-rate drop.
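
For readers new to paired-route evaluation, the following is a minimal sketch of how an average success-rate drop could be computed from paired outcomes. The function name, the example outcomes, and the choice to report the drop in percentage points are illustrative assumptions; this digest does not state how the benchmark aggregates its 22.8% figure.

```python
def average_success_drop(pairs):
    """Return the success-rate drop in percentage points.

    pairs: list of (base_success, shifted_success) booleans, one entry per
    paired route (hypothetical data layout, not the benchmark's own format).
    """
    if not pairs:
        raise ValueError("need at least one paired route")
    base_rate = sum(1 for base, _ in pairs if base) / len(pairs)
    shifted_rate = sum(1 for _, shifted in pairs if shifted) / len(pairs)
    return 100.0 * (base_rate - shifted_rate)


# Hypothetical outcomes for five paired routes: 4/5 succeed on the base
# route, 2/5 succeed on the shifted counterpart.
example_pairs = [
    (True, True),
    (True, False),
    (True, True),
    (False, False),
    (True, False),
]
drop = average_success_drop(example_pairs)
```

Pairing each shifted route with an unshifted counterpart means the drop can be attributed to the shift itself rather than to differences in route difficulty.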

On-Policy Distillation of Language Models for Autonomous Vehicle Motion Planning

h-index: 5 cs.RO, cs.AI, eess.SY

Amirhossein Afsharrad, Amirhesam Abedsoltan, Ahmadreza Moradipari, Sanjay Lall

Core Contributions

Generalized Knowledge Distillation (GKD) trains 5x smaller student models from GPT-Driver teacher, approaching teacher-level nuScenes performance while enabling edge deployment
On-policy distillation preserves teacher's decision-making under deployment conditions, versus offline approaches that may accumulate distribution mismatch
Demonstrates LLM-based planning can be efficiently compressed for resource-constrained autonomous vehicles, bridging foundation models and embedded systems

GKD distills 5x smaller student from GPT-Driver teacher approaching teacher-level nuScenes performance.

RAGE-XY: RADAR-Aided Longitudinal and Lateral Forces Estimation For Autonomous Race Cars

h-index: 5 cs.RO

Davide Malvezzi, Nicola Musiu, Eugenio Mascaro, Francesco Iacovacci, Marko Bertogna

Core Contributions

RADAR+IMU framework estimates tire lateral and longitudinal forces in real-time on autonomous race cars, enabling closed-loop force control without strain gauges
Online calibration adapts to track-specific tire properties and wear, improving practical deployment robustness versus offline calibration approaches
Demonstrates that indirect force estimation via sensor fusion is viable for high-speed autonomous platforms, reducing instrumentation complexity for racing and performance driving

RADAR+IMU framework for real-time tire force estimation on autonomous race car with online calibration.

Robot Learning & Sim-to-Real Transfer

A Unified Multi-Layer Framework for Skill Acquisition from Imperfect Human Demonstrations

h-index: 24 cs.RO

Zi-Qi Yang, Mehrdad R. Kermani

Core Contributions

Layered control framework for learning from demonstration robustly handles imperfect human trajectories via variable impedance learning and null-space safety injection
Explicit impedance layer adapts compliance to contact forces, improving task success on compliant manipulation versus stiff tracking approaches
Null-space safety module prevents self-collisions and joint limits during learning, enabling safe LfD for collaborative robots without manual trajectory filtering

Layered control framework for compliant LfD with variable impedance learning and null-space safety.

SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds

h-index: 10 cs.RO, cs.AI, cs.CV

Yunsong Zhou, Hangxu Liu, Xuekun Jiang, Xing Shen, Yuanzhen Zhou

Core Contributions

Real-to-sim-to-real pipeline achieves 90% zero-shot manipulation success at 1:15 sim-to-real scale ratio, substantially outperforming prior sim2real approaches in deformable object handling
Physics-aligned simulator design prioritizes accurate contact dynamics and material property modeling over photorealism, improving transfer versus appearance-focused engines
Demonstrates data scalability without additional real robot experiments, enabling scalable sim-to-real paradigm for deformable manipulation tasks

Real-to-sim-to-real data engine for deformable manipulation achieving 90% zero-shot success at 1:15 ratio.

PriPG-RL: Privileged Planner-Guided Reinforcement Learning for Partially Observable Systems

h-index: 6 cs.LG, cs.RO

Mohsen Amiri, Ali Beikmohammadi, Sindri Magnusson, Mehdi Hosseinzadeh

Core Contributions

Privileged MPC planner (with full state access) distills knowledge to RL policy under partial observability, enabling POMDP learning without complete state reconstruction
Teacher-student framework uses classical planning where full state is available (e.g., in simulation) and transitions to learned policies for deployment, a hybrid approach that balances robustness and adaptability
Deployed on Unitree Go2 quadruped with real-time constraints, demonstrating practical applicability versus simulation-only studies

Privileged MPC planner distills knowledge to RL policy under partial observability; deployed on Unitree Go2.

Exploring Temporal Representation in Neural Processes for Multimodal Action Prediction

h-index: 4 cs.RO, cs.AI

Marco Gabriele Fedozzi, Yukie Nagai, Francesco Rea, Alessandra Sciutti

Core Contributions

Mirror neuron-inspired DMBN-PTE improves temporal encoding for action prediction, bridging neuroscience and robot learning via temporal attention mechanisms
Multimodal fusion of vision and proprioception enables richer state representation for predicting human-robot collaboration actions
Demonstrates biologically-grounded architectures improve prediction accuracy versus generic temporal models, suggesting embodied learning benefits from neuroscience insights

Mirror neuron-inspired DMBN-PTE improves temporal encoding for visuo-motor action prediction.

Sumo: Dynamic and Generalizable Whole-Body Loco-Manipulation

h-index: 13 cs.RO

John Z. Zhang, Maks Sorokin, Jan Brüdigam, Brandon Hung, Stephen Phillips

Core Contributions

Sim-to-real whole-body loco-manipulation with test-time steering enables pre-trained policies to adapt to novel heavy-object tasks without retraining, improving generalization
Unified framework combines locomotion and manipulation control, addressing the challenge of coordinating base motion and arm trajectories simultaneously
Demonstrates real robot dynamics adaptation enables successful object transport, validating sim-to-real transfer for complex multi-task behaviors

Sim-to-real whole-body loco-manipulation with test-time steering of pre-trained policy for heavy objects.

Visually-grounded Humanoid Agents

h-index: 5 cs.CV, cs.RO

Hang Ye, Xiaoxuan Ma, Fan Lu, Wayne Wu, Kwan-Yee Lin

Core Contributions

Two-layer paradigm enables autonomous digital humans to act in first-person perspective within reconstructed 3D environments, advancing embodied AI beyond third-person avatars
Vision-grounded control integrates perception and action tightly, enabling natural embodiment in synthetic scenes with realistic spatial constraints
Framework applicable to both simulation and photorealistic environments, suggesting scalability toward real-robot humanoid control

Two-layer paradigm for autonomous digital humans with first-person perception in reconstructed 3D scenes.

Sensing, Estimation & Perception

A Soft Robotic Interface for Chick-Robot Affective Interactions

h-index: 58 cs.RO, cs.HC

Jue Chen, Alexander Mielke, Kaspar Althoefer, Elisabetta Versace

Core Contributions

Soft robotic interface with warmth, breathing, and face-like stimuli designed for animal-robot interaction—first study of affective soft robotics in inter-species contexts
Biologically-inspired morphology (soft materials, thermal properties) proves more engaging to chicks than conventional rigid interfaces, validating bio-inspired design for animal systems
Opens robotics applications to behavioral biology, enabling controlled interaction studies that were previously limited to human subjects or ethically constrained conditions

Soft robotic affective interface for chicks with warmth, breathing, face-like stimuli for animal-robot interaction.

State and Trajectory Estimation of Tensegrity Robots via Factor Graphs and Chebyshev Polynomials

h-index: 47 cs.RO

Edgar Granados, Patrick Meng, Charles Tang, Shrimed Sangani, William R. Johnson

Core Contributions

Factor graph approach fuses RGB-D camera observations with cable length sensors for tensegrity robot state estimation, improving observability of cable-driven structures
Chebyshev polynomial trajectory basis enables efficient parameterization of complex tensegrity dynamics, reducing state space dimensionality versus raw trajectory recording
Demonstrates that hybrid sensing (vision + proprioception) is critical for soft robots where traditional rigid-body assumptions fail, advancing state estimation theory for underactuated systems

Factor graph approach for tensegrity robot state estimation fusing RGB-D camera with cable length sensors.
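
To illustrate why a Chebyshev basis gives a compact trajectory parameterization, here is a minimal sketch that evaluates one trajectory coordinate from a handful of coefficients using Clenshaw's recurrence. The function name, the time window, and the coefficient values are assumptions for illustration; the paper's actual basis order and factor-graph integration are not described in this digest.

```python
def chebyshev_eval(coeffs, t, t0, t1):
    """Evaluate sum_k coeffs[k] * T_k(s), where s maps t from [t0, t1] to [-1, 1].

    coeffs must contain at least one coefficient.
    """
    s = (2.0 * t - (t0 + t1)) / (t1 - t0)
    b1 = 0.0
    b2 = 0.0
    # Clenshaw recurrence, run from the highest-order coefficient down to k = 1.
    for c in reversed(coeffs[1:]):
        b1, b2 = 2.0 * s * b1 - b2 + c, b1
    return coeffs[0] + s * b1 - b2


# Hypothetical coefficients describing one coordinate over a 1-second window.
coeffs = [1.0, 0.5, -0.25]
value = chebyshev_eval(coeffs, t=0.75, t0=0.0, t1=1.0)
```

Three coefficients here stand in for an arbitrarily dense sequence of samples over the window, which is the dimensionality reduction the bullet above refers to.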

Complementary Filtering on SO(3) for Attitude Estimation with Scalar Measurements

h-index: 14 eess.SY, cs.RO

Alessandro Melis, Soulaimane Berkane, Tarek Hamel

Core Contributions

SO(3) observer design for attitude estimation from scalar measurements achieves almost-global stability, enabling robust attitude control from limited sensor suites
Complementary filtering framework integrates inertial measurements with gravity/magnetic field constraints, improving accuracy versus gyro-only integration
Theoretical stability analysis grounds the approach in control theory, advancing robustness certification for robot attitude estimation systems

SO(3) observer for attitude estimation from scalar measurements with almost-global stability.
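
As background, the sketch below implements the classic explicit complementary filter on SO(3) with full vector measurements and no gyro-bias estimation. It is not the paper's scalar-measurement observer; it only shows the general structure (gyro propagation plus a correction term built from measured versus predicted directions). All function names, gains, and test values are illustrative assumptions.

```python
import math


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose_apply(R, v):
    """Return R^T v."""
    return [sum(R[k][i] * v[k] for k in range(3)) for i in range(3)]


def exp_so3(w):
    """Rodrigues' formula: rotation matrix for the rotation vector w."""
    theta = math.sqrt(sum(x * x for x in w))
    K = [[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]]
    K2 = matmul(K, K)
    if theta < 1e-9:
        a, b = 1.0, 0.5  # small-angle limits of the two coefficients
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / theta ** 2
    return [[(1.0 if i == j else 0.0) + a * K[i][j] + b * K2[i][j]
             for j in range(3)] for i in range(3)]


def filter_step(R_hat, gyro, measured_dirs, reference_dirs, k_p, dt):
    """One update: R_hat <- R_hat * exp((gyro + k_p * correction) * dt)."""
    correction = [0.0, 0.0, 0.0]
    for v_meas, v_ref in zip(measured_dirs, reference_dirs):
        v_pred = transpose_apply(R_hat, v_ref)  # predicted body-frame direction
        e = cross(v_meas, v_pred)               # alignment error term
        correction = [c + x for c, x in zip(correction, e)]
    omega = [g + k_p * c for g, c in zip(gyro, correction)]
    return matmul(R_hat, exp_so3([x * dt for x in omega]))


# Hypothetical test: true attitude is the identity (so measured directions
# equal the references); the estimate starts 0.5 rad off about the z-axis.
R_hat = exp_so3([0.0, 0.0, 0.5])
refs = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
for _ in range(500):
    R_hat = filter_step(R_hat, [0.0, 0.0, 0.0], refs, refs, k_p=2.0, dt=0.01)
```

In this toy run the estimate converges toward the identity. Handling the case where only scalar measurements are available, rather than full direction vectors, is the problem the paper addresses.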

A-SLIP: Acoustic Sensing for Continuous In-hand Slip Estimation

h-index: 8 cs.RO

Uksang Yoo, Yuemin Mao, Jean Oh, Jeffrey Ichnowski

Core Contributions

Piezoelectric microphone system achieves a 14.1-degree directional slip error, a 64% improvement attributed to multi-channel acoustic fusion, enabling low-cost slip detection
Acoustic sensing complements force/vision approaches, providing high-frequency slip signals without visual occlusion or complex tactile fabrication
Demonstrates that passive acoustic sensing is viable for robotic grasping and potentially scalable to multi-finger hands without per-finger instrumentation

Piezoelectric microphone system for slip estimation: 14.1 degree directional error, 64% improvement with multi-channel.

GEAR: GEometry-motion Alternating Refinement for Articulated Object Modeling with Gaussian Splatting

h-index: 7 cs.CV, cs.GR, cs.RO

Jialin Li, Bin Fu, Ruiping Wang, Xilin Chen

Core Contributions

EM-style alternating refinement jointly models articulated object geometry and motion in Gaussian Splatting framework, enabling accurate 3D reconstruction of dynamic scenes
Disentangled representation of static geometry and articulated motion improves over monolithic approaches, facilitating reuse of object models across scenes
Enables downstream robot tasks (grasp planning, trajectory prediction) via richer 3D scene understanding, bridging perception and manipulation planning

EM-style Gaussian Splatting framework for articulated object geometry and motion joint modeling.

Aerial, Multi-Robot & Maritime Systems

AgiPIX: Bridging Simulation and Reality in Indoor Aerial Inspection

h-index: 29 cs.RO

Sasanka Kuruppu Arachchige, Juan Jose Garcia, Changda Tian, Lauri Suomela, Panos Trahanias

Core Contributions

Open-source platform for indoor aerial autonomy with integrated digital twin and containerized ROS 2 stack enables rapid development and validation without custom infrastructure
Digital twin synchronization enables sim-to-real transfer for UAV planning, reducing gap between simulation experiments and field deployment
Containerized middleware abstracts hardware details, enabling portability across drone platforms and lowering barrier to drone research adoption

Open-source platform for indoor aerial autonomy with digital twin and containerized ROS 2 stack.

Karma Mechanisms for Decentralised, Cooperative Multi Agent Path Finding

h-index: 23 eess.SY, cs.RO

Kevin Riehl, Julius Schlapbach, Anastasios Kouvelas, Michail A. Makridis

Core Contributions

Non-tradeable Karma credits enable decentralized, fair MAPF in warehouse scenarios without centralized coordinator, improving scalability for large swarms
Mechanism design prevents credit exploitation while encouraging cooperation, providing game-theoretic fairness guarantees for multi-agent systems
Demonstrated on warehouse logistics, showing practical applicability of mechanism design to embodied multi-agent coordination

Non-tradeable Karma credits for decentralized MAPF fairness in warehouse scenarios.
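
A generic karma auction can be sketched as follows: agents contending for the same cell bid from their own balances, the highest bidder moves first, and its payment is redistributed to the agents that yielded. This is an illustrative assumption about how such a mechanism might work, not the paper's rules; the function name, tie-breaking, and redistribution scheme are all hypothetical.

```python
def resolve_conflict(balances, bids):
    """Resolve one contested cell and update karma balances in place.

    balances, bids: dicts mapping agent id -> non-negative integer karma.
    Returns the id of the agent that wins priority.
    """
    for agent, bid in bids.items():
        if not 0 <= bid <= balances[agent]:
            raise ValueError(f"bid by {agent} is outside its karma balance")
    # Highest bid wins; ties go to the smallest agent id for determinism.
    winner = max(sorted(bids), key=lambda agent: bids[agent])
    losers = sorted(agent for agent in bids if agent != winner)
    if not losers:
        return winner  # uncontested: no payment is made
    payment = bids[winner]
    balances[winner] -= payment
    # Split the payment among the yielding agents so total karma is conserved.
    share, remainder = divmod(payment, len(losers))
    for index, agent in enumerate(losers):
        balances[agent] += share + (1 if index < remainder else 0)
    return winner


# Hypothetical three-agent conflict.
balances = {"a": 5, "b": 3, "c": 4}
winner = resolve_conflict(balances, {"a": 2, "b": 3, "c": 1})
```

Because credits only move between agents through these conflicts and cannot be bought or traded, an agent that yields now gains bidding power for later conflicts, which is the intuition behind the fairness claim.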

Semantic-Aware UAV Command and Control for Efficient IoT Data Collection

h-index: 10 cs.RO

Assane Sankara, Daniel Bonilla Licea, Hajar El Hammouti

Core Contributions

DDQN-based UAV flight policy prioritizes semantically-relevant IoT image data collection, improving information utility per mission time versus uniform coverage approaches
Semantic awareness (object detection, scene understanding) guides trajectory planning, enabling task-specific data collection without manual route specification
Demonstrates RL-based semantic planning outperforms scripted IoT missions, suggesting embodied agents can be more intelligent data collectors than fixed-path systems

DDQN-based UAV flight policy for semantic-aware IoT image data collection.
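
Assuming DDQN here denotes Double DQN, the core of the method is its bootstrap target: the online network selects the next action and the target network evaluates it. The sketch below shows that target for a single transition; the function name and the Q-values are illustrative, and the paper's state, action, and reward design are not described in this digest.

```python
def ddqn_target(reward, done, next_q_online, next_q_target, gamma=0.99):
    """Double DQN target for one transition.

    next_q_online / next_q_target: per-action Q-values for the next state
    from the online and target networks respectively.
    """
    if done:
        return reward
    # Action selection uses the online network ...
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    # ... while its value is read from the target network.
    return reward + gamma * next_q_target[best_action]


# Hypothetical Q-values for a three-action UAV policy.
target = ddqn_target(
    reward=1.0,
    done=False,
    next_q_online=[0.2, 0.9, 0.4],
    next_q_target=[0.5, 0.3, 0.8],
    gamma=0.9,
)
```

With these numbers the target is 1.0 + 0.9 × 0.3 = 1.27, whereas a standard DQN target would take the maximum target-network value (0.8) and give 1.72. Decoupling selection from evaluation is what reduces overestimation.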

Vision-Language Navigation for Aerial Robots: Towards the Era of Large Language Models

h-index: 9 cs.RO

Xingyu Xia, Lekai Zhou, Yujie Tang, Xiaozhou Zhu, Hai Zhu

Core Contributions

Survey of aerial VLN with systematic taxonomy of 5 architectural categories (end-to-end, two-stage, modular, hierarchical, LLM-based) guides future research directions
Identifies 7 open problems (including generalization, efficiency, real-world deployment, and multimodal fusion) critical for advancing VLN beyond simulators to field robotics
Establishes aerial VLN as emerging frontier, positioning UAVs as next domain for embodied language understanding after ground navigation

Survey of aerial VLN with taxonomy of 5 architectural categories and 7 open problems.

Why This Avoidance Maneuver? Contrastive Explanations in Human-Supervised Maritime Autonomous Navigation

h-index: 7 cs.AI, cs.RO

Joel Jose, Andreas Madsen, Andreas Brandsæter, Tor A. Johansen, Erlend M. Coates

Core Contributions

Contrastive explanations for maritime collision avoidance provide human-interpretable justifications for autonomous decisions, improving marine officer trust and compliance
User study with 4 marine officers validates effectiveness of explanations versus black-box autonomous systems, establishing human factors importance for maritime automation
Demonstrates that explainability is critical for high-stakes autonomous systems where human supervision remains legally and operationally required

Contrastive explanations for maritime collision avoidance with user study of 4 marine officers.

WorldMAP: Bootstrapping Vision-Language Navigation Trajectory Prediction with Generative World Models

h-index: 7 cs.AI, cs.CV, cs.RO

Hongjin Chen, Shangyun Jiang, Tonghua Su, Chen Gao, Xinlei Chen

Core Contributions

World model teacher generates structured supervision for VLN trajectory prediction, achieving 18% absolute ADE reduction versus direct imitation learning
Two-stage learning (world model pre-training + student distillation) improves generalization, suggesting intermediate representations help embodied understanding
Demonstrates generative models can serve as privileged teachers for navigation, opening new paradigm for leveraging pre-trained models in embodied tasks

World model teacher generates structured supervision for student VLN trajectory predictor; 18% ADE reduction.
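
For reference, ADE (average displacement error) is the mean Euclidean distance between predicted and reference waypoints at matching time steps. A minimal sketch follows; the function name and the example trajectories are illustrative, and the units and evaluation protocol behind the paper's 18% figure are not given in this digest.

```python
import math


def average_displacement_error(predicted, reference):
    """Mean Euclidean distance between corresponding waypoints."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("trajectories must be non-empty and of equal length")
    total = sum(math.dist(p, r) for p, r in zip(predicted, reference))
    return total / len(predicted)


# Hypothetical 2D trajectories with per-step errors of 0, 1, and 2 units.
predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
reference = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade = average_displacement_error(predicted, reference)
```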

Hardware Design & Community

The Sustainability Gap in Robotics: A Large-Scale Survey of Sustainability Awareness in 50,000 Research Articles

h-index: 19 cs.RO, cs.CY

Antun Skuric, Leandro Von Werra, Thomas Wolf

Core Contributions

Large-scale survey of approximately 50,000 arXiv cs.RO papers reveals sustainability motivation below 5%, identifying critical gap between robotics research and planetary imperatives
Quantitative analysis reveals systematic bias: the robotics community underweights environmental considerations relative to the medical, energy, and materials fields
Calls for integration of sustainability as first-class research objective, reshaping field values and funding priorities toward climate-aware robotics

Survey of approximately 50,000 arXiv cs.RO papers showing sustainability motivation below 5%.

Bird-Inspired Spatial Flapping Wing Mechanism via Coupled Linkages with Single Actuator

h-index: 6 cs.RO

Daniel Huczala, Sun-Pill Jung, Frank C. Park

Core Contributions

Two coupled spatial four-bar linkages realize bird-like sweep-and-fold wing motion with single motor, reducing actuation complexity versus multi-DOF designs
Bio-inspired mechanical design enables efficient flapping without explicit control algorithms, leveraging passive mechanics for flight stability
Demonstrates mechanical advantage of biomorphic structure, suggesting nature's morphologies embed solutions to control challenges

Two coupled spatial four-bar linkages realize sweep-and-fold wing motion with single motor.

EvoGymCM: Harnessing Continuous Material Stiffness for Soft Robot Co-Design

h-index: 3 cs.RO

Le Shen, Kangyao Huang, Wentao Zhao, Huaping Liu

Core Contributions

Benchmark for continuous material stiffness optimization in soft robot morphology-material-control co-design enables exploration of material properties as design variables
Allows joint optimization of body structure, material properties, and control policies rather than sequential design, improving overall robot performance
Demonstrates computational co-design framework applicable beyond soft robotics to modular and reconfigurable systems

Benchmark for continuous material stiffness optimization in soft robot morphology-material-control co-design.