arXiv Robotics Digest

Research Landscape

Today's batch reveals a strong focus on scaling robot learning through better data collection and transfer mechanisms. Papers like RoSHI, TAMEn, and BiDexGrasp attack the data bottleneck from complementary angles—wearable sensing, tactile-aware systems, and large-scale grasp annotation—suggesting the field is moving beyond algorithmic breakthroughs toward systematic solutions for data scarcity. These efforts directly enable downstream work in policy transfer (Learning-Based Assembly, Sustainable Transfer) and sim-to-real robustness (Robust Quadruped Locomotion via evolutionary RL).

Multi-robot coordination continues to mature with dual approaches gaining traction: classical declarative and rule-based methods (Aggregate Programming, Logical Robots) coexist with learning-based planners (Train-Small Deploy-Large, Differentiable Environment-Trajectory Co-Optimization). This reflects pragmatism—formal methods provide safety guarantees for swarms, while diffusion models and bi-level optimization unlock flexibility in dynamic, partially observable environments. The fact that both lineages advance simultaneously suggests neither dominates the design space.

Vision-language models have graduated from pure task execution toward autonomous self-diagnosis: KITE demonstrates that tokenized, keyframe-anchored evidence can substantially improve VLM-based failure detection beyond vanilla Qwen2.5-VL. Simultaneously, scene understanding pipelines (Genie Sim PanoRecon, MoRight) enable better 3D reconstruction and disentangled motion control, bridging perception and planning. Infrastructure requirements (RTK-SLAM Dataset, CADENCE energy-aware sensing) and foundational architectures (AEROS, RichMap) round out the ecosystem, signaling that 2026 robotics is as much about systems integration and measurement rigor as novel learning algorithms.

Vision-Language Models & Scene Generation

VLM-based failure analysis, motion control, 3D reconstruction

Multi-Robot Coordination & Planning

Aggregate programming, diffusion planners, declarative multi-agent

Manipulation & Grasping

Bimanual grasping, tactile sensing, peg-in-hole assembly

Motion Planning & Robot Architecture

Flow matching, OS design, quadruped locomotion

SLAM, Localization & Autonomous Driving

Visual SLAM, RTK positioning, trajectory prediction

Human Data, Bio-Inspired & Infrastructure

Wearable sensing, proprioceptive joints, telecom world models

Vision-Language Models & Scene Generation

KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis

h-index: 29 cs.RO, cs.AI, cs.CV

Mehdi Hosseinzadeh, King Hang Wong, Feras Dayoub

Core Contributions

Solves long-context video bottleneck by tokenizing only motion-salient keyframes with BEV representations, enabling VLMs to diagnose failures without expensive training
Substantially outperforms vanilla Qwen2.5-VL on the RoboFAC benchmark, with the largest gains on simulation failure detection, identification, and localization
Training-free front-end that transforms robot execution videos into compact evidence—first principled approach to keyframe-anchored VLM prompting for robotics failure analysis

Abstract

Training-free, keyframe-anchored front-end that converts long robot-execution videos into compact tokenized evidence for VLMs. Uses motion-salient keyframes with BEV representations. On RoboFAC benchmark, substantially improves over vanilla Qwen2.5-VL especially in simulation failure detection/identification/localization.
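As a rough illustration of what "motion-salient keyframes" could mean in practice, the sketch below scores each frame by how much it differs from the previous one and keeps the top k. Frame differencing is an assumption made here for illustration; the digest does not state KITE's actual saliency criterion, and the BEV and tokenization stages are not shown. The function name `select_keyframes` and the toy video are hypothetical.

```python
import numpy as np

def select_keyframes(frames, k=4):
    """Pick the k frames with the largest change from their predecessor.

    frames: array of shape (T, H, W) holding grayscale intensities.
    Returns the selected frame indices in temporal order.
    Frame differencing is a stand-in saliency score, not KITE's criterion.
    """
    frames = np.asarray(frames, dtype=np.float64)
    # Mean absolute difference between consecutive frames; frame 0 scores 0.
    diffs = np.abs(frames[1:] - frames[:-1]).mean(axis=(1, 2))
    scores = np.concatenate([[0.0], diffs])
    k = min(k, len(frames))
    top = np.argsort(scores)[-k:]
    return sorted(int(i) for i in top)

# Toy video: a static scene with abrupt changes at frames 3 and 7.
video = np.zeros((10, 8, 8))
video[3:7] = 1.0
video[7:] = 0.25
print(select_keyframes(video, k=2))  # [3, 7]
```

Only the selected frames would then be passed to the VLM, which is what keeps the prompt short for long execution videos.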

MoRight: Motion Control Done Right

h-index: 28 cs.CV, cs.AI, cs.GR, cs.LG, cs.RO

Shaowei Liu, Xuanchi Ren, Tianchang Shen, Huan Ling, Saurabh Gupta

Core Contributions

First framework to disentangle object motion from camera viewpoint, decomposing control into active (user-driven) and passive (consequence) components
Supports both forward control (user commands) and inverse reasoning (predict user intent from observed motion), enabling more natural human-robot interaction
State-of-the-art on three benchmarks with a unified architecture—demonstrates that explicit factorization of motion intent improves generalization over end-to-end approaches

Abstract

Framework for disentangled motion control: separate object motion and camera viewpoint. Decomposes motion into active (user-driven) and passive (consequence) components. Supports forward and inverse reasoning. State-of-the-art on three benchmarks.

Genie Sim PanoRecon: Fast Immersive Scene Generation from Single-View Panorama

h-index: 11 cs.RO

Zhijun Li, Yongxin Su, Di Yang, Jichao Wang, Zheyuan Xing

Core Contributions

Feed-forward Gaussian-splatting pipeline overcomes cold-start problem in simulation—reconstructs full 3D scenes from single panoramic images in seconds
Depth-aware fusion strategy integrates multiple sensor modalities, critical for manipulation task realism in simulated training
Direct integration into Genie Sim reduces sim-to-real friction for manipulation by enabling rapid scenario prototyping without manual asset creation

Abstract

Feed-forward Gaussian-splatting pipeline for 3D scene reconstruction from panorama. Depth-aware fusion strategy. Integrated into Genie Sim for manipulation tasks.

Multi-Robot Coordination & Planning

Exploiting Aggregate Programming in a Multi-Robot Service Prototype

h-index: 30 cs.DC, cs.MA, cs.RO

Giorgio Audrito, Andrea Basso, Daniele Bortoluzzi, Ferruccio Damiani, Giordano Scarso

Core Contributions

Applies aggregate programming to multi-robot coordination in a service prototype set in a university library, illustrating how the paradigm handles scalability and fault tolerance in a realistic setting
Combines field calculus abstractions with simulations and real-world tests, offering evidence that formal approaches can carry over beyond toy domains
Provides templates for adaptive swarm behaviors that handle decentralized coordination without centralized planning—key advantage for large fleets

Abstract

Multi-robot systems using Aggregate Programming for coordination in a university library prototype, validated with simulations and real tests.

Differentiable Environment-Trajectory Co-Optimization for Safe Multi-Agent Navigation

h-index: 7 cs.RO, cs.MA

Zhan Gao, Gabriele Fadini, Stelian Coros, Amanda Prorok

Core Contributions

Bi-level optimization framework jointly designs safe trajectories and environment layouts; gradients with respect to the environment are obtained by differentiating the lower-level KKT conditions via the Implicit Function Theorem
Novel safety metric grounded in measure theory enables principled quantification of collision risk across multi-agent systems
Enables discovery of non-intuitive, safer environment configurations automatically—useful for infrastructure design and autonomous systems deployment

Abstract

Bi-level optimization: lower-level trajectory optimization + upper-level environment optimization. Uses KKT + Implicit Function Theorem. Novel safety metric via measure theory.
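To make the KKT plus Implicit Function Theorem idea concrete, here is a minimal sketch on a toy problem, not the paper's formulation. The lower level is an equality-constrained quadratic program whose linear term `theta` plays the role of an environment parameter; the upper level is a squared distance to a hypothetical `target`. Differentiating the KKT residual gives the sensitivity of the lower-level solution to `theta`, which is checked against finite differences. All names and numbers are assumptions for illustration.

```python
import numpy as np

def solve_lower(theta, Q, A, b):
    """Solve min_x 0.5 x^T Q x - theta^T x  s.t.  A x = b via its KKT system."""
    n, m = Q.shape[0], A.shape[0]
    # KKT conditions: Q x + A^T lam = theta,  A x = b.
    K = np.block([[Q, A.T], [A, np.zeros((m, m))]])
    sol = np.linalg.solve(K, np.concatenate([theta, b]))
    return sol[:n], K

def upper_grad(theta, Q, A, b, target):
    """Gradient of 0.5 * ||x*(theta) - target||^2 with respect to theta."""
    x_star, K = solve_lower(theta, Q, A, b)
    n, m = Q.shape[0], A.shape[0]
    # Implicit Function Theorem on the KKT residual:
    # d[x; lam]/dtheta = K^{-1} [I; 0].
    rhs = np.vstack([np.eye(n), np.zeros((m, n))])
    dx_dtheta = np.linalg.solve(K, rhs)[:n]
    return dx_dtheta.T @ (x_star - target)

Q = np.array([[2.0, 0.5], [0.5, 1.0]])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
theta = np.array([0.3, -0.2])
target = np.array([1.0, 0.0])

def upper_loss(th):
    x, _ = solve_lower(th, Q, A, b)
    return 0.5 * np.sum((x - target) ** 2)

grad = upper_grad(theta, Q, A, b, target)
eps = 1e-6
fd = np.array([(upper_loss(theta + eps * e) - upper_loss(theta - eps * e)) / (2 * eps)
               for e in np.eye(2)])
print(np.allclose(grad, fd, atol=1e-6))
```

The point of the construction is that the upper-level gradient needs only one extra linear solve with the KKT matrix, rather than unrolling the lower-level solver.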

Logical Robots: Declarative Multi-Agent Programming in Logica

h-index: 5 cs.MA, cs.AI, cs.RO

Evgeny Skvortsov, Yilin Xia, Ojaswa Garg, Shawn Bowers, Bertram Ludäscher

Core Contributions

Uses Logica, an open-source declarative logic programming language developed at Google, for multi-agent simulation: logic predicates map observations to motor outputs, replacing imperative control code
Enables humans and AI to specify swarm behaviors as logical constraints rather than sequential scripts, reducing specification errors
Demonstrates alternative to code-based multi-agent programming, opening robotics to domain experts unfamiliar with traditional programming

Abstract

Multi-agent simulation with declarative behavior in Logica. Logic predicates map observations to motor outputs.

Train-Small Deploy-Large: Leveraging Diffusion-Based Multi-Robot Planning

h-index: 4 cs.RO, eess.SY

Siddharth Singh, Soumee Guha, Qing Chang, Scott Acton

Core Contributions

Diffusion model planners generalize across swarm sizes without retraining—train on 2-3 agents, deploy on 5-10, addressing scalability bottleneck
Architecture combines inter-agent attention, which captures spatial interactions, with temporal convolution, which captures temporal dynamics
Enables rapid deployment to larger teams without computational cost of retraining, critical for field robotics applications

Abstract

Diffusion model planner for varying agent numbers. Trained on few agents, generalizes to more. Inter-agent attention + temporal convolution.
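One reason an attention-based planner can be trained on a few agents and run on more is that self-attention across agents has no weights tied to the agent count. The sketch below is a generic single-head attention layer, not the paper's architecture; the function name, feature size, and random weights are assumptions for illustration.

```python
import numpy as np

def inter_agent_attention(feats, Wq, Wk, Wv):
    """Single-head self-attention across agents. feats has shape (N, d)."""
    q, k, v = feats @ Wq, feats @ Wk, feats @ Wv
    logits = q @ k.T / np.sqrt(k.shape[1])
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 4
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# The same weights process a 3-agent team and an 8-agent team.
small = inter_agent_attention(rng.normal(size=(3, d)), Wq, Wk, Wv)
large = inter_agent_attention(rng.normal(size=(8, d)), Wq, Wk, Wv)
print(small.shape, large.shape)  # (3, 4) (8, 4)
```

The layer is also permutation-equivariant: reordering the agents reorders the outputs in the same way, so no agent index is baked into the model.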

Manipulation & Grasping

Towards Multi-Object Nonprehensile Transportation via Shared Teleoperation

h-index: 16 cs.RO

Xinyang Fan, Zhaoyang Chen, Shu Xin, Yi Ren, Zainan Jiang

Core Contributions

MPC-based shared teleoperation framework with virtual object method simplifies multi-object constraint handling—operator controls aggregate motion, not individual contacts
72.45% reduction in sliding distance and complete elimination of tip-overs (0% vs 13.9% baseline) through force-aware control
Demonstrates practical path to non-prehensile multi-object tasks, relevant for warehouse automation and unstructured environments

Abstract

MPC-based shared teleoperation for multi-object nonprehensile transport. Virtual object method for constraint simplification. Reduces sliding distance by 72.45%, eliminates tip-overs (0% vs 13.9%).

TAMEn: Tactile-Aware Manipulation Engine for Closed-Loop Data Collection in Contact-Rich Tasks

h-index: 10 cs.RO

Longyan Wu, Jieji Ren, Chenghang Jiang, Junxi Zhou, Shijia Peng

Core Contributions

Cross-morphology wearable interface enables cost-effective, robot-agnostic tactile data collection—solves sensor cost bottleneck that limits grasp dataset scale
Dual-modal pipeline (precision + portable) with pyramid data regime increases task success from 34% to 75%, demonstrating tactile feedback is learnable and valuable
First large-scale closed-loop tactile data collection system—addresses why contact-rich tasks remain hard despite vision-based datasets

Abstract

Cross-morphology wearable interface for tactile data collection. Dual-modal pipeline (precision + portable). Pyramid data regime. Increases task success from 34% to 75%.

Learning-Based Strategy for Composite Robot Assembly Skill Adaptation

h-index: 10 cs.RO

Khalil Abuibaid, Aleksandr Sidorenko, Achim Wagner, Martin Ruskowski

Core Contributions

Residual RL with composite skills (pre/post/invariant conditions) enables task adaptation without monolithic retraining—modular approach to assembly robustness
Demonstrates SAC+JAX integration on real UR5e peg-in-hole, bridging sim-to-real with structured skill composition
Composite skill framework provides interpretability—domain experts can reason about which conditions must hold for successful assembly

Abstract

Residual RL for peg-in-hole assembly with composite skills. Pre/post/invariant conditions. Evaluated on UR5e with SAC+JAX.

Sustainable Transfer Learning for Adaptive Robot Skills

h-index: 10 cs.RO

Khalil Abuibaid, Vinit Hegiste, Nigora Gafur, Achim Wagner, Martin Ruskowski

Core Contributions

Demonstrates policy transfer across heterogeneous robot platforms for peg-in-hole, addressing generalization concerns in embodied learning
Fine-tuning significantly outperforms zero-shot transfer, quantifying the benefit-cost tradeoff of domain adaptation
Enables skill libraries to be shared across platforms, reducing training overhead when deploying to new hardware

Abstract

Policy transfer across robot platforms for peg-in-hole. Fine-tuning significantly improves over zero-shot transfer.

BiDexGrasp: Coordinated Bimanual Dexterous Grasps across Object Geometries and Sizes

h-index: 10 cs.RO

Mu Lin, Yi-Lin Wei, Jiaxuan Chen, Yuhao Lin, Shuoyu Chen

Core Contributions

Large-scale bimanual grasp dataset (6351 objects, 9.7M annotations) fills a critical gap—most prior work focuses on single-arm, limiting applicability to dual-arm systems
Two-stage synthesis (region-based initialization + force-closure optimization) provides computational efficiency and physical validity
Bimanual coordination module enables grasp quality assessment across morphologically distinct hand pairs, useful for heterogeneous manipulation teams

Abstract

Large-scale bimanual grasp dataset: 6351 objects, 9.7M grasp annotations. Two-stage synthesis: region-based init + force-closure optimization. Bimanual coordination module.

Motion Planning & Robot Architecture

Flow Motion Policy: Manipulator Motion Planning with Flow Matching Models

h-index: 15 cs.RO, cs.AI

Davood Soleymanzadeh, Xiao Liang, Minghui Zheng

Core Contributions

Open-loop end-to-end neural planner using flow matching generates multi-modal trajectories in one forward pass—avoids iterative sampling bottleneck
Best-of-N sampling provides flexible accuracy/speed tradeoff—operator can increase N for tighter paths during final approach
Demonstrates flow matching (less understood than diffusion) is viable for continuous control, expanding toolbox for generative robot planning

Abstract

Open-loop end-to-end neural motion planner using flow matching for multi-modal path generation. Best-of-N sampling improves planning success and efficiency.
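The sketch below illustrates the two ingredients named above, sampling from a flow by integrating a velocity field and best-of-N selection, on a deliberately tiny problem. The velocity field `toy_velocity` is hand-written and stands in for a trained network; it sends noise to one of two waypoint modes. `START`, `GOAL`, `MODE`, and the path-length cost are all hypothetical choices for illustration, and the paper's sampler and scoring rule may differ.

```python
import numpy as np

START, GOAL = np.array([0.0, 0.0]), np.array([2.0, 0.0])
MODE = np.array([1.0, 0.5])  # hypothetical waypoint; its mirror image is the second mode

def toy_velocity(x, t):
    """Hand-written stand-in for a learned velocity field (not a trained model)."""
    target = MODE if x[0] >= 0 else -MODE
    return (target - x) / (1.0 - t)

def sample_flow(velocity, x0, steps=10):
    """Euler-integrate dx/dt = velocity(x, t) from t = 0 towards t = 1."""
    x, dt = np.array(x0, dtype=np.float64), 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity(x, i * dt)
    return x

def path_cost(waypoint):
    """Length of the two-segment path START -> waypoint -> GOAL."""
    return float(np.linalg.norm(waypoint - START) + np.linalg.norm(GOAL - waypoint))

def best_of_n(velocity, cost, n, dim, rng, steps=10):
    """Draw n noise samples, push each through the flow, keep the cheapest."""
    candidates = [sample_flow(velocity, rng.normal(size=dim), steps) for _ in range(n)]
    costs = [cost(c) for c in candidates]
    return candidates[int(np.argmin(costs))], min(costs)

best, cost = best_of_n(toy_velocity, path_cost, n=8, dim=2, rng=np.random.default_rng(0))
print(best, cost)
```

Increasing `n` can only lower the selected cost for a fixed random stream, at the price of more sampling work, which is the accuracy and speed trade-off best-of-N exposes.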

RichMap: A Reachability Map Balancing Precision, Efficiency, and Flexibility

h-index: 6 cs.RO

Yupu Lu, Yuxiang Ma, Jia Pan

Core Contributions

High-precision reachability map achieves >98% accuracy with only 1-2% false positives and ~15μs query latency—enables real-time planning constraints
MMD metrics quantify workspace similarity across embodiments, enabling reachability map transfer with up to 26% improvement in cross-embodiment diffusion policy performance
Solves a practical deployment problem: how to reuse inverse kinematics knowledge across robot variants without recomputing

Abstract

High-precision reachability map: >98% accuracy, 1-2% false positives, ~15μs/query. MMD metrics for workspace similarity. Up to 26% improvement in cross-embodiment diffusion policy transfer.
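For readers unfamiliar with MMD (maximum mean discrepancy), the sketch below computes a standard biased estimate of squared MMD with an RBF kernel between two point sets, which is one way to compare sampled workspaces. The kernel choice, bandwidth, and the synthetic point clouds are assumptions for illustration; the digest does not specify how RichMap configures its MMD metrics.

```python
import numpy as np

def rbf_kernel(X, Y, bandwidth):
    """Gaussian kernel matrix between rows of X (n, d) and Y (m, d)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def mmd2(X, Y, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    return float(rbf_kernel(X, X, bandwidth).mean()
                 + rbf_kernel(Y, Y, bandwidth).mean()
                 - 2.0 * rbf_kernel(X, Y, bandwidth).mean())

rng = np.random.default_rng(0)
workspace_a = rng.normal(size=(200, 3))  # hypothetical reachable points, robot A
workspace_near = workspace_a + 0.01      # nearly identical workspace
workspace_far = workspace_a + 3.0        # strongly shifted workspace

print(mmd2(workspace_a, workspace_near), mmd2(workspace_a, workspace_far))
```

A small value indicates similar workspaces, which is the situation in which reusing a reachability map across embodiments is most plausible.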

AEROS: A Single-Agent Operating Architecture with Embodied Capability Modules

h-index: 2 cs.RO, cs.AI

Xue Qin, Simin Luan, Cong Yang, Zhijun Li

Core Contributions

Runtime OS for robots with pluggable Embodied Capability Modules enables modular deployment—100% task success vs 67-93% for integrated baselines
Zero false acceptances in policy enforcement demonstrates robust containment—modules cannot silently violate safety constraints
Provides an OS-level abstraction for embodied AI, in the spirit of what a general-purpose operating system offers application software, so that capabilities can be installed and composed at runtime

Abstract

Runtime OS for robots with installable Embodied Capability Modules. 100% task success vs 67-93% baselines. Zero false acceptances in policy enforcement. Franka Panda evaluation.

Robust Quadruped Locomotion via Evolutionary Reinforcement Learning

h-index: 1 cs.RO

Brian McAteer, Karl Mason

Core Contributions

CEM-TD3 hybrid achieves 19574.33 mean reward on rough terrain vs -99.73 for vanilla TD3—evolutionary strategy discovers better initialization and exploration
Evolutionary variants retain capability under terrain transfer, demonstrating evolutionary search finds more robust policies than pure gradient descent
Speaks to a long-standing difficulty in quadruped learning: purely gradient-based RL such as TD3 can stall on unstructured terrain, where a population-based search keeps exploring

Abstract

CEM-TD3 achieves 19574.33 mean reward on rough terrain vs -99.73 for TD3. Evolutionary variants retain capability under terrain transfer.
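The evolutionary half of a CEM-TD3 hybrid is the cross-entropy method (CEM). The sketch below shows only that loop on a toy objective: sample a population of parameter vectors, keep the best-scoring fraction, and refit the sampling distribution to them. It leaves out the TD3 actor-critic updates entirely, and the fitness function, population size, and other settings are assumptions for illustration rather than the paper's configuration.

```python
import numpy as np

def cem(fitness, dim, iters=30, pop=32, elite_frac=0.25, seed=0):
    """Cross-entropy method: maximise fitness over a diagonal Gaussian."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        samples = mean + std * rng.normal(size=(pop, dim))
        scores = np.array([fitness(s) for s in samples])
        elite = samples[np.argsort(scores)[-n_elite:]]  # highest scores
        # Refit the distribution to the elites; the floor keeps some exploration.
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-2
    return mean

# Toy "policy parameters": fitness peaks at a hypothetical optimum.
optimum = np.array([0.5, -0.3, 0.2])
fitness = lambda theta: -float(np.sum((theta - optimum) ** 2))

found = cem(fitness, dim=3)
print(found)
```

In a hybrid scheme, part of the population would additionally receive gradient updates from the RL learner before being evaluated; that step is not shown here.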

SLAM, Localization & Autonomous Driving

VGGT-SLAM++: Visual SLAM with Geometry Grounded Transformer

h-index: 4 cs.CV, cs.RO

Avilasha Mandal, Rajesh Kumar, Sudarshan Sunil Harithas, Chetan Arora

Core Contributions

VGGT (Visual Geometry Grounded Transformer) front-end aims to make feature matching more robust, targeting the low-texture scenes where classical feature-based pipelines such as ORB-SLAM tend to lose tracking
DEM-based graph backend + DINOv2 embeddings achieve state-of-the-art SLAM accuracy by integrating semantic and geometric constraints
Restores high-cadence local bundle adjustment, critical for real-time applications where drift accumulates quickly

Abstract

VGGT front-end + DEM-based graph + DINOv2 embeddings. Restores high-cadence local bundle adjustment. State-of-the-art SLAM accuracy.

An RTK-SLAM Dataset for Absolute Accuracy Evaluation in GNSS-Degraded Environments

h-index: 3 cs.RO, cs.CV

Wei Zhang, Vincent Ress, David Skuddis, Uwe Soergel, Norbert Haala

Core Contributions

Ground truth from a geodetic total station rather than GNSS gives an independent reference, accurate to centimeters outdoors and decimeters indoors, for evaluating robots where satellite positioning is degraded
Shows that SE(3) alignment underestimates error by up to 76%, so the common practice of aligning trajectories before scoring can hide a large share of the absolute error
Dataset enables honest benchmarking of multi-sensor fusion systems in realistic degraded scenarios, critical for autonomous vehicles

Abstract

Dataset with geodetic total station ground truth (not GNSS). SE(3) alignment underestimates error by up to 76%. Centimeter-level accuracy outdoors, decimeter indoors.
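Why can SE(3) alignment understate error? A trajectory that is internally consistent but globally misplaced scores almost perfectly once it is rigidly aligned to the reference. The sketch below builds such a case with a synthetic helix and a made-up 5 degree rotation plus offset, then compares the error before and after a Kabsch alignment. It is an illustration of the mechanism only, not the dataset's evaluation code, and the 76% figure above is not reproduced here.

```python
import numpy as np

def rmse(a, b):
    """Root-mean-square position error between two (n, 3) trajectories."""
    return float(np.sqrt(((a - b) ** 2).sum(axis=1).mean()))

def align_se3(est, gt):
    """Least-squares rigid alignment of est onto gt (Kabsch, no scale)."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    H = (est - mu_e).T @ (gt - mu_g)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return (est - mu_e) @ R.T + mu_g

# Synthetic reference trajectory (a helix) and a globally misplaced estimate.
t = np.linspace(0.0, 4.0 * np.pi, 100)
gt = np.stack([np.cos(t), np.sin(t), 0.1 * t], axis=1)
angle = np.deg2rad(5.0)
Rz = np.array([[np.cos(angle), -np.sin(angle), 0.0],
               [np.sin(angle),  np.cos(angle), 0.0],
               [0.0, 0.0, 1.0]])
est = gt @ Rz.T + np.array([0.5, -0.2, 0.1])

print(rmse(est, gt))                  # absolute error, no alignment
print(rmse(align_se3(est, gt), gt))   # error after SE(3) alignment
```

The aligned error collapses to numerical noise even though every estimated position is off by a large fraction of a metre, which is why an absolute reference such as a total station matters for georeferenced evaluation.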

CADENCE: Context-Adaptive Depth Estimation for Navigation

h-index: 2 cs.RO, cs.AI, cs.LG

Timothy K Johnsen, Marco Levorato

Core Contributions

Adaptive depth estimation scales computational cost based on navigation context—75% energy reduction on edge hardware (Jetson Orin Nano) without sacrificing accuracy
7.43% navigation accuracy improvement demonstrates that selective refinement is beneficial, not just cost-saving
Enables deployment to resource-constrained platforms, critical for swarms and long-endurance missions

Abstract

Adaptive system scaling depth estimation complexity based on navigation needs. 75% energy reduction, 7.43% navigation accuracy improvement on NVIDIA Jetson Orin Nano.

Self-Discovered Intention-aware Transformer for Multi-modal Vehicle Trajectory Prediction

h-index: 2 cs.RO, cs.AI, cs.LG

Diyi Liu, Zihan Niu, Tu Xu, Lishan Sun

Core Contributions

Pure Transformer (no RNNs) with two-track architecture jointly predicts trajectories and behavioral intentions, eliminating decoupling errors
Residual offset learning discovers trajectory groups self-supervised, reducing annotation burden for motion datasets
Applies to autonomous driving prediction, enabling better anticipation of multi-modal vehicle futures without explicit mode labels

Abstract

Pure Transformer for trajectory + intention prediction with two-track design. Learns ordered trajectory groups via residual offsets.

Human Data, Bio-Inspired & Infrastructure

Telecom World Models: Unifying Digital Twins, Foundation Models, and Predictive Planning for 6G

h-index: 29 cs.RO, eess.SP, eess.SY

Hang Zou, Yuzhi Yang, Lina Bariah, Yu Tian, Yuhuan Lu

Core Contributions

Telecom World Model architecture applies learned, action-conditioned, uncertainty-aware dynamics modeling to 6G network slicing—bridges embodied AI and telecom systems
Three-layer architecture unifies digital twins, foundation models, and planning—demonstrates that world models generalize beyond robotics
Proof-of-concept on network slicing shows practical value for infrastructure optimization, opening robotics methodologies to telecom domain

Abstract

Telecom World Model architecture for learned, action-conditioned, uncertainty-aware modeling of telecom dynamics. Three-layer architecture for 6G. Proof-of-concept on network slicing.

RoSHI: A Versatile Robot-oriented Suit for Human Data In-the-Wild

h-index: 25 cs.RO, cs.AI, cs.CV

Wenjing Margaret Mao, Jefferson Ng, Luyang Hu, Daniel Gehrig, Antonio Loquercio

Core Contributions

Hybrid wearable (IMUs + Project Aria glasses) estimates full 3D pose and body shape from egocentric view, solving the cold-start problem for humanoid policy learning
Outperforms previous egocentric baselines and performs comparably to SAM3D, suggesting that fusing IMUs with egocentric vision helps for in-the-wild capture
Enables cost-effective on-the-job human motion capture for robotics—reduces instrumentation burden for data collection in real environments

Abstract

Hybrid wearable fusing IMUs with Project Aria glasses for full 3D pose/body shape estimation. Outperforms egocentric baselines, comparable to SAM3D. Demonstrated for humanoid policy learning.

Exploring the proprioceptive potential of joint receptors using a biomimetic robotic joint

h-index: 16 cs.RO, q-bio.NC

Akihiro Miki, Shun Hasegawa, Sota Yuzaki, Yuta Sahara, Yoshimoto Ribayashi

Core Contributions

Biomimetic joint with a Type I receptor analog achieves under 2 degrees average error in 3D bending and twisting, giving hardware evidence that such receptors can encode joint configuration
Suggests joint receptors play greater proprioceptive role than previously thought, shifting understanding of sensorimotor control architecture
Opens path to biologically-inspired sensing in robots, potentially simpler and more robust than vision-based proprioception

Abstract

Biomimetic joint mimicking Type I joint receptors achieves <2 degrees average error in bending and twisting. Suggests joint receptors play greater role in proprioception than previously thought.

Infrastructure First: Enabling Embodied AI for Science in the Global South

h-index: 14 cs.CY, cs.RO

Shaoshan Liu, Jie Tang, Marwa S. Hassan, Mohamed H. Sharkawy, Moustafa M. G. Fouda

Core Contributions

Argues for infrastructure-first approach to embodied AI deployment in resource-limited settings—prioritizes grid power, compute, connectivity over algorithms
Outlines practical requirements for scaling embodied intelligence beyond well-resourced labs, addressing a critical gap in robotics deployment literature
Emphasizes that robotics accessibility requires infrastructure investment, not just algorithmic innovation—reshapes how we should think about global impact

Abstract

Argues for infrastructure-first approach to Embodied AI for Science. Outlines requirements for deploying embodied intelligence at scale in resource-limited settings.

Robots that learn to evaluate models of collective behavior

h-index: 5 cs.RO

Mathis Hocke, Andreas Gerken, David Bierbach, Jens Krause, Tim Landgraf

Core Contributions

RL-based RoboFish autonomously evaluates fish behavior models through closed-loop interaction—novel approach to model validation that doesn't require labeled data
Neural network fish model shows the smallest sim-to-real gap among the learned and hand-crafted models compared, suggesting it captures the observed behavior more faithfully
Demonstrates robots can serve as experimental platforms for behavioral science, inverting typical application direction

Abstract

RL-based RoboFish evaluates fish behavior models through closed-loop interaction. Neural network fish model shows smallest sim-to-real gap.