arXiv Robotics Digest

Curated Papers for April 8, 2026

25 papers ranked by maximum author h-index

Research Landscape

Today's batch focuses on scaling robot learning through better data collection and transfer mechanisms. RoSHI, TAMEn, and BiDexGrasp attack the data bottleneck from complementary angles (wearable sensing, tactile-aware collection, and large-scale grasp annotation), which suggests the field is shifting from purely algorithmic advances toward systematic solutions for data scarcity. These efforts support downstream work on policy transfer (Learning-Based Assembly, Sustainable Transfer) and sim-to-real robustness (Robust Quadruped Locomotion via evolutionary RL).

Multi-robot coordination continues to mature along two lines: declarative and rule-based methods (Aggregate Programming, Logical Robots) coexist with learning- and optimization-based planners (Train-Small Deploy-Large, Differentiable Environment-Trajectory Co-Optimization). The split is pragmatic. Declarative formulations make swarm behavior easier to specify and check, while diffusion models and bi-level optimization add flexibility in dynamic environments. That both lineages advance in the same batch suggests neither dominates the design space.

Vision-language models are being pushed beyond task execution toward failure diagnosis: KITE shows that tokenized, keyframe-anchored evidence substantially improves VLM-based failure detection over vanilla Qwen2.5-VL. In parallel, scene-understanding pipelines (Genie Sim PanoRecon, MoRight) contribute 3D reconstruction and disentangled motion control that connect perception to planning. Infrastructure work (RTK-SLAM Dataset, CADENCE energy-aware sensing) and architectural work (AEROS, RichMap) round out the batch, a sign that robotics in 2026 is as much about systems integration and measurement rigor as about novel learning algorithms.

Vision-Language Models & Scene Generation (3 papers): VLM-based failure analysis, motion control, 3D reconstruction

Multi-Robot Coordination & Planning (4 papers): Aggregate programming, diffusion planners, declarative multi-agent

Manipulation & Grasping (5 papers): Bimanual grasping, tactile sensing, peg-in-hole assembly

Motion Planning & Robot Architecture (4 papers): Flow matching, OS design, quadruped locomotion

SLAM, Localization & Autonomous Driving (4 papers): Visual SLAM, RTK positioning, trajectory prediction

Human Data, Bio-Inspired & Infrastructure (5 papers): Wearable sensing, proprioceptive joints, telecom world models

Vision-Language Models & Scene Generation

Mehdi Hosseinzadeh, King Hang Wong, Feras Dayoub
Core Contributions
  • Solves long-context video bottleneck by tokenizing only motion-salient keyframes with BEV representations, enabling VLMs to diagnose failures without expensive training
  • Substantially outperforms vanilla Qwen2.5-VL on RoboFAC, with particularly strong gains in simulation failure detection and localization tasks
  • Training-free front-end that converts robot execution videos into compact, keyframe-anchored evidence for prompting VLMs in failure analysis
Abstract
Training-free, keyframe-anchored front-end that converts long robot-execution videos into compact tokenized evidence for VLMs. Uses motion-salient keyframes with BEV representations. On RoboFAC benchmark, substantially improves over vanilla Qwen2.5-VL especially in simulation failure detection/identification/localization.
h-index: 28 cs.CV, cs.AI, cs.GR, cs.LG, cs.RO
Shaowei Liu, Xuanchi Ren, Tianchang Shen, Huan Ling, Saurabh Gupta
Core Contributions
  • Disentangles object motion from camera viewpoint and decomposes motion into active (user-driven) and passive (consequence) components
  • Supports both forward control (user commands) and inverse reasoning (predict user intent from observed motion), enabling more natural human-robot interaction
  • State-of-the-art on three benchmarks with a unified architecture—demonstrates that explicit factorization of motion intent improves generalization over end-to-end approaches
Abstract
Framework for disentangled motion control: separate object motion and camera viewpoint. Decomposes motion into active (user-driven) and passive (consequence) components. Supports forward and inverse reasoning. State-of-the-art on three benchmarks.
Zhijun Li, Yongxin Su, Di Yang, Jichao Wang, Zheyuan Xing
Core Contributions
  • Feed-forward Gaussian-splatting pipeline reconstructs 3D scenes from a panorama, so simulation scenes do not have to be built from scratch
  • Uses a depth-aware fusion strategy during reconstruction, aimed at scenes realistic enough for manipulation tasks in simulation
  • Direct integration into Genie Sim reduces sim-to-real friction for manipulation by enabling rapid scenario prototyping without manual asset creation
Abstract
Feed-forward Gaussian-splatting pipeline for 3D scene reconstruction from panorama. Depth-aware fusion strategy. Integrated into Genie Sim for manipulation tasks.

Multi-Robot Coordination & Planning

Giorgio Audrito, Andrea Basso, Daniele Bortoluzzi, Ferruccio Damiani, Giordano Scarso
Core Contributions
  • Applies aggregate programming to multi-robot coordination in a university library prototype, targeting scalability and fault tolerance
  • Combines field calculus abstractions with validation in both simulation and real-robot tests, evidence that the formal approach works beyond toy domains
  • Provides templates for adaptive swarm behaviors that handle decentralized coordination without centralized planning—key advantage for large fleets
Abstract
Multi-robot systems using Aggregate Programming for coordination in a university library prototype, validated with simulations and real tests.
Zhan Gao, Gabriele Fadini, Stelian Coros, Amanda Prorok
Core Contributions
  • Bi-level optimization framework jointly designs safe trajectories and environment layouts, differentiating through the lower-level trajectory optimizer via its KKT conditions and the Implicit Function Theorem
  • Novel safety metric grounded in measure theory enables principled quantification of collision risk across multi-agent systems
  • Enables discovery of non-intuitive, safer environment configurations automatically—useful for infrastructure design and autonomous systems deployment
Abstract
Bi-level optimization: lower-level trajectory optimization + upper-level environment optimization. Uses KKT + Implicit Function Theorem. Novel safety metric via measure theory.
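For readers unfamiliar with the KKT + Implicit Function Theorem combination, the generic textbook form is sketched below. The notation is assumed here for illustration and is not taken from the paper: $\theta$ denotes the environment parameters, $z$ the stacked primal and dual variables of the lower-level trajectory problem, $F$ the KKT residual, and $J$ the upper-level objective. The result holds wherever $\partial F / \partial z$ is invertible at the solution.

```latex
\begin{aligned}
&\text{Lower level (KKT system):} && F\bigl(z^*(\theta),\, \theta\bigr) = 0 \\[4pt]
&\text{Implicit Function Theorem:} && \frac{\partial z^*}{\partial \theta}
   = -\left(\frac{\partial F}{\partial z}\right)^{-1} \frac{\partial F}{\partial \theta} \\[4pt]
&\text{Upper-level gradient:} && \frac{dJ}{d\theta}
   = \frac{\partial J}{\partial \theta} + \frac{\partial J}{\partial z}\,\frac{\partial z^*}{\partial \theta}
\end{aligned}
```

The practical consequence is that the environment gradient needs only one linear solve at the converged trajectory, with no backpropagation through the optimizer's iterations.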
Evgeny Skvortsov, Yilin Xia, Ojaswa Garg, Shawn Bowers, Bertram Ludäscher
Core Contributions
  • Leverages Logica (Google's declarative query language) for multi-agent simulation—maps logic predicates directly to motor outputs, eliminating imperative control code
  • Enables humans and AI to specify swarm behaviors as logical constraints rather than sequential scripts, reducing specification errors
  • Demonstrates alternative to code-based multi-agent programming, opening robotics to domain experts unfamiliar with traditional programming
Abstract
Multi-agent simulation with declarative behavior in Logica. Logic predicates map observations to motor outputs.
Siddharth Singh, Soumee Guha, Qing Chang, Scott Acton
Core Contributions
  • Diffusion model planners generalize across swarm sizes without retraining—train on 2-3 agents, deploy on 5-10, addressing scalability bottleneck
  • Inter-agent attention + temporal convolution architecture captures both spatial interactions and temporal dynamics elegantly
  • Enables rapid deployment to larger teams without computational cost of retraining, critical for field robotics applications
Abstract
Diffusion model planner for varying agent numbers. Trained on few agents, generalizes to more. Inter-agent attention + temporal convolution.
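One reason an attention-based planner can be trained on a few agents and run on more is that attention pools over however many agents are present, with no weights tied to the agent count. The sketch below illustrates only that property. It is a toy under stated assumptions, not the paper's architecture: the function name is hypothetical, the query/key/value projections are replaced by the identity, and the temporal convolution is omitted.

```python
import math

def attention_pool(states):
    """For each agent, return a softmax-weighted average of all agents' states.

    Assumption: identity query/key/value projections stand in for learned ones.
    Nothing here depends on len(states), so the same code runs for any team size.
    """
    d = len(states[0])
    pooled = []
    for query in states:
        # Scaled dot-product score between this agent and every agent (self included).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in states]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        pooled.append([sum(w * key[j] for w, key in zip(weights, states)) / total
                       for j in range(d)])
    return pooled

# Same function, two team sizes.
three_agents = attention_pool([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
eight_agents = attention_pool([[float(i), float(i % 3)] for i in range(8)])
print(len(three_agents), len(eight_agents))
```

The output has one row per agent regardless of team size, and reordering the agents only reorders the rows.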

Manipulation & Grasping

Xinyang Fan, Zhaoyang Chen, Shu Xin, Yi Ren, Zainan Jiang
Core Contributions
  • MPC-based shared teleoperation framework with virtual object method simplifies multi-object constraint handling—operator controls aggregate motion, not individual contacts
  • 72.45% reduction in sliding distance and complete elimination of tip-overs (0% vs 13.9% baseline) through force-aware control
  • Demonstrates practical path to non-prehensile multi-object tasks, relevant for warehouse automation and unstructured environments
Abstract
MPC-based shared teleoperation for multi-object nonprehensile transport. Virtual object method for constraint simplification. Reduces sliding distance by 72.45%, eliminates tip-overs (0% vs 13.9%).
Longyan Wu, Jieji Ren, Chenghang Jiang, Junxi Zhou, Shijia Peng
Core Contributions
  • Cross-morphology wearable interface enables cost-effective, robot-agnostic tactile data collection—solves sensor cost bottleneck that limits grasp dataset scale
  • Dual-modal pipeline (precision + portable) with pyramid data regime increases task success from 34% to 75%, demonstrating tactile feedback is learnable and valuable
  • Closed-loop tactile data collection system that targets contact-rich tasks, which remain hard to learn from vision-only datasets
Abstract
Cross-morphology wearable interface for tactile data collection. Dual-modal pipeline (precision + portable). Pyramid data regime. Increases task success from 34% to 75%.
Khalil Abuibaid, Aleksandr Sidorenko, Achim Wagner, Martin Ruskowski
Core Contributions
  • Residual RL with composite skills (pre/post/invariant conditions) enables task adaptation without monolithic retraining—modular approach to assembly robustness
  • Demonstrates SAC+JAX integration on real UR5e peg-in-hole, bridging sim-to-real with structured skill composition
  • Composite skill framework provides interpretability—domain experts can reason about which conditions must hold for successful assembly
Abstract
Residual RL for peg-in-hole assembly with composite skills. Pre/post/invariant conditions. Evaluated on UR5e with SAC+JAX.
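The core idea of residual RL can be sketched in a few lines: a hand-written base controller supplies most of the action and a learned policy adds a small, bounded correction. Everything below is a hypothetical illustration, not the paper's implementation. The function names, the proportional base controller, the linear-plus-tanh residual, and the 0.02 bound are all assumptions, and the pre/post/invariant skill conditions are not modeled.

```python
import math

def base_action(peg_xy, hole_xy, gain=0.5):
    """Hypothetical base controller: proportional step from the peg toward the hole."""
    return [gain * (h - p) for p, h in zip(peg_xy, hole_xy)]

def residual_action(features, weights, max_residual=0.02):
    """Hypothetical learned correction: linear layer followed by tanh.

    The tanh keeps every component within +/- max_residual, so the learned
    part can refine the base motion but never override it.
    """
    return [max_residual * math.tanh(sum(w * f for w, f in zip(row, features)))
            for row in weights]

def combined_action(peg_xy, hole_xy, features, weights):
    """Residual RL action: base controller output plus the bounded correction."""
    base = base_action(peg_xy, hole_xy)
    residual = residual_action(features, weights)
    return [b + r for b, r in zip(base, residual)]

# Features could be, for example, force readings; the values here are made up.
features = [1.0, 1.0]
untrained = [[0.0, 0.0], [0.0, 0.0]]   # zero weights: policy falls back to the base
trained = [[50.0, 0.0], [0.0, -50.0]]  # saturating weights: correction hits the bound
print(combined_action([0.0, 0.0], [0.1, 0.2], features, untrained))
print(combined_action([0.0, 0.0], [0.1, 0.2], features, trained))
```

With zero weights the combined action equals the base controller's output, which is why residual policies tend to start from reasonable behavior instead of random exploration.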
Khalil Abuibaid, Vinit Hegiste, Nigora Gafur, Achim Wagner, Martin Ruskowski
Core Contributions
  • Demonstrates policy transfer across heterogeneous robot platforms for peg-in-hole, addressing generalization concerns in embodied learning
  • Fine-tuning significantly outperforms zero-shot transfer, quantifying the benefit-cost tradeoff of domain adaptation
  • Enables skill libraries to be shared across platforms, reducing training overhead when deploying to new hardware
Abstract
Policy transfer across robot platforms for peg-in-hole. Fine-tuning significantly improves over zero-shot transfer.
Mu Lin, Yi-Lin Wei, Jiaxuan Chen, Yuhao Lin, Shuoyu Chen
Core Contributions
  • Large-scale bimanual grasp dataset (6351 objects, 9.7M annotations) fills a critical gap—most prior work focuses on single-arm, limiting applicability to dual-arm systems
  • Two-stage synthesis (region-based initialization + force-closure optimization) provides computational efficiency and physical validity
  • Bimanual coordination module enables grasp quality assessment across morphologically distinct hand pairs, useful for heterogeneous manipulation teams
Abstract
Large-scale bimanual grasp dataset: 6351 objects, 9.7M grasp annotations. Two-stage synthesis: region-based init + force-closure optimization. Bimanual coordination module.

Motion Planning & Robot Architecture

Davood Soleymanzadeh, Xiao Liang, Minghui Zheng
Core Contributions
  • Open-loop end-to-end neural planner using flow matching generates multi-modal trajectories in one forward pass—avoids iterative sampling bottleneck
  • Best-of-N sampling provides flexible accuracy/speed tradeoff—operator can increase N for tighter paths during final approach
  • Shows that flow matching is a viable alternative to diffusion for generative motion planning
Abstract
Open-loop end-to-end neural motion planner using flow matching for multi-modal path generation. Best-of-N sampling improves planning success and efficiency.
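Best-of-N sampling is simple to state in code: draw N candidate paths from the generative planner, discard the ones that collide, and keep the best of the rest. The sketch below is an assumption-heavy toy. `sample_path` is a hypothetical stand-in for the learned planner (a perturbed straight line, not flow matching), obstacles are circles, and "best" is taken to mean shortest.

```python
import math
import random

def sample_path(start, goal, rng, n_waypoints=8, noise=0.3):
    """Hypothetical stand-in for one planner sample: a noisy straight line."""
    path = []
    for i in range(n_waypoints + 1):
        t = i / n_waypoints
        x = start[0] + t * (goal[0] - start[0])
        y = start[1] + t * (goal[1] - start[1])
        scale = noise * math.sin(math.pi * t)  # taper noise toward both endpoints
        path.append((x + rng.gauss(0, scale), y + rng.gauss(0, scale)))
    return path

def in_collision(path, obstacles):
    """True if any waypoint lies inside a circular obstacle (cx, cy, radius)."""
    return any(math.hypot(x - cx, y - cy) < r
               for x, y in path for cx, cy, r in obstacles)

def path_length(path):
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))

def best_of_n(start, goal, obstacles, n, seed=0):
    """Draw n candidates, drop colliding ones, return the shortest (or None)."""
    rng = random.Random(seed)
    candidates = [sample_path(start, goal, rng) for _ in range(n)]
    feasible = [p for p in candidates if not in_collision(p, obstacles)]
    return min(feasible, key=path_length) if feasible else None

start, goal = (0.0, 0.0), (4.0, 0.0)
for n in (1, 4, 32):
    best = best_of_n(start, goal, obstacles=[], n=n)
    print(n, round(path_length(best), 3))
```

Because the seed is fixed, the candidates for a smaller N are a prefix of those for a larger N, so the best path length can only stay the same or improve as N grows. The cost is N planner samples per query.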
Yupu Lu, Yuxiang Ma, Jia Pan
Core Contributions
  • High-precision reachability map achieves >98% accuracy with only 1-2% false positives and ~15μs query latency—enables real-time planning constraints
  • Maximum mean discrepancy (MMD) metrics quantify workspace similarity across embodiments, enabling reachability map transfer with up to 26% improvement in cross-embodiment diffusion policy transfer
  • Solves a practical deployment problem: how to reuse inverse kinematics knowledge across robot variants without recomputing
Abstract
High-precision reachability map: >98% accuracy, 1-2% false positives, ~15μs/query. MMD metrics for workspace similarity. Up to 26% improvement in cross-embodiment diffusion policy transfer.
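To make the workspace-similarity idea concrete, the sketch below computes a squared maximum mean discrepancy between point sets sampled from three made-up planar workspaces. It is a minimal illustration under stated assumptions: the RBF kernel, the bandwidth, the biased estimator, and the disc-shaped workspaces are all choices made here, not details from the paper.

```python
import math
import random

def rbf(a, b, bandwidth):
    """Gaussian (RBF) kernel between two points."""
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=0.5):
    """Biased (V-statistic) estimate of squared MMD between two point sets."""
    def mean_kernel(ps, qs):
        return sum(rbf(p, q, bandwidth) for p in ps for q in qs) / (len(ps) * len(qs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

def sample_workspace(rng, reach, n=200):
    """Hypothetical workspace: points uniform in a disc of radius `reach` (metres)."""
    points = []
    for _ in range(n):
        r = reach * math.sqrt(rng.random())
        theta = 2.0 * math.pi * rng.random()
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points

rng = random.Random(0)
arm_a = sample_workspace(rng, reach=0.85)
arm_b = sample_workspace(rng, reach=0.90)  # similar reach to arm_a
arm_c = sample_workspace(rng, reach=0.40)  # much shorter reach

print("a vs b:", mmd_squared(arm_a, arm_b))
print("a vs c:", mmd_squared(arm_a, arm_c))
```

A small value suggests the two workspaces cover similar regions, which is the kind of signal one could use to decide whether a reachability map is worth transferring between embodiments.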
Xue Qin, Simin Luan, Cong Yang, Zhijun Li
Core Contributions
  • Runtime OS for robots with pluggable Embodied Capability Modules enables modular deployment—100% task success vs 67-93% for integrated baselines
  • Zero false acceptances in policy enforcement demonstrates robust containment—modules cannot silently violate safety constraints
  • Provides an OS-level abstraction for embodied AI in which capabilities are installed and composed as modules rather than built into a single integrated stack
Abstract
Runtime OS for robots with installable Embodied Capability Modules. 100% task success vs 67-93% baselines. Zero false acceptances in policy enforcement. Franka Panda evaluation.
Brian McAteer, Karl Mason
Core Contributions
  • CEM-TD3 hybrid achieves 19574.33 mean reward on rough terrain vs -99.73 for vanilla TD3—evolutionary strategy discovers better initialization and exploration
  • Evolutionary variants retain capability under terrain transfer, demonstrating evolutionary search finds more robust policies than pure gradient descent
  • Targets a known weakness of purely gradient-based RL on unstructured terrain, where vanilla TD3 ends with a negative mean reward
Abstract
CEM-TD3 achieves 19574.33 mean reward on rough terrain vs -99.73 for TD3. Evolutionary variants retain capability under terrain transfer.
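The evolutionary half of a CEM-TD3 hybrid is the cross-entropy method, which can be sketched on its own: sample parameter vectors from a Gaussian, keep the highest-reward elites, and refit the Gaussian to them. How the paper couples this with TD3's gradient updates is not described in this digest, so the sketch below shows plain CEM only. The toy reward, the hyperparameters, and the function names are assumptions for illustration.

```python
import random

def cem_maximize(reward_fn, dim, iterations=40, population=100,
                 elite_frac=0.2, init_std=2.0, seed=0):
    """Plain cross-entropy method with a diagonal Gaussian search distribution."""
    rng = random.Random(seed)
    mean, std = [0.0] * dim, [init_std] * dim
    n_elite = max(1, int(population * elite_frac))
    for _ in range(iterations):
        # Sample a population of candidate parameter vectors.
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(population)]
        # Keep the top-scoring candidates.
        elites = sorted(samples, key=reward_fn, reverse=True)[:n_elite]
        # Refit the Gaussian to the elites; the small floor avoids total collapse.
        mean = [sum(e[i] for e in elites) / n_elite for i in range(dim)]
        std = [(sum((e[i] - mean[i]) ** 2 for e in elites) / n_elite) ** 0.5 + 1e-3
               for i in range(dim)]
    return mean

def toy_reward(params):
    """Made-up reward: negative squared distance to a hidden target vector."""
    target = [1.5, -0.5, 2.0]
    return -sum((p - t) ** 2 for p, t in zip(params, target))

best = cem_maximize(toy_reward, dim=3)
print([round(p, 2) for p in best], round(toy_reward(best), 4))
```

CEM needs only reward evaluations and no gradients, which is one reason population-based search is often paired with gradient-based RL on reward landscapes that are hard to follow by gradient alone.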

SLAM, Localization & Autonomous Driving

Avilasha Mandal, Rajesh Kumar, Sudarshan Sunil Harithas, Chetan Arora
Core Contributions
  • VGGT (Visual Geometry Grounded Transformer) front-end improves feature-matching robustness, targeting the low-texture environments where feature-based systems such as ORB-SLAM tend to degrade
  • DEM-based graph backend + DINOv2 embeddings achieve state-of-the-art SLAM accuracy by integrating semantic and geometric constraints
  • Restores high-cadence local bundle adjustment, critical for real-time applications where drift accumulates quickly
Abstract
VGGT front-end + DEM-based graph + DINOv2 embeddings. Restores high-cadence local bundle adjustment. State-of-the-art SLAM accuracy.
Wei Zhang, Vincent Ress, David Skuddis, Uwe Soergel, Norbert Haala
Core Contributions
  • Geodetic total station ground truth (not GNSS) supports centimeter-level validation outdoors and decimeter-level validation indoors, covering settings where GNSS is unreliable
  • Shows that SE(3) alignment underestimates error by up to 76%, since aligning the estimate to ground truth before scoring can absorb global offset and rotation errors
  • Dataset enables honest benchmarking of multi-sensor fusion systems in realistic degraded scenarios, critical for autonomous vehicles
Abstract
Dataset with geodetic total station ground truth (not GNSS). SE(3) alignment underestimates error by up to 76%. Centimeter-level accuracy outdoors, decimeter indoors.
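Why alignment can hide error is easiest to see in a toy case. The sketch below is a planar (SE(2)) analog of the SE(3) protocol, written under stated assumptions: the trajectory, the 3 degree heading error, and the offset are made up, the helper names are hypothetical, and the numbers do not reproduce the paper's 76% figure. The estimate has a perfect local shape but a global georeferencing error, and a best-fit rigid alignment removes that error before it is scored.

```python
import math

def rmse(est, gt):
    """Root-mean-square position error between matched trajectory points."""
    sq = sum((ex - gx) ** 2 + (ey - gy) ** 2 for (ex, ey), (gx, gy) in zip(est, gt))
    return math.sqrt(sq / len(gt))

def align_rigid_2d(est, gt):
    """Least-squares rotation + translation of est onto gt (no scale)."""
    n = len(est)
    mex, mey = sum(p[0] for p in est) / n, sum(p[1] for p in est) / n
    mgx, mgy = sum(p[0] for p in gt) / n, sum(p[1] for p in gt) / n
    s_dot = s_cross = 0.0
    for (ex, ey), (gx, gy) in zip(est, gt):
        ax, ay, bx, by = ex - mex, ey - mey, gx - mgx, gy - mgy
        s_dot += ax * bx + ay * by
        s_cross += ax * by - ay * bx
    theta = math.atan2(s_cross, s_dot)  # optimal rotation angle in closed form
    c, s = math.cos(theta), math.sin(theta)
    return [(c * (ex - mex) - s * (ey - mey) + mgx,
             s * (ex - mex) + c * (ey - mey) + mgy) for ex, ey in est]

# Made-up L-shaped ground-truth path (metres).
gt = [(float(i), 0.0) for i in range(10)] + [(9.0, float(j)) for j in range(1, 6)]

# Estimate: same shape, but rotated by 3 degrees and shifted by (0.5, -0.2) m.
a = math.radians(3.0)
est = [(math.cos(a) * x - math.sin(a) * y + 0.5,
        math.sin(a) * x + math.cos(a) * y - 0.2) for x, y in gt]

raw_error = rmse(est, gt)
aligned_error = rmse(align_rigid_2d(est, gt), gt)
print(f"raw RMSE: {raw_error:.3f} m, aligned RMSE: {aligned_error:.2e} m")
```

In this constructed case the aligned error is numerically zero even though every point is roughly half a metre from its true position, which is the kind of absolute error that independent ground truth exposes.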
Timothy K Johnsen, Marco Levorato
Core Contributions
  • Adaptive depth estimation scales computational cost based on navigation context—75% energy reduction on edge hardware (Jetson Orin Nano) without sacrificing accuracy
  • 7.43% navigation accuracy improvement demonstrates that selective refinement is beneficial, not just cost-saving
  • Enables deployment to resource-constrained platforms, critical for swarms and long-endurance missions
Abstract
Adaptive system scaling depth estimation complexity based on navigation needs. 75% energy reduction, 7.43% navigation accuracy improvement on NVIDIA Jetson Orin Nano.
Diyi Liu, Zihan Niu, Tu Xu, Lishan Sun
Core Contributions
  • Pure Transformer (no RNNs) with two-track architecture jointly predicts trajectories and behavioral intentions, eliminating decoupling errors
  • Residual offset learning discovers trajectory groups self-supervised, reducing annotation burden for motion datasets
  • Applies to autonomous driving prediction, enabling better anticipation of multi-modal vehicle futures without explicit mode labels
Abstract
Pure Transformer for trajectory + intention prediction with two-track design. Learns ordered trajectory groups via residual offsets.

Human Data, Bio-Inspired & Infrastructure

Hang Zou, Yuzhi Yang, Lina Bariah, Yu Tian, Yuhuan Lu
Core Contributions
  • Telecom World Model architecture applies learned, action-conditioned, uncertainty-aware dynamics modeling to 6G network slicing—bridges embodied AI and telecom systems
  • Three-layer architecture unifies digital twins, foundation models, and planning, carrying world-model ideas from robotics over to telecom
  • Proof-of-concept on network slicing illustrates the potential value for infrastructure optimization
Abstract
Telecom World Model architecture for learned, action-conditioned, uncertainty-aware modeling of telecom dynamics. Three-layer architecture for 6G. Proof-of-concept on network slicing.
Wenjing Margaret Mao, Jefferson Ng, Luyang Hu, Daniel Gehrig, Antonio Loquercio
Core Contributions
  • Hybrid wearable (IMUs + Project Aria glasses) estimates full 3D pose and body shape from egocentric view, solving the cold-start problem for humanoid policy learning
  • Outperforms previous egocentric baselines and is comparable to SAM3D, suggesting that fusing IMUs with egocentric vision helps in-the-wild capture
  • Enables cost-effective on-the-job human motion capture for robotics—reduces instrumentation burden for data collection in real environments
Abstract
Hybrid wearable fusing IMUs with Project Aria glasses for full 3D pose/body shape estimation. Outperforms egocentric baselines, comparable to SAM3D. Demonstrated for humanoid policy learning.
Akihiro Miki, Shun Hasegawa, Sota Yuzaki, Yuta Sahara, Yoshimoto Ribayashi
Core Contributions
  • Biomimetic joint with a Type I joint receptor analog achieves <2 degrees average error in bending and twisting
  • Suggests that joint receptors play a greater role in proprioception than previously thought
  • Opens path to biologically-inspired sensing in robots, potentially simpler and more robust than vision-based proprioception
Abstract
Biomimetic joint mimicking Type I joint receptors achieves <2 degrees average error in bending and twisting. Suggests joint receptors play greater role in proprioception than previously thought.
Shaoshan Liu, Jie Tang, Marwa S. Hassan, Mohamed H. Sharkawy, Moustafa M. G. Fouda
Core Contributions
  • Argues for infrastructure-first approach to embodied AI deployment in resource-limited settings—prioritizes grid power, compute, connectivity over algorithms
  • Outlines practical requirements for scaling embodied intelligence beyond well-resourced labs, addressing a critical gap in robotics deployment literature
  • Emphasizes that robotics accessibility requires infrastructure investment, not just algorithmic innovation—reshapes how we should think about global impact
Abstract
Argues for infrastructure-first approach to Embodied AI for Science. Outlines requirements for deploying embodied intelligence at scale in resource-limited settings.
Mathis Hocke, Andreas Gerken, David Bierbach, Jens Krause, Tim Landgraf
Core Contributions
  • RL-based RoboFish autonomously evaluates fish behavior models through closed-loop interaction—novel approach to model validation that doesn't require labeled data
  • Neural network fish model shows the smallest sim-to-real gap among the learned and hand-crafted models compared, suggesting it captures real fish behavior most faithfully
  • Demonstrates that robots can serve as experimental instruments for behavioral science, not only as beneficiaries of it
Abstract
RL-based RoboFish evaluates fish behavior models through closed-loop interaction. Neural network fish model shows smallest sim-to-real gap.