🤖 Robotics arXiv Digest

🔭 Research Landscape

Today's 20 papers reveal a field grappling with a fundamental credibility gap in how it evaluates itself. The VLA cluster is particularly cohesive — three papers converge on the concern that prevailing metrics systematically mislead. From Inference Efficiency to Embodied Efficiency demonstrates that compression methods scoring well on FLOPs and token throughput often increase real execution cost or degrade motion quality. FASTER identifies that standard flow-matching schedules force all denoising steps to complete before any action can start, creating avoidable reaction latency. And the mechanistic study Not All Features Are Created Equal reveals that VLAs largely ignore language when visual context is sufficient — encoding spatially-bound motor programs tied to scene coordinates rather than abstract multi-modal representations. Taken together, these three papers constitute a quiet indictment of VLA benchmarking practice and point toward the need for better evaluation protocols and more honest architectural analysis before the community declares these models "solved."

A second dominant thread is the tension between scalable data collection and physical fidelity. V-Dreamer uses video generative models as motion priors to auto-synthesize manipulation environments from text, removing the dependence on fixed asset libraries. Fire as a Service augments existing simulators with high-fidelity thermodynamic fire dynamics for hazardous scenario training. OmniVTA contributes OmniViTac, a visuo-tactile dataset of 21,000+ trajectories across 86 tasks, positioned against earlier datasets that its authors describe as small in scale and narrow in task coverage. These represent different responses to the same bottleneck: real-world data is too expensive and narrow to train generalizable robots. A related paper, ViTac-Tracing, reaches a 65% average success rate on unseen deformable objects through imitation learning with combined visual and tactile sensing, suggesting that sensing design and training objectives matter as much as scale. The field will likely need all three approaches (generative synthesis, physics co-simulation, and large-scale hardware collection) rather than any single solution.

A third thread bridges distributed computation and fundamental theoretical limits. The ADMM-MPC paper cuts average per-cycle planning time by up to 51% relative to centralized MPC in four-agent quadruped simulations (roughly a 2× speedup) while retaining control barrier function safety constraints. GoC-MPC enables model-free multi-agent manipulation planning from visual observations alone, without training data or environment models. Meanwhile, the information-theoretic paper Fundamental Limits for Sensor-Based Control, the highest-ranked paper by author h-index, derives a lower bound on achievable controller cost via the Gibbs variational principle; the bound tightens self-consistently as the controller improves, offering the kind of principled benchmarking the field is hungry for. The convergence of theory (paper #1), distributed optimization (paper #7), and learning-based coordination (paper #18) suggests this sub-field is maturing quickly and that its theory-practice gaps are becoming bridgeable.

📂 Papers by Research Area

Control Theory & Optimal Design

Information-theoretic limits, hardware-control co-design via bilevel optimization

#1 Fundamental Limits · #6 Quadrupedal Skating

Tactile & Contact-Rich Manipulation

Visuo-tactile sensing, deformable objects, contact world models

#2 ViTac-Tracing · #19 OmniVTA

Hardware Design & Mechanisms

Continuum robots, passive elastic mechanisms, airdrop sensor systems

#3 Tendon Continuum Robot · #14 Elastic-Folding Sensor

Agentic AI & Human-Robot Interaction

LLM-driven robot programming, multi-actor event grounding, HRI awareness

#4 RAPID EV Disassembly · #12 MERGE

Robot Learning & Simulation

Dynamics-grounded RL, generative simulation, hazardous environment training

#5 ABD-Net · #11 V-Dreamer · #16 FaaS

Multi-Robot Coordination

Distributed MPC, multi-agent TAMP, learned multi-objective routing

#7 ADMM-MPC · #13 GoC-MPC · #18 CAMO

Navigation & Spatial Grounding

Zero-shot object navigation, metric-semantic grounding, dynamic SLAM

#9 REST · #10 MAPG · #15 DROID-SLAM

VLA & Foundation Models

Embodied efficiency metrics, real-time flow inference, mechanistic VLA analysis

#8 Embodied Efficiency · #17 FASTER · #20 VLA Features

Control Theory & Optimal Design

#1

RANK

h=47

Fundamental Limits for Sensor-Based Control via the Gibbs Variational Principle

📅 2026-03-19 math.OC · cs.RO · eess.SY 👤 Evangelos A. Theodorou (h=47)

Vincent Pacelli, Evangelos A. Theodorou

Core Contributions

Prior information-theoretic bounds evaluated sensors against the uncontrolled system, producing overly conservative bounds that became useless precisely when feedback mattered most; this work fixes that by conditioning on the controlled trajectory distribution via the Gibbs variational principle.
The bound applies broadly to nonlinear, nonholonomic, and hybrid dynamics with unbounded costs — a significantly larger class than prior approaches limited to linear or smooth systems.
A self-consistent refinement tightens the bound iteratively: a good controller concentrates the state, limiting extractable sensor information, which in turn tightens the bound further — the fixed-point equation has a unique solution computable by bisection.
Provable convexity conditions guarantee the free energy minimization yields a certifiably correct numerical bound, not just a heuristic approximation.
On a nonlinear Dubins car tracking problem, the self-consistent bound captures most of the optimal cost across sensor noise levels, whereas the open-loop variant is vacuous at low noise, so the refinement matters in practice and not only in theory.
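The Gibbs variational principle behind the bound can be checked numerically in its simplest, finite form. The sketch below is an illustration only: it uses a three-state discrete distribution with made-up numbers rather than the paper's path measures, and all function names are hypothetical. It verifies that -log E_p[exp(-c)] equals the minimum over q of E_q[c] + KL(q || p), attained at q* proportional to p * exp(-c).

```python
import math
import random


def free_energy(p, c):
    """-log E_p[exp(-c)] for a discrete reference distribution p and cost c."""
    return -math.log(sum(pi * math.exp(-ci) for pi, ci in zip(p, c)))


def gibbs_objective(q, p, c):
    """E_q[c] + KL(q || p); states with q_i == 0 contribute nothing."""
    return sum(qi * (ci + math.log(qi / pi))
               for qi, pi, ci in zip(q, p, c) if qi > 0)


def gibbs_minimizer(p, c):
    """q*(x) proportional to p(x) * exp(-c(x))."""
    w = [pi * math.exp(-ci) for pi, ci in zip(p, c)]
    z = sum(w)
    return [wi / z for wi in w]


# Hypothetical toy numbers, chosen only for illustration.
p = [0.5, 0.3, 0.2]
c = [1.0, 0.2, 2.5]
f = free_energy(p, c)
q_star = gibbs_minimizer(p, c)

rng = random.Random(0)


def random_distribution(n):
    w = [rng.random() + 1e-9 for _ in range(n)]
    s = sum(w)
    return [wi / s for wi in w]


# Every other q should score at least f; the gap equals KL(q || q*).
gaps = [gibbs_objective(random_distribution(3), p, c) - f for _ in range(1000)]
```

Because the gap for any q is exactly KL(q || q*), it is never negative, which is what makes the free energy usable as a lower bound.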

Abstract

Fundamental limits on the performance of feedback controllers are essential for benchmarking algorithms, guiding sensor selection, and certifying task feasibility -- yet few general-purpose tools exist for computing them. Existing information-theoretic approaches overestimate the information a sensor must provide by evaluating it against the uncontrolled system, producing bounds that degrade precisely when feedback is most valuable. We derive a lower bound on the minimum expected cost of any causal feedback controller under partial observations by applying the Gibbs variational principle to the joint path measure over states and observations. The bound applies to nonlinear, nonholonomic, and hybrid dynamics with unbounded costs and admits a self-consistent refinement: any good controller concentrates the state, which limits the information the sensor can extract, which tightens the bound. The resulting fixed-point equation has a unique solution computable by bisection, and we provide conditions under which the free energy minimization is provably convex, yielding a certifiably correct numerical bound. On a nonlinear Dubins car tracking problem, the self-consistent bound captures most of the optimal cost across sensor noise levels, while the open-loop variant is vacuous at low noise.

#6

RANK

h=22

Efficient and Versatile Quadrupedal Skating: Optimal Co-design via Reinforcement Learning and Bayesian Optimization

📅 2026-03-19 cs.RO 👤 Josiah P. Hanna (h=22)

Hanwen Wang, Zhenlong Fang, Josiah Hanna, Xiaobin Xiong

Core Contributions

Passive-wheel skating tightly couples mechanical design and control, so optimizing either one in isolation leaves performance on the table; this work addresses quadrupedal skating through joint hardware-control co-design.
A bilevel optimization framework separates concerns: Bayesian Optimization searches wheel geometry and placement at the upper level, while RL trains a motor control policy for each candidate design at the lower level, enabling scalable design space exploration.
The co-designed design-policy pairs outperform human-engineered baselines and exhibit versatile behaviors such as "hockey stop" (rapid braking by turning sideways to maximize friction) and self-aligning motion (automatic reorientation to improve energy efficiency in the direction of travel), suggesting the optimizer found non-obvious design-control synergies.
Passive-wheel skating reduces leg inertia and improves energy efficiency, particularly at high speeds, but without direct wheel actuation the benefit depends on a mechanical design that is matched to the controller.
Offers the first system-level study of dynamic skating motion on quadrupedal robots, giving future work on passive-wheel locomotion a baseline to compare against.
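The bilevel structure described above can be sketched as two nested loops. This is a minimal illustration under loud assumptions: random search stands in for Bayesian Optimization at the upper level, a scalar hill climb stands in for reinforcement learning at the lower level, and the reward function, the single design parameter, and all names are hypothetical rather than taken from the paper.

```python
import random


def train_policy(design, steps=200, seed=0):
    """Lower level (stand-in for RL): hill-climb one scalar policy gain for a
    fixed design and return (best_gain, best_return)."""
    rng = random.Random(seed)

    def episode_return(gain):
        # Hypothetical reward: the best gain depends on the design parameter.
        return -(gain - 2.0 * design) ** 2 - 0.5 * (design - 0.6) ** 2

    gain, best = 0.0, episode_return(0.0)
    for _ in range(steps):
        candidate = gain + rng.gauss(0.0, 0.1)
        ret = episode_return(candidate)
        if ret > best:  # keep only improvements
            gain, best = candidate, ret
    return gain, best


def co_design(n_candidates=30, seed=0):
    """Upper level (random search standing in for Bayesian Optimization):
    score each candidate design by the return of its own trained policy."""
    rng = random.Random(seed)
    best = None
    for i in range(n_candidates):
        design = rng.uniform(0.0, 1.0)  # e.g. a normalized wheel-placement parameter
        gain, ret = train_policy(design, seed=i)
        if best is None or ret > best[2]:
            best = (design, gain, ret)
    return best


design, gain, ret = co_design()
```

The point of the structure is that a design is never scored on its own: each candidate is judged by the policy trained specifically for it, which is what lets design-control synergies surface.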

Abstract

In this paper, we present a hardware-control co-design approach that enables efficient and versatile roller skating on quadrupedal robots equipped with passive wheels. Passive-wheel skating reduces leg inertia and improves energy efficiency, particularly at high speeds. However, the absence of direct wheel actuation tightly couples mechanical design and control. To unlock the full potential of this modality, we formulate a bilevel optimization framework: an upper-level Bayesian Optimization searches the mechanical design space, while a lower-level Reinforcement Learning trains a motor control policy for each candidate design. The resulting design-policy pairs not only outperform human-engineered baselines, but also exhibit versatile behaviors such as hockey stop (rapid braking by turning sideways to maximize friction) and self-aligning motion (automatic reorientation to improve energy efficiency in the direction of travel), offering the first system-level study of dynamic skating motion on quadrupedal robots.

Tactile & Contact-Rich Manipulation

#2

RANK

h=46

ViTac-Tracing: Visual-Tactile Imitation Learning of Deformable Object Tracing

📅 2026-03-19 cs.RO 👤 Y. Demiris (h=46)

Yongqiang Zhao, Haining Luo, Yupeng Wang, Emmanouil Spyrakos Papastavridis, Yiannis Demiris

Core Contributions

Existing deformable object tracing methods require either object-specific geometric models or sim-to-real transfer that degrades in practice; this work sidesteps both by learning directly from visual+tactile demonstrations, enabling category-level generalization.
A single unified model handles both 1D (cables, ropes) and 2D (cloth, fabric) deformable object tracing — prior methods typically required separate architectures per object category.
A weighted loss that penalizes actions moving contact away from the tactile image center enforces fine-grained contact maintenance without hand-engineered reward shaping, learning the "stay centered" heuristic directly from demonstrations.
A tracing task loss supervises task progression explicitly, improving long-horizon consistency by teaching the policy when it is making forward progress versus drifting.
Achieves an average success rate of 80% on seen objects and 65% on unseen objects across diverse 1D and 2D deformable objects, indicating meaningful generalization beyond the training set.
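One way to picture the center-weighted loss is as a per-sample weight on an ordinary behavior-cloning error. The sketch below is an assumption-heavy illustration: the Gaussian weighting, the `sigma_frac` parameter, the contact-point representation, and the function names are all hypothetical, since the digest does not give the paper's actual formulation.

```python
import math


def center_weight(contact_xy, image_size, sigma_frac=0.25):
    """Weight that is 1 at the tactile image center and decays with distance.
    The Gaussian form and sigma_frac are assumptions, not the paper's definition."""
    width, height = image_size
    dx = contact_xy[0] - width / 2.0
    dy = contact_xy[1] - height / 2.0
    sigma = sigma_frac * min(width, height)
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))


def weighted_bc_loss(pred_actions, demo_actions, next_contacts, image_size):
    """Weighted mean squared error between predicted and demonstrated actions.
    Samples whose demonstrated action keeps contact near the center count more."""
    total, weight_sum = 0.0, 0.0
    for pred, demo, contact in zip(pred_actions, demo_actions, next_contacts):
        weight = center_weight(contact, image_size)
        error = sum((p - d) ** 2 for p, d in zip(pred, demo))
        total += weight * error
        weight_sum += weight
    return total / weight_sum


# Two hypothetical samples on a 320x240 tactile image: one centered, one at a corner.
loss = weighted_bc_loss(
    pred_actions=[(1.0, 0.0), (0.0, 0.0)],
    demo_actions=[(0.0, 0.0), (0.0, 0.0)],
    next_contacts=[(160.0, 120.0), (0.0, 0.0)],
    image_size=(320, 240),
)
```

In this toy case the error sits on the centered sample, so it dominates the loss; the same error on the corner sample would contribute almost nothing.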

Abstract

Deformable objects often appear in unstructured configurations. Tracing deformable objects helps bringing them into extended states and facilitating the downstream manipulation tasks. Due to the requirements for object-specific modeling or sim-to-real transfer, existing tracing methods either lack generalizability across different categories of deformable objects or struggle to complete tasks reliably in the real world. To address this, we propose a novel visual-tactile imitation learning method to achieve one-dimensional (1D) and two-dimensional (2D) deformable object tracing with a unified model. Our method is designed from both local and global perspectives based on visual and tactile sensing. Locally, we introduce a weighted loss that emphasizes actions maintaining contact near the center of the tactile image, improving fine-grained adjustment. Globally, we propose a tracing task loss that helps the policy to regulate task progression. On the hardware side, to compensate for the limited features extracted from visual information, we integrate tactile sensing into a low-cost teleoperation system considering both the teleoperator and the robot. Extensive ablation and comparative experiments on diverse 1D and 2D deformable objects demonstrate the effectiveness of our approach, achieving an average success rate of 80% on seen objects and 65% on unseen objects.

#19

RANK

h=8

OmniVTA: Visuo-Tactile World Modeling for Contact-Rich Robotic Manipulation

📅 2026-03-19 cs.RO 👤 Yuhang Zheng (h=8)

Yuhang Zheng, Songen Gu, Weize Li, Yupeng Zheng, Yujie Zang

Core Contributions

The data bottleneck in visuo-tactile manipulation has been severe: the authors describe existing datasets as small in scale and narrow in task coverage, whereas OmniViTac provides 21,000+ trajectories across 86 tasks and 100+ objects.
Six physics-grounded interaction patterns provide a principled taxonomy of contact types, enabling systematic evaluation of whether models generalize across contact regimes rather than memorizing task-specific behaviors.
Using tactile signals to drive a predictive world model, rather than treating them as passive observations, allows the policy to anticipate short-horizon contact evolution: a shift from reactive to anticipatory control.
A 60Hz reflexive controller closes the loop on tactile prediction errors, correcting deviations between predicted and observed tactile signals before they propagate into task failures.
Real-robot experiments across all six interaction categories show generalization to unseen objects and geometric configurations, which is consistent with the model capturing contact dynamics rather than memorizing individual tasks.
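The reflexive loop in the list above can be pictured as a fast correction on the gap between predicted and observed tactile signals. The following sketch assumes a scalar contact force, a clipped proportional rule, and a toy contact model in which the command changes the force directly; none of these details come from the paper, and the names are hypothetical.

```python
def reflex_correction(predicted, observed, gain=0.5, max_step=1.0):
    """Clipped proportional correction on the tactile prediction error.
    The proportional form is an assumption made for illustration."""
    step = gain * (predicted - observed)
    return max(-max_step, min(max_step, step))


def run_reflex_loop(predicted_force, initial_force, rate_hz=60, duration_s=1.0):
    """Simulate one second of a scalar contact force being pulled toward the
    world model's prediction, one correction per control tick."""
    force = initial_force
    history = [force]
    for _ in range(int(round(duration_s * rate_hz))):
        # Toy contact model: the commanded correction changes the force directly.
        force += reflex_correction(predicted_force, force)
        history.append(force)
    return history


history = run_reflex_loop(predicted_force=2.0, initial_force=0.5)
errors = [abs(2.0 - f) for f in history]
```

With a gain of 0.5 the error halves on every tick in this toy model, so a deviation is largely gone within a few ticks at 60Hz.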

Abstract

Contact-rich manipulation tasks, such as wiping and assembly, require accurate perception of contact forces, friction changes, and state transitions that cannot be reliably inferred from vision alone. Despite growing interest in visuo-tactile manipulation, progress is constrained by two persistent limitations: existing datasets are small in scale and narrow in task coverage, and current methods treat tactile signals as passive observations rather than using them to model contact dynamics or enable closed-loop control explicitly. In this paper, we present OmniViTac, a large-scale visuo-tactile-action dataset comprising 21,000+ trajectories across 86 tasks and 100+ objects, organized into six physics-grounded interaction patterns. Building on this dataset, we propose OmniVTA, a world-model-based visuo-tactile manipulation framework that integrates four tightly coupled modules: a self-supervised tactile encoder, a two-stream visuo-tactile world model for predicting short-horizon contact evolution, a contact-aware fusion policy for action generation, and a 60Hz reflexive controller that corrects deviations between predicted and observed tactile signals in a closed loop. Real-robot experiments across all six interaction categories show that OmniVTA outperforms existing methods and generalizes well to unseen objects and geometric configurations.

Hardware Design & Mechanisms

#3

RANK

h=45

Tendon-Actuated Robots with a Tapered, Flexible Polymer Backbone: Design, Fabrication, and Modeling

📅 2026-03-19 cs.RO 👤 J. Gravdahl (h=45)

Harald Minde Hansen, Nandita Gallacher, Nicholas B. Andrews, Kristin Y. Pettersen, Jan Tommy Gravdahl

Core Contributions

Unlike many continuum robots that are single-purpose and costly, this design is 3D-printed from thermoplastic polyurethane (TPU) and prioritizes customizability, rapid assembly, and low cost, which should make it straightforward for other labs to reproduce.
A tapered backbone grades stiffness along the length, stiffer near the base and more compliant toward the tip, enabling high curvature and enhanced distal compliance through geometry alone.
The tendon-actuated Cosserat rod model is extended to account explicitly for spatially varying backbone cross-sectional geometry, a case that the formulations it builds on do not cover explicitly.
The model is validated against motion capture data, achieving centimeter-level shape prediction accuracy after calibrating a single parameter, Young's modulus, via a line search that minimizes modeling error.
Open parameterized CAD scripts enable rapid geometry generation and scaling, lowering the barrier for researchers to replicate or adapt the design for specific inspection or manipulation tasks.
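To see why tapering grades stiffness so strongly, consider the bending stiffness E·I of a rod whose radius shrinks along its length. The sketch below assumes a solid circular cross-section (I = πr⁴/4) with a linear taper; the paper's actual cross-section, dimensions, and modulus are not given in the digest, so every number and name here is hypothetical.

```python
import math


def radius_at(s, length, r_base, r_tip):
    """Radius of a linearly tapered rod at arc length s (0 = base, length = tip)."""
    return r_base + (r_tip - r_base) * (s / length)


def bending_stiffness(s, length, r_base, r_tip, youngs_modulus):
    """E * I for a solid circular cross-section, with I = pi * r**4 / 4."""
    r = radius_at(s, length, r_base, r_tip)
    return youngs_modulus * math.pi * r ** 4 / 4.0


# Hypothetical values: 0.3 m backbone, 10 mm base radius, 5 mm tip radius.
LENGTH, R_BASE, R_TIP, E = 0.3, 0.010, 0.005, 30e6
ei_base = bending_stiffness(0.0, LENGTH, R_BASE, R_TIP, E)
ei_tip = bending_stiffness(LENGTH, LENGTH, R_BASE, R_TIP, E)
stiffness_ratio = ei_base / ei_tip
```

Because stiffness scales with the fourth power of the radius, halving the radius from base to tip makes the tip sixteen times more compliant in bending under these assumptions.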

Abstract

This paper presents the design, modeling, and fabrication of 3D-printed, tendon-actuated continuum robots featuring a flexible, tapered backbone constructed from thermoplastic polyurethane (TPU). Our scalable design incorporates an integrated electronics base housing that enables direct tendon tension control and sensing via actuators and compression load cells. Unlike many continuum robots that are single-purpose and costly, the proposed design prioritizes customizability, rapid assembly, and low cost while enabling high curvature and enhanced distal compliance through geometric tapering, thereby supporting a broad range of compliant robotic inspection and manipulation tasks. We develop a generalized forward kinetostatic model of the tapered backbone based on Cosserat rod theory using a Newtonian approach, extending existing tendon-actuated Cosserat rod formulations to explicitly account for spatially varying backbone cross-sectional geometry. The model is validated against motion capture data, achieving centimeter-level shape prediction accuracy after calibrating Young's modulus via a line search that minimizes modeling error. We further demonstrate teleoperated grasping using an endoscopic gripper routed along the continuum robot, mounted on a 6-DoF robotic arm. Parameterized iLogic/CAD scripts are provided for rapid geometry generation and scaling.

#14

RANK

h=11

A Passive Elastic-Folding Mechanism for Stackable Airdrop Sensors

📅 2026-03-19 cs.RO · eess.SY 👤 T. Sasatani (h=11)

Damyon Kim, Yuichi Honjo, Tatsuya Iizuka, Naomi Okubo, Naoto Endo

Core Contributions

Existing airdrop sensor systems require active actuators for mid-air shape change, adding power consumption, weight, and failure modes; this work achieves full 3D structural deployment entirely passively through pre-programmed elastic energy released on drop.
A single oven-heating fabrication step programs fold angles (10° to 100°, with a standard deviation of 4° in laboratory tests) into hinges made by laminating commercial sheet materials with rigid PCBs, so production needs no specialized equipment and can scale at low cost.
A geometric model linking laminate geometry to fold angle provides predictive design methodology: researchers specify a target configuration and compute required laminate parameters analytically before fabrication.
Field trials with LoRa transmission confirm reliable data collection during real dispersal events — not just controlled lab deployment — validating the system under realistic airdrop conditions.
Horizontal Wind Model (HWM)-based trajectory simulations indicate potential for wide-area sensing exceeding 10 km, supporting the case for low-cost environmental monitoring over large areas.

Abstract

Air-dispersed sensor networks deployed from aerial robotic systems (e.g., UAVs) provide a low-cost approach to wide-area environmental monitoring. However, existing methods often rely on active actuators for mid-air shape or trajectory control, increasing both power consumption and system cost. Here, we introduce a passive elastic-folding hinge mechanism that transforms sensors from a flat, stackable form into a three-dimensional structure upon release. Hinges are fabricated by laminating commercial sheet materials with rigid printed circuit boards (PCBs) and programming fold angles through a single oven-heating step, enabling scalable production without specialized equipment. Our geometric model links laminate geometry, hinge mechanics, and resulting fold angle, providing a predictive design methodology for target configurations. Laboratory tests confirmed fold angles between 10 degrees and 100 degrees, with a standard deviation of 4 degrees and high repeatability. Field trials further demonstrated reliable data collection and LoRa transmission during dispersion, while the Horizontal Wind Model (HWM)-based trajectory simulations indicated strong potential for wide-area sensing exceeding 10 km.

Agentic AI & Human-Robot Interaction

#4

RANK

h=35

Robotic Agentic Platform for Intelligent Electric Vehicle Disassembly (RAPID)

📅 2026-03-19 cs.RO 👤 N. Correll (h=35)

Zachary Allen, Max Conway, Lyle Antieau, Allen Ponraj, Nikolaus Correll

Core Contributions

EV battery disassembly remains manual due to high design variability across manufacturers; RAPID is among the first systems to tackle full-scale battery pack disassembly robotically with open-vocabulary perception, removing the need to retrain for each new EV model.
Open-vocabulary object detection achieves 0.9757 mAP50 on screws, nuts, and busbars — enabling reliable fastener identification on novel battery designs without task-specific training data.
Three fastener removal strategies compared empirically (n=204): taught-in poses (97% success), one-shot vision (57%), visual servoing (83%) — the 40-point gap between taught poses and vision-only execution pinpoints perception-to-action coupling as the primary failure mode.
LLM agents using structured tool-based interfaces achieve 100% task completion, while automatic ROS service discovery fails 43% of the time — a sharp empirical demonstration that structured APIs are necessary (not optional) for reliable LLM robot control.
Edge-hardware results (Qwen 3.5 9B/4B) comparable to GPT-4o-mini for structured tool calls suggest LLM-driven disassembly does not require cloud connectivity, which is critical for factory deployment.
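The contrast between structured tool interfaces and open-ended service discovery can be illustrated with a small registry that only runs declared tools with type-checked arguments. This is a generic sketch, not RAPID's interface: the `Tool` and `ToolRegistry` classes, the `remove_fastener` tool, and its parameters are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Tool:
    name: str
    params: Dict[str, type]      # parameter name -> required Python type
    handler: Callable[..., str]


class ToolRegistry:
    """Only registered tools with exactly matching, type-checked arguments run."""

    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def call(self, name: str, **kwargs) -> str:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        tool = self._tools[name]
        if set(kwargs) != set(tool.params):
            raise TypeError(f"{name} expects arguments {sorted(tool.params)}")
        for key, expected in tool.params.items():
            if not isinstance(kwargs[key], expected):
                raise TypeError(f"{name}: {key} must be {expected.__name__}")
        return tool.handler(**kwargs)


def remove_fastener(fastener_id: int, strategy: str) -> str:
    # Hypothetical handler; a real one would command the robot.
    return f"removed fastener {fastener_id} using {strategy}"


registry = ToolRegistry()
registry.register(
    Tool("remove_fastener", {"fastener_id": int, "strategy": str}, remove_fastener)
)
result = registry.call("remove_fastener", fastener_id=3, strategy="visual_servoing")
```

A malformed request from the language model fails loudly at the registry boundary instead of reaching the robot, which is one plausible reason structured interfaces prove more reliable than letting the model discover services on its own.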

Abstract

Electric vehicles (EV) create an urgent need for scalable battery recycling, yet disassembly of EV battery packs remains largely manual due to high design variability. We present our Robotic Agentic Platform for Intelligent Disassembly (RAPID), designed to investigate perception-driven manipulation, flexible automation, and AI-assisted robot programming in realistic recycling scenarios. The system integrates a gantry-mounted industrial manipulator, RGB-D perception, and an automated nut-running tool for fastener removal on a full-scale EV battery pack. An open-vocabulary object detection pipeline achieves 0.9757 mAP50, enabling reliable identification of screws, nuts, busbars, and other components. We experimentally evaluate (n=204) three one-shot fastener removal strategies: taught-in poses (97% success rate, 24 min duration), one-shot vision execution (57%, 29 min), and visual servoing (83%, 36 min). Tool-based interfaces achieve 100% task completion, while automatic ROS service discovery shows 43.3% failure rates, highlighting the need for structured robot APIs for reliable LLM-driven control.

#12

RANK

h=13

MERGE: Guided Vision-Language Models for Multi-Actor Event Reasoning and Grounding in Human-Robot Interaction

📅 2026-03-19 cs.RO 👤 A. Belardinelli (h=13)

Joerg Deigmoeller, Nakul Agarwal, Stephan Hasler, Daniel Tanneberg, Anna Belardinelli

Core Contributions

HRI systems face a dilemma: use expensive VLMs per frame (high latency, cost) or lightweight streaming models (miss complex reasoning); MERGE resolves this by streaming to detect scene changes and invoking VLMs selectively only when something meaningful changes.
4× runtime reduction vs. VLM-only baselines (including GPT-4o, GPT-5, Gemini 2.5 Flash) — most frames are similar to the previous one, so selective invocation eliminates redundant inference without sacrificing reasoning quality.
2× improvement in grounding score over state-of-the-art VLM baselines, showing that structured perception pipelines add substantial value even on top of very powerful foundation models.
The GROUND dataset fills a genuine gap: no benchmark existed for fine-grained situational awareness in multi-person HRI with temporally consistent actor tracking across interaction sequences.
Actor-action-object relations with temporal consistency enable reasoning about causal event chains over time — a capability that frame-level captioning fundamentally cannot support without persistent actor identity.
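The selective-invocation idea can be sketched as a cheap change detector gating an expensive model. The example below is a simplified stand-in: frames are flat lists of numbers, the change test is a mean absolute difference against the last frame that triggered a call, and the threshold, class name, and stub model are hypothetical rather than MERGE's actual pipeline.

```python
def mean_abs_diff(a, b):
    """Cheap per-frame change score between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


class ChangeGatedReasoner:
    """Calls the expensive model only when the scene has changed enough;
    otherwise reuses the most recent result."""

    def __init__(self, expensive_model, threshold=0.1):
        self.expensive_model = expensive_model
        self.threshold = threshold
        self.last_keyframe = None
        self.last_result = None
        self.calls = 0

    def process(self, frame):
        changed = (
            self.last_keyframe is None
            or mean_abs_diff(frame, self.last_keyframe) > self.threshold
        )
        if changed:
            self.last_result = self.expensive_model(frame)
            self.last_keyframe = list(frame)
            self.calls += 1
        return self.last_result


# Stub standing in for a VLM call.
reasoner = ChangeGatedReasoner(lambda frame: f"scene:{sum(frame):.1f}")
frames = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.1],  # small change: reuse cached result
    [1.0, 1.0, 1.0, 1.0],  # large change: invoke the model
    [1.0, 1.0, 1.0, 0.9],  # small change again
]
outputs = [reasoner.process(f) for f in frames]
```

Four frames produce only two model calls here; the runtime saving in a real system depends on how often the scene actually changes.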

Abstract

We introduce MERGE, a system for situational grounding of actors, objects, and events in dynamic human-robot group interactions. MERGE achieves this by uniquely identifying physical instances of actors and objects and structuring them into actor-action-object relations, ensuring temporal consistency across interactions. Central to MERGE is the integration of Vision-Language Models (VLMs) guided with a perception pipeline: a lightweight streaming module continuously processes visual input to detect changes and selectively invokes the VLM only when necessary. This decoupled design preserves the reasoning power and zero-shot generalization of VLMs while improving efficiency. To address the absence of suitable benchmarks for multi-actor collaboration, we introduce the GROUND dataset, which offers fine-grained situational annotations of multi-person and human-robot interactions. On this dataset, our approach improves the average grounding score by a factor of 2 compared to VLM-only baselines — including GPT-4o, GPT-5 and Gemini 2.5 Flash — while also reducing run-time by a factor of 4.

Robot Learning & Simulation

#5

RANK

h=22

Articulated-Body Dynamics Network: Dynamics-Grounded Prior for Robot Learning

📅 2026-03-19 cs.RO 👤 Josiah P. Hanna (h=22)

Sangwoo Shin, Kunzhao Ren, Xiaobin Xiong, Josiah Hanna

Core Contributions

Most GNNs for robots exploit connectivity structure but ignore dynamics — ABD-Net embeds the Articulated Body Algorithm's inertia propagation directly into the network architecture, creating a prior that mirrors how forces actually propagate through rigid-body chains.
Replacing physical inertial quantities with learnable parameters lets the network start from a physics-informed prior and adapt to model errors, unlike standard GNNs that learn from scratch with no physical knowledge.
Demonstrates improved sample efficiency on simulated humanoid, quadruped, and hopper robots: the dynamics prior reduces the environment interactions needed to learn locomotion, lowering training cost in simulation before sim-to-real transfer.
Generalizes better to dynamics shifts (changed payload, terrain changes) than transformer-based and GNN baselines, suggesting the structural prior acts as a regularizer against overfitting to training conditions.
Validated on real Unitree G1 humanoid and Go2 quadruped with sim-to-real transfer at real-time inference rates — validating the approach beyond controlled benchmarks on state-of-the-art platforms.
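The child-to-parent aggregation pattern that ABD-Net borrows from the Articulated Body Algorithm can be shown on a toy kinematic tree. This sketch keeps only the traversal structure: scalars replace spatial inertia quantities, the per-link `weights` stand in for learnable parameters, and the function name and example tree are hypothetical.

```python
def propagate_child_to_parent(parents, local, weights):
    """Aggregate per-link values from the leaves toward the root.

    parents[i] is the parent index of link i (-1 for the root at index 0), and
    links are ordered so that parents[i] < i. Each link passes its accumulated
    value, scaled by weights[i], up to its parent.
    """
    agg = list(local)
    # Descending order visits every child before its parent.
    for i in range(len(parents) - 1, 0, -1):
        agg[parents[i]] += weights[i] * agg[i]
    return agg


# Hypothetical 5-link tree: 0 is the root, 1 and 3 hang off it, 2 off 1, 4 off 3.
parents = [-1, 0, 1, 0, 3]
unit = [1.0] * 5
subtree_sizes = propagate_child_to_parent(parents, unit, [1.0] * 5)
damped = propagate_child_to_parent(parents, unit, [0.5] * 5)
```

With unit values and unit weights each link ends up holding the size of its subtree, which makes it easy to see that information flows strictly from distal links toward the base.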

Abstract

Recent work in reinforcement learning has shown that incorporating structural priors for articulated robots, such as link connectivity, into policy networks improves learning efficiency. However, dynamics properties, despite their fundamental role in determining how forces and motion propagate through the body, remain largely underexplored as an inductive bias for policy learning. To address this gap, we present the Articulated-Body Dynamics Network (ABD-Net), a novel graph neural network architecture grounded in the computational structure of forward dynamics. Specifically, we adapt the inertia propagation mechanism from the Articulated Body Algorithm, systematically aggregating inertial quantities from child to parent links in a tree-structured manner, while replacing physical quantities with learnable parameters. Through experiments with simulated humanoid, quadruped, and hopper robots, our approach demonstrates increased sample efficiency and generalization to dynamics shifts compared to transformer-based and GNN baselines. We further validate the learned policy on real Unitree G1 and Go2 robots, generating dynamic, versatile and robust locomotion behaviors through sim-to-real transfer with real-time inference.

#11

RANK

h=15

V-Dreamer: Automating Robotic Simulation and Trajectory Synthesis via Video Generation Priors

📅 2026-03-19 cs.RO 👤 Jing Huo (h=15)

Songjia He, Zixuan Chen, Hongyu Ding, Dian Shao, Jieqi Shi

Core Contributions

Prior simulation-based data generation requires manual asset placement and fixed object libraries; V-Dreamer generates complete 3D manipulation environments from text descriptions with zero manual intervention, removing the fixed-vocabulary bottleneck entirely.
Video generative models serve as motion priors: rather than programming expert trajectories, the system generates plausible video of the task being performed and maps this to robot kinematics — a fundamentally different and more scalable source of trajectory supervision.
The Sim-to-Gen visual-kinematic alignment module (using CoTracker3 + VGGT) is the critical bridge between video-space motion and executable robot joint commands — without this alignment, video priors remain unexecutable fantasies.
Policies trained on V-Dreamer data transfer to real-world manipulation with a Piper robotic arm, demonstrating that synthetically generated environments provide training signal that generalizes beyond simulation.
Targets open-vocabulary tasks: environments and expert trajectories are generated directly from natural language instructions, with evaluation so far on tabletop manipulation.

Abstract

Training generalist robots demands large-scale, diverse manipulation data, yet real-world collection is prohibitively expensive, and existing simulators are often constrained by fixed asset libraries and manual heuristics. To bridge this gap, we present V-Dreamer, a fully automated framework that generates open-vocabulary, simulation-ready manipulation environments and executable expert trajectories directly from natural language instructions. V-Dreamer employs a novel generative pipeline that constructs physically grounded 3D scenes using large language models and 3D generative models, validated by geometric constraints to ensure stable, collision-free layouts. Crucially, for behavior synthesis, we leverage video generation models as rich motion priors. These visual predictions are then mapped into executable robot trajectories via a robust Sim-to-Gen visual-kinematic alignment module utilizing CoTracker3 and VGGT. Extensive evaluations on tabletop manipulation tasks using the Piper robotic arm demonstrate that our policies robustly generalize to unseen objects in simulation and achieve effective sim-to-real transfer.

#16

RANK

h=10

Fire as a Service: Augmenting Robot Simulators with Thermally and Visually Accurate Fire Dynamics

📅 2026-03-19 cs.RO · cs.GR 👤 Sören Pirk (h=10)

Anton R. Wagner, Madhan Balaji Rao, Helge Wrede, Sören Pirk, Xuesu Xiao

Core Contributions

Most robot simulators prioritize rigid-body dynamics and photorealistic rendering but largely neglect fire; FaaS augments them with high-fidelity, computationally efficient fire simulation, enabling firefighting robot training and evaluation in simulation.
Asynchronous co-simulation decouples the compute-heavy fire physics from the high-frequency rigid-body control loop, delivering realistic fire dynamics without disrupting control: a key systems engineering contribution.
Multi-species thermodynamic heat transfer and volumetric smoke allow robots to experience realistic sensor degradation and proximity heat effects, critical for learning policies that behave safely around actual fire hazards.
Real-time performance supports human-in-the-loop teleoperation, enabling data collection that successfully produced reactive multimodal firefighting policies via Behavioral Cloning.
Simulator-agnostic design integrates with multiple existing robot simulators, letting the community adopt fire environments without switching simulation infrastructure.
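The decoupling of slow fire physics from the fast control loop can be pictured as a multi-rate loop in which the controller always reads the most recent cached fire state. The sketch below is a single-threaded, deterministic stand-in for asynchronous co-simulation, written that way so it is easy to test; the `SlowFireModel` class, its update rule, and the rates are hypothetical and not FaaS's implementation.

```python
class SlowFireModel:
    """Stand-in for an expensive fire solver that updates a scalar temperature."""

    def __init__(self, temperature=20.0):
        self.temperature = temperature
        self.updates = 0

    def step(self):
        self.temperature += 5.0  # hypothetical heating per solver update
        self.updates += 1
        return self.temperature


def run_cosim(control_steps, fire_every):
    """Run a fast control loop that samples and holds the latest fire state.
    The fire model advances only once every `fire_every` control steps."""
    fire = SlowFireModel()
    latest = fire.temperature  # cached state visible to the control loop
    readings = []
    for k in range(control_steps):
        if k % fire_every == 0:
            # In a truly asynchronous system this result would arrive from
            # another process; here it is produced inline for determinism.
            latest = fire.step()
        readings.append(latest)  # the control step never waits on the solver
    return readings, fire.updates


readings, fire_updates = run_cosim(control_steps=10, fire_every=5)
```

Ten control steps see only two solver updates in this example; the held value between updates is the price paid for keeping the control loop at full rate.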

Abstract

Most existing robot simulators prioritize rigid-body dynamics and photorealistic rendering, but largely neglect the thermally and optically complex phenomena that characterize real-world fire environments. For robots envisioned as future firefighters, this limitation hinders both reliable capability evaluation and the generation of representative training data prior to deployment in hazardous scenarios. To address these challenges, we introduce Fire as a Service (FaaS), a novel, asynchronous co-simulation framework that augments existing robot simulators with high-fidelity and computationally efficient fire simulations. Our pipeline enables robots to experience accurate, multi-species thermodynamic heat transfer and visually consistent volumetric smoke without disrupting high-frequency rigid-body control loops. Crucially, its real-time performance supports human-in-the-loop teleoperation, enabling the successful training of reactive, multimodal policies via Behavioral Cloning. By adding fire dynamics to robot simulations, FaaS provides a scalable pathway toward safer, more reliable deployment of robots in fire scenarios.

Multi-Robot Coordination

RANK

h=19

ADMM-Based Distributed MPC with Control Barrier Functions for Safe Multi-Robot Quadrupedal Locomotion

📅 2026-03-19 cs.RO · math.OC 👤 K. Hamed (h=19)

Yicheng Zeng, Ruturaj S. Sambhus, Basit Muhammad Imran, Jeeseop Kim, Vittorio Pastore

Core Contributions

CBF constraints introduce inter-agent coupling that prevents direct decomposition of multi-robot MPC; a node-edge splitting formulation with consensus constraints separates this coupling into independent local quadratic programs solvable in parallel with neighbor-only communication.
The distributed formulation achieves performance comparable to centralized safety-critical MPC while preserving safety and dynamic feasibility, so decentralization does not trade safety for speed.
Reduces average per-cycle planning time by up to 51% vs. centralized MPC in the four-agent simulations, a meaningful speedup that helps make real-time decentralized planning practical.
Validated in hardware on two Unitree Go2 robots navigating rough terrain with external disturbances, a robustness test that many multi-robot papers skip in favor of simulation-only evaluation.
Modular hierarchical architecture (distributed planning → nonlinear MPC → whole-body control) allows the distributed layer to plug into existing locomotion stacks without replacing lower-level controllers.
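To give a feel for how ADMM lets agents solve local problems and still agree on a shared answer, here is a minimal sketch of global-consensus ADMM on scalar quadratic costs. It is not the paper's node-edge splitting and contains no CBF constraints or dynamics: the function `consensus_admm`, the costs `(a_i / 2) * (x - b_i)^2`, and the parameter values are illustrative assumptions, and the averaging step uses all agents rather than neighbor-only communication.

```python
def consensus_admm(a, b, rho=1.0, iters=200):
    """Minimize sum_i (a_i / 2) * (x - b_i)^2 with one local copy x_i per agent.

    Scaled-form ADMM: each agent solves its own small problem in closed form,
    then the copies are pulled toward a shared consensus value z.
    """
    n = len(a)
    x = [0.0] * n  # local copies
    u = [0.0] * n  # scaled dual variables
    z = 0.0        # consensus variable
    for _ in range(iters):
        # Local step: independent per agent, so it could run in parallel.
        x = [(a[i] * b[i] + rho * (z - u[i])) / (a[i] + rho) for i in range(n)]
        # Consensus step: average of the local copies plus duals.
        z = sum(x[i] + u[i] for i in range(n)) / n
        # Dual step: accumulate each agent's disagreement with the consensus.
        u = [u[i] + x[i] - z for i in range(n)]
    return z, x


a = [1.0, 2.0, 4.0]
b = [0.0, 3.0, 5.0]
z_admm, x_local = consensus_admm(a, b)

# Centralized solution of the same problem, available in closed form here.
z_central = sum(ai * bi for ai, bi in zip(a, b)) / sum(a)
```

On this toy problem the distributed iterates settle on the centralized minimizer, which is the property the summary above attributes, at much larger scale, to the distributed MPC.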

Abstract

This paper proposes a fully decentralized model predictive control (MPC) framework with control barrier function (CBF) constraints for safety-critical trajectory planning in multi-robot legged systems. The incorporation of CBF constraints introduces explicit inter-agent coupling, which prevents direct decomposition of the resulting optimal control problems. To address this challenge, we reformulate the centralized safety-critical MPC problem using a structured distributed optimization framework based on the alternating direction method of multipliers (ADMM). By introducing a novel node-edge splitting formulation with consensus constraints, the proposed approach decomposes the global problem into independent node-local and edge-local quadratic programs that can be solved in parallel using only neighbor-to-neighbor communication. This enables fully decentralized trajectory optimization with symmetric computational load across agents while preserving safety and dynamic feasibility. The effectiveness of the proposed approach is demonstrated through hardware experiments on two Unitree Go2 quadrupedal robots and numerical simulations involving up to four robots, showing that the proposed distributed formulation achieves performance comparable to centralized MPC while reducing the average per-cycle planning time by up to 51% in the four-agent case.

#13

RANK

h=12

Graph-of-Constraints Model Predictive Control for Reactive Multi-agent Task and Motion Planning

📅 2026-03-19 cs.RO 👤 A. H. Qureshi (h=12)

Anastasios Manganaris, Jeremy Lu, Ahmed H. Qureshi, Suresh Jagannathan

Core Contributions

Existing sequence-of-constraints TAMP approaches require static agent assignments and cannot recover when disturbances force task reassignment; GoC-MPC natively supports partially ordered tasks and dynamic agent assignments, without replanning from scratch.
Constraints defined over tracked 3D keypoints (not CAD models) make the approach model-free — no prior environment knowledge, no training data required, just visual observations from cameras in the workspace.
The graph structure encodes task dependencies without reducing to a fixed linear sequence: tasks can be ordered, parallel, or conditionally dependent, naturally handling the partial ordering common in real manipulation workflows.
Achieves higher success rates, faster TAMP computation, and shorter paths than recent baselines, improving all three reported metrics at once rather than trading one against another.
Adapts online to disturbances without full replanning, critical for real-world deployment where task allocations can change mid-execution due to unforeseen events.
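The two ideas in the bullets above, partial ordering and dynamic agent assignment, can be sketched on a toy task graph. This is an illustrative assumption and not GoC-MPC itself: there is no MPC or keypoint tracking here, and `ready_tasks`, `assign`, the task names, and the greedy nearest-agent rule are all hypothetical.

```python
from math import dist


def ready_tasks(deps, done):
    """Tasks whose prerequisites are all complete (a partial order, not a fixed sequence)."""
    return sorted(t for t, pre in deps.items() if t not in done and pre <= done)


def assign(ready, task_pos, agent_pos):
    """Greedy assignment: each ready task takes the nearest agent that is still free."""
    free = dict(agent_pos)
    plan = {}
    for task in ready:
        if not free:
            break
        agent = min(free, key=lambda a: dist(free[a], task_pos[task]))
        plan[task] = agent
        del free[agent]
    return plan


# Two picks can happen in either order or in parallel; stacking needs both.
deps = {"pick_A": set(), "pick_B": set(), "stack_A_on_B": {"pick_A", "pick_B"}}
task_pos = {"pick_A": (0.0, 0.0), "pick_B": (1.0, 0.0), "stack_A_on_B": (0.5, 0.5)}

agents = {"r1": (0.0, 0.1), "r2": (1.0, 0.1)}
plan_before = assign(ready_tasks(deps, done=set()), task_pos, agents)

# Disturbance: r1 is pushed far away and r2 ends up next to pick_A.
agents_after = {"r1": (2.0, 0.0), "r2": (0.2, 0.0)}
plan_after = assign(ready_tasks(deps, done=set()), task_pos, agents_after)
```

Recomputing the assignment from the current agent positions swaps the two robots' tasks after the disturbance, without touching the task graph.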

Abstract

Sequences of interdependent geometric constraints are central to many multi-agent Task and Motion Planning (TAMP) problems. However, existing methods for handling such constraint sequences struggle with partially ordered tasks and dynamic agent assignments. They typically assume static assignments and cannot adapt when disturbances alter task allocations. To overcome these limitations, we introduce Graph-of-Constraints Model Predictive Control (GoC-MPC), a generalized sequence-of-constraints framework integrated with MPC. GoC-MPC naturally supports partially ordered tasks, dynamic agent coordination, and disturbance recovery. By defining constraints over tracked 3D keypoints, our method robustly solves diverse multi-agent manipulation tasks from visual observations alone, without relying on training data or environment models. Experiments demonstrate that GoC-MPC achieves higher success rates, significantly faster TAMP computation, and shorter overall paths compared to recent baselines.

#18

RANK

h=9

CAMO: A Conditional Neural Solver for the Multi-objective Multiple Traveling Salesman Problem

📅 2026-03-19 cs.RO · cs.AI 👤 G. Sartoretti (h=9)

Fengxiaoxiao Li, Xiao Mao, Mingfeng Fan, Yifeng Zhang, Yi Li

Core Contributions

Learning-based solvers have mostly tackled multi-objective routing and multi-agent routing separately; CAMO addresses both in a single neural solver, handling the two sources of complexity together.
A conditional encoder fuses user-specified preference vectors into instance representations at test time, enabling explicit control over the objective tradeoff (travel cost vs. makespan) without retraining for each preference setting.
The collaborative decoder alternates between selecting which agent acts next and which node that agent visits, learning coordination strategies without centralized communication overhead or explicit coordination protocols.
Training on mixed problem sizes via REINFORCE enables generalization to varying numbers of agents and targets at test time — prior methods required separate training for each configuration.
Real mobile robot experiments validate practical applicability beyond combinatorial benchmarks, demonstrating that learned multi-agent routing transfers to physical deployment conditions.
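The tradeoff that the preference vector controls can be made concrete by computing both objectives for two hand-written solutions. This is a sketch under stated assumptions, not CAMO's model: the weighted-sum `scalarize` function, the helper names, and the coordinates are hypothetical, and no learning is involved.

```python
from math import dist


def tour_length(depot, targets):
    """Length of the tour depot -> targets in order -> depot."""
    stops = [depot, *targets, depot]
    return sum(dist(p, q) for p, q in zip(stops, stops[1:]))


def objectives(depot, tours):
    """Return (total travel cost, makespan) for one tour per robot."""
    lengths = [tour_length(depot, tour) for tour in tours]
    return sum(lengths), max(lengths)


def scalarize(total, makespan, preference):
    """Weighted sum of the two objectives (an assumed scalarization)."""
    w_total, w_makespan = preference
    return w_total * total + w_makespan * makespan


depot = (0.0, 0.0)
solo = [[(4.0, 0.0), (4.0, 1.0)], []]    # one robot visits both targets
split = [[(4.0, 0.0)], [(4.0, 1.0)]]     # each robot visits one target


def best_solution(preference):
    """Pick whichever candidate has the lower scalarized cost."""
    candidates = {"solo": solo, "split": split}
    return min(
        candidates,
        key=lambda name: scalarize(*objectives(depot, candidates[name]), preference),
    )
```

A preference of `(1.0, 0.0)` (travel cost only) favors the solo plan, while `(0.0, 1.0)` (makespan only) favors the split plan; a conditional solver aims to produce the appropriate plan for whichever preference is supplied at test time.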

Abstract

Robotic systems often require a team of robots to collectively visit multiple targets while optimizing competing objectives, such as total travel cost and makespan. This setting can be formulated as the Multi-Objective Multiple Traveling Salesman Problem (MOMTSP). Although learning-based methods have shown strong performance on the single-agent TSP and multi-objective TSP variants, they rarely address the combined challenges of multi-agent coordination and multi-objective trade-offs. To bridge this gap, we propose CAMO, a conditional neural solver for MOMTSP that generalizes across varying numbers of targets, agents, and preference vectors, and yields high-quality approximations to the Pareto front. Extensive experiments show that CAMO outperforms both neural and conventional heuristics, and real-world tests on a mobile robot platform demonstrate its practical applicability.

Navigation & Spatial Grounding

RANK

h=17

REST: Receding Horizon Explorative Steiner Tree for Zero-Shot Object-Goal Navigation

📅 2026-03-19 cs.RO · cs.AI · cs.CV 👤 Maani Ghaffari (h=17)

Shuqi Xiao, Maani Ghaffari, Chengzhong Xu, Hui Kong

Core Contributions

Prior LLM-based navigation planners propose isolated waypoints as options, discarding information gathered en route; REST's key insight is that options should be full paths, where en-route information gain through unexplored regions is explicitly scored alongside destination utility.
Organizing candidate paths as a Steiner tree with shared segments allows the LLM to reason coarse-to-fine — dismissing entire branches before examining individual leaves, compressing a combinatorial path space into an efficient reasoning hierarchy.
Builds an explicit open-vocabulary 3D map online from RGB-D streams without requiring prior maps, enabling zero-shot deployment in previously unseen environments with no map preprocessing.
Ranks among the top methods in success rate on Gibson, HM3D, and HSSD while achieving best or second-best path efficiency, so it does not buy success at the cost of efficiency or the reverse.
Entirely training-free across all three benchmarks — no task-specific navigation models, no fine-tuning — confirming the approach generalizes zero-shot rather than overfitting to any particular environment distribution.
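The difference between scoring destinations and scoring full paths can be shown with a few made-up numbers. This is only a sketch of the scoring idea: REST uses an LLM over a textualized tree of paths, whereas `path_score`, the room names, and the utility and information-gain values below are hypothetical.

```python
def path_score(path, dest_utility, info_gain, weight=1.0):
    """Destination utility plus the information gain collected along the path."""
    # dict.fromkeys keeps order and counts each visited region once.
    gain = sum(info_gain.get(node, 0.0) for node in dict.fromkeys(path))
    return dest_utility[path[-1]] + weight * gain


# Both candidate paths share the first segment through the hall.
paths = [["hall", "kitchen"], ["hall", "corridor", "bedroom"]]
dest_utility = {"kitchen": 0.6, "bedroom": 0.5}
info_gain = {"hall": 0.0, "corridor": 0.4, "kitchen": 0.1, "bedroom": 0.1}

by_destination = max(paths, key=lambda p: dest_utility[p[-1]])
by_full_path = max(paths, key=lambda p: path_score(p, dest_utility, info_gain))
```

Destination-only scoring picks the kitchen, while path-aware scoring prefers the bedroom route because it passes through an unexplored corridor on the way.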

Abstract

Zero-shot object-goal navigation (ZSON) requires navigating unknown environments to find a target object without task-specific training. Prior hierarchical training-free solutions invest in scene understanding and high-level decision-making, yet overlook the design of the option space. In practice, options are reduced to isolated waypoints scored independently: single destinations hide the value gathered along the journey. Our insight is that the option space should be a tree of paths. Full paths expose en-route information gain that destination-only scoring systematically neglects; a tree of shared segments enables coarse-to-fine LLM reasoning that dismisses or pursues entire branches before examining individual leaves. We instantiate this insight in REST, a training-free framework that builds an explicit open-vocabulary 3D map from online RGB-D streams, grows an agent-centric tree of safe and informative paths, and textualizes each branch into a spatial narrative for chain-of-thought LLM reasoning. Across the Gibson, HM3D, and HSSD benchmarks, REST consistently ranks among the top methods in success rate while achieving the best or second-best path efficiency.

#10

RANK

h=16

Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation

📅 2026-03-19 cs.RO · cs.AI · cs.CL · cs.CV · cs.LG 👤 N. Gopalan (h=16)

Swagat Padhan, Lakshya Jain, Bhavya Minesh Shah, Omkar Patil, Thao Nguyen

Core Contributions

State-of-the-art VLMs fail on metric-semantic queries like "two meters to the right of the fridge" because they lack mechanisms to reason about physical distances in 3D space; MAPG addresses this by decomposing such queries into semantic and metric subcomponents grounded separately.
Probabilistic composition of grounded subcomponents produces a distribution over 3D positions (not a point estimate), enabling metrically consistent decisions that propagate uncertainty appropriately through the pipeline.
Introduces MAPG-Bench, a benchmark specifically designed for metric-semantic goal grounding, addressing a gap in existing evaluations, which test semantic grounding but not physical distance reasoning in 3D scenes.
Shows consistent performance improvements over strong baselines on HM-EQA with successful real-world robot demonstrations, confirming the approach transfers beyond simulation when structured scene representations are available.
Multi-agent decomposition (multiple specialized VLM calls per query) is more expensive but substantially more reliable — an important tradeoff for deployment scenarios where grounding accuracy is safety-critical.
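For a query like "two meters to the right of the fridge", composing separately grounded components into a distribution over positions might look like the sketch below. It is an assumption-laden toy, not MAPG: hand-written likelihoods replace the VLM calls, the scene is a 2D grid, and `compose_goal_distribution`, `sigma`, `kappa`, and the choice of +x as "right" are all hypothetical.

```python
from math import exp, hypot


def compose_goal_distribution(anchor, direction, distance_m, grid, sigma=0.3, kappa=4.0):
    """Multiply a metric term and a directional term, then normalize over the grid.

    anchor:     (x, y) of the grounded reference object (the fridge)
    direction:  unit vector for the grounded spatial relation ("right of")
    distance_m: grounded metric constraint in meters
    """
    weights = {}
    for (x, y) in grid:
        dx, dy = x - anchor[0], y - anchor[1]
        r = hypot(dx, dy)
        if r == 0.0:
            continue  # direction is undefined at the anchor itself
        # Metric component: Gaussian around the requested distance.
        metric = exp(-0.5 * ((r - distance_m) / sigma) ** 2)
        # Relational component: highest when the offset points along `direction`.
        cos_angle = (dx * direction[0] + dy * direction[1]) / r
        relation = exp(kappa * (cos_angle - 1.0))
        weights[(x, y)] = metric * relation
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}


grid = [(i * 0.25, j * 0.25) for i in range(-12, 13) for j in range(-12, 13)]
posterior = compose_goal_distribution((0.0, 0.0), (1.0, 0.0), 2.0, grid)
best = max(posterior, key=posterior.get)
```

The most probable cell is two meters along the assumed "right" axis, and the rest of the distribution keeps the uncertainty that a single point estimate would discard.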

Abstract

Robots collaborating with humans must convert natural language goals into actionable, physically grounded decisions. For example, executing a command such as "go two meters to the right of the fridge" requires grounding semantic references, spatial relations, and metric constraints within a 3D scene. While recent vision language models (VLMs) demonstrate strong semantic grounding capabilities, they are not explicitly designed to reason about metric constraints in physically defined spaces. In this work, we empirically demonstrate that state-of-the-art VLM-based grounding approaches struggle with complex metric-semantic language queries. To address this limitation, we propose MAPG (Multi-Agent Probabilistic Grounding), an agentic framework that decomposes language queries into structured subcomponents and queries a VLM to ground each component. MAPG then probabilistically composes these grounded outputs to produce metrically consistent, actionable decisions in 3D space. We evaluate MAPG on the HM-EQA benchmark and show consistent performance improvements over strong baselines. We also present a real-world robot demonstration showing that MAPG transfers beyond simulation when a structured scene representation is available.

#15

RANK

h=10

DROID-SLAM in the Wild

📅 2026-03-19 cs.CV · cs.RO 👤 Marc Pollefeys (h=10)

Moyang Li, Zihan Zhu, Marc Pollefeys, Daniel Barath

Core Contributions

Traditional SLAM assumes static scenes and fails in dynamic environments; this work removes that assumption by estimating per-pixel uncertainty from multi-view visual feature inconsistency — moving objects create inconsistent cross-view features and are automatically downweighted.
Unlike prior dynamic SLAM approaches requiring predefined dynamic priors (e.g., "assume all people are dynamic"), this method handles unknown dynamic objects of any category without category-specific assumptions.
Achieves state-of-the-art camera poses and scene geometry in dynamic cluttered scenarios while running in real time at around 10 FPS, in settings where prior dynamic SLAM approaches remain limited by unknown dynamic objects or heavy clutter.
Differentiable bundle adjustment allows uncertainty estimates to co-improve with the map: better uncertainty → better poses → better map → better uncertainty, creating a self-reinforcing accuracy loop.
Open-source code and datasets released, addressing a known gap in dynamic SLAM evaluation benchmarks that has hampered reproducible comparison.
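The downweighting effect described in the first bullet can be illustrated with inverse-variance weighting on a one-dimensional toy problem. This sketch is not the paper's differentiable bundle adjustment: `weighted_shift`, the displacement values, and the rule that uncertainty equals a base value plus an inconsistency score are all assumptions made for illustration.

```python
def weighted_shift(displacements, inconsistency, base_sigma=0.1):
    """Estimate a common shift, weighting each pixel by 1 / sigma^2.

    Pixels with inconsistent multi-view features get a larger sigma and
    therefore contribute less to the estimate.
    """
    weights = [1.0 / (base_sigma + c) ** 2 for c in inconsistency]
    return sum(w * d for w, d in zip(weights, displacements)) / sum(weights)


# Four static pixels move by about 1.0; the last two sit on a moving object.
displacements = [1.0, 1.1, 0.9, 1.0, 4.0, 4.2]
inconsistency = [0.0, 0.0, 0.0, 0.0, 2.0, 2.0]

plain = sum(displacements) / len(displacements)
robust = weighted_shift(displacements, inconsistency)
```

The unweighted mean is dragged toward the moving object, while the weighted estimate stays close to the static-scene shift of 1.0, with no object category ever specified.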

Abstract

We present a robust, real-time RGB SLAM system that handles dynamic environments by leveraging differentiable Uncertainty-aware Bundle Adjustment. Traditional SLAM methods typically assume static scenes, leading to tracking failures in the presence of motion. Recent dynamic SLAM approaches attempt to address this challenge using predefined dynamic priors or uncertainty-aware mapping, but they remain limited when confronted with unknown dynamic objects or highly cluttered scenes. In contrast, our method estimates per-pixel uncertainty by exploiting multi-view visual feature inconsistency, enabling robust tracking and reconstruction even in real-world environments. The proposed system achieves state-of-the-art camera poses and scene geometry in cluttered dynamic scenarios while running in real time at around 10 FPS. Code and datasets are available at https://github.com/MoyangLi00/DROID-W.git.

VLA & Foundation Models

RANK

h=19

From Inference Efficiency to Embodied Efficiency: Revisiting Efficiency Metrics for Vision-Language-Action Models

📅 2026-03-19 cs.LG · cs.RO 👤 Chaojian Li (h=19)

Zhuofan Li, Hongkun Yang, Zhenyang Chen, Yangxuan Chen, Yingyan

Core Contributions

The central finding is counterintuitive: compression methods scoring well on standard VLA efficiency metrics (FLOPs, token throughput) often increase real end-to-end execution cost or degrade motion quality — the metrics are measuring the wrong thing.
Introduces embodied efficiency metrics — task completion time, trajectory smoothness, cumulative joint rotation, motion energy — that correlate with what actually matters for physical robot deployment rather than benchmark leaderboard scores.
Model compression and token sparsification maintain task success rates but often produce jerkier, less energy-efficient motions that would accelerate hardware wear and degrade user experience in deployment.
In-context prompting and supervised fine-tuning show only mild, metric-specific improvements in embodied efficiency — adaptation methods alone cannot close the gap between benchmark performance and real-world deployment quality.
Provides a unified evaluation protocol enabling fairer comparison of VLA models for deployment scenarios rather than benchmark-only leaderboards that systematically miss embodied execution realities.
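Two of the embodied metrics named above can be computed directly from a joint-angle trajectory. The definitions below are plausible assumptions rather than the paper's exact formulas: cumulative joint rotation is taken as summed absolute joint change, smoothness as mean squared jerk from a third finite difference, and the function names and trajectories are hypothetical.

```python
def cumulative_joint_rotation(traj):
    """Sum of absolute joint-angle changes over all steps and joints (radians)."""
    return sum(
        abs(b - a)
        for q0, q1 in zip(traj, traj[1:])
        for a, b in zip(q0, q1)
    )


def mean_squared_jerk(traj, dt):
    """Mean squared third finite difference per joint; lower means smoother."""
    total, count = 0.0, 0
    for q0, q1, q2, q3 in zip(traj, traj[1:], traj[2:], traj[3:]):
        for a, b, c, d in zip(q0, q1, q2, q3):
            jerk = (d - 3.0 * c + 3.0 * b - a) / dt ** 3
            total += jerk ** 2
            count += 1
    return total / count


dt = 0.1
# Both single-joint trajectories reach the same goal in the same time.
smooth = [[0.0], [0.25], [0.5], [0.75], [1.0]]
jerky = [[0.0], [0.5], [0.25], [0.75], [1.0]]

rotation_smooth = cumulative_joint_rotation(smooth)
rotation_jerky = cumulative_joint_rotation(jerky)
jerk_smooth = mean_squared_jerk(smooth, dt)
jerk_jerky = mean_squared_jerk(jerky, dt)
```

A success-rate metric would score these two trajectories identically, yet the jerky one rotates the joint 50% more and has nonzero jerk, which is the kind of difference the embodied metrics are meant to expose.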

Abstract

Vision-Language-Action (VLA) models have recently enabled embodied agents to perform increasingly complex tasks by jointly reasoning over visual, linguistic, and motor modalities. However, we find that the prevailing notion of "efficiency" in current VLA research, characterized by parameters, FLOPs, or token decoding throughput, does not reflect actual performance on robotic platforms. In real-world execution, efficiency is determined by system-level embodied behaviors such as task completion time, trajectory smoothness, cumulative joint rotation, and motion energy. Through controlled studies across model compression, token sparsification, and action sequence compression, we make several observations that challenge common assumptions: methods that reduce computation under conventional metrics often increase end-to-end execution cost or degrade motion quality; system-level embodied efficiency metrics reveal performance differences hidden under conventional evaluations; and common adaptation methods show only mild improvements. Our results suggest that conventional inference efficiency metrics overlook important aspects of embodied execution, and that incorporating embodied efficiency provides a more complete and fairer view of VLA model behavior.

#17

RANK

h=9

FASTER: Rethinking Real-Time Flow VLAs

📅 2026-03-19 cs.RO · cs.CV 👤 Hengshuang Zhao (h=9)

Yuxiang Lu, Zhe Liu, Xianzhe Fan, Zhenya Yang, Jinghua Hou

Core Contributions

Identifies that the standard constant schedule in flow-based VLAs forces all denoising steps to complete before any action can begin — creating systematic reaction latency that scales with action chunk length and is not addressed by asynchronous inference approaches.
Shows that reaction time follows a uniform distribution determined jointly by Time to First Action (TTFA) and execution horizon, which reframes the optimization target as minimizing TTFA rather than total compute budget.
The Horizon-Aware Schedule adaptively prioritizes near-term actions during sampling, compressing immediate-reaction denoising by 10× (to a single step in π₀.₅ and X-VLA) while preserving trajectory quality for long-horizon planning.
A streaming client-server pipeline overlaps inference of future chunks with execution of current ones, eliminating compute-execution serialization that dominates latency on consumer-grade GPUs.
Validated on dynamic table tennis requiring real-time ball tracking — a task where existing flow-based VLAs were too slow to react — demonstrating that FASTER unlocks reaction speeds previously impossible for generalist policies.
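One simple way to see why reaction time would be uniformly distributed is to simulate events arriving at random moments within an execution cycle. The model below is an assumption made for illustration and may differ from the paper's exact parameterization: the policy is taken to observe once per execution horizon, an event waits for the next observation, and the first resulting action arrives TTFA later. `sample_reaction_times` and all numeric values are hypothetical.

```python
import random


def sample_reaction_times(ttfa_s, horizon_s, n, seed=0):
    """Sample reaction times under the assumed observe-once-per-horizon model."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # The event lands at a uniformly random moment inside the current cycle.
        event_offset = rng.uniform(0.0, horizon_s)
        # It is only seen at the next observation, then TTFA passes before motion.
        wait_for_next_observation = horizon_s - event_offset
        samples.append(wait_for_next_observation + ttfa_s)
    return samples


horizon_s = 0.5
slow = sample_reaction_times(ttfa_s=0.30, horizon_s=horizon_s, n=20000)
fast = sample_reaction_times(ttfa_s=0.03, horizon_s=horizon_s, n=20000)

mean_slow = sum(slow) / len(slow)
mean_fast = sum(fast) / len(fast)
```

Under this model the samples spread evenly between TTFA and TTFA plus the horizon, so cutting TTFA shifts the whole distribution down by the same amount, which is why TTFA is the quantity worth minimizing.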

Abstract

Real-time execution is crucial for deploying Vision-Language-Action (VLA) models in the physical world. Existing asynchronous inference methods primarily optimize trajectory smoothness, but neglect the critical latency in reacting to environmental changes. By rethinking the notion of reaction in action chunking policies, this paper presents a systematic analysis of the factors governing reaction time. We show that reaction time follows a uniform distribution determined jointly by the Time to First Action (TTFA) and the execution horizon. Moreover, we reveal that the standard practice of applying a constant schedule in flow-based VLAs forces the system to complete all sampling steps before any movement can start. To overcome this issue, we propose Fast Action Sampling for ImmediaTE Reaction (FASTER). By introducing a Horizon-Aware Schedule, FASTER adaptively prioritizes near-term actions during flow sampling, compressing the denoising of the immediate reaction by tenfold into a single step. Coupled with a streaming client-server pipeline, FASTER substantially reduces the effective reaction latency on real robots. Real-world experiments, including a highly dynamic table tennis task, prove that FASTER unlocks unprecedented real-time responsiveness for generalist policies.

#20

RANK

h=7

Not All Features Are Created Equal: A Mechanistic Study of Vision-Language-Action Models

📅 2026-03-19 cs.RO 👤 Xijia Zhao (h=7)

Bryce Grant, Xijia Zhao, Peng Wang

Core Contributions

Through activation injection across 394,000+ rollout episodes on six VLA models (80M–7B parameters), the paper shows visual pathways dominate action generation: injecting baseline activations into null-prompt episodes recovers near-identical behavior, revealing that language prompts are frequently ignored.
Cross-task activation injection steers robots toward the source task's spatial positions with 99.8% trajectory alignment in X-VLA — revealing VLA motor programs are spatially bound to scene coordinates, not abstract multi-modal task representations.
Language sensitivity depends on task structure, not model design: when visual context uniquely specifies the task, language is ignored; when multiple goals share a scene, language becomes essential (94%→10% success under wrong prompts in libero_goal vs. 60–100% regardless in libero_object).
In multi-pathway architectures (π₀.₅, SmolVLA, GR00T), expert pathways encode motor programs while VLM pathways encode goal semantics — subspace injection confirms these occupy separable activation subspaces with 2× greater behavioral displacement from expert-pathway injection.
Releases Action Atlas (action-atlas.com) for interactive exploration of VLA representations across all six models — enabling the community to build on these mechanistic insights without re-running the expensive 394,000-episode analysis.
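For readers unfamiliar with the term, activation injection means running a model on one input while overwriting some internal activations with values recorded from another run. The toy below only illustrates that mechanic: the two-unit `forward` function and its weights are hypothetical and hard-coded so that the vision-driven unit dominates, so the outcome holds by construction and is not evidence for the paper's findings.

```python
def forward(image_feat, prompt_feat, injected=None):
    """Tiny two-unit 'policy': hidden[0] is driven by vision, hidden[1] by language.

    `injected` maps hidden-unit indices to values that overwrite the
    activations computed from the current input.
    """
    hidden = [2.0 * image_feat, 0.5 * prompt_feat]
    if injected:
        for idx, value in injected.items():
            hidden[idx] = value
    action = hidden[0] + hidden[1]
    return action, hidden


# Record activations on a source episode.
source_action, source_hidden = forward(image_feat=1.0, prompt_feat=1.0)

# Run a different target episode, with and without injecting the source's
# vision-driven activation.
target_action, _ = forward(image_feat=3.0, prompt_feat=-1.0)
injected_action, _ = forward(
    image_feat=3.0, prompt_feat=-1.0, injected={0: source_hidden[0]}
)
```

Comparing `injected_action` with `source_action` and `target_action` measures how far the overwritten activation pulls behavior toward the source episode, which is the kind of displacement the study quantifies on real models.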

Abstract

Vision-Language-Action (VLA) models combine perception, language, and motor control in a single architecture, yet how they translate multimodal inputs into actions remains poorly understood. We apply activation injection, sparse autoencoders (SAEs), and linear probes to six models spanning 80M–7B parameters across 394,000+ rollout episodes on four benchmarks. The visual pathway dominates action generation across all architectures: injecting baseline activations into null-prompt episodes recovers near-identical behavior, while cross-task injection steers robots toward source-task positions (99.8% of X-VLA episodes align with the source trajectory), exposing spatially bound motor programs tied to scene coordinates rather than abstract task representations. Language sensitivity depends on task structure, not model design: when visual context uniquely specifies the task, language is ignored; when multiple goals share a scene, language becomes essential. In all three multi-pathway architectures, expert pathways encode motor programs while VLM pathways encode goal semantics. We release Action Atlas for interactive exploration of VLA representations across all six models.