Daily curated robotics research, ranked by author h-index
📅 2026-03-25 · 📄 30 Papers · 🗂 8 Research Areas · 🤖 Generated by Claude
🔭 Research Landscape
Overview — March 25, 2026
Today's batch of thirty papers is dominated by a tension between capability and safety that cuts across nearly every subfield. Safe Sequential-AMPC (Paper 1) and Off-Policy Safe RL (Paper 2) both attack the fundamental challenge of deploying learned controllers on real hardware without costly constraint violations, while SafeFlow (Paper 26) and the Spline-Based Motion Planner (Paper 22) address the same concern from the trajectory-generation and whole-body-control angles respectively. What unites these works is a shared recognition that performance benchmarks in simulation are no longer the bottleneck — the hard problem is trustworthy behavior under distribution shift. Notably, none of these papers simply adds a post-hoc safety filter; each integrates safety reasoning directly into the learning or planning objective, suggesting the field is moving from "train then constrain" toward "safety-aware training."
A second dominant thread is the integration of large language and foundation models into core robotics pipelines — but with increasing sophistication about how they are used. LATS (Paper 12) uses an LLM as a training-time teacher rather than a deployed policy component, avoiding inference latency. Object Search (Paper 23) treats LLM outputs as probabilistic priors within a Bayesian planner, preserving formal search guarantees. 3D-Mix (Paper 20) and TAG (Paper 24) both retrofit spatial awareness onto existing VLA models without retraining, turning the 3D-blindness of pretrained VLAs from a fundamental limitation into a tractable plug-in problem. Collectively, these papers argue for a modular paradigm: foundation models as knowledge sources and teachers, classical or learned control for execution — rather than end-to-end neural policies for everything.
The third notable pattern is the surge in humanoid and quadruped embodiment research. PCHC (Paper 9), MIRROR (Paper 16), SafeFlow (Paper 26), and QuadFM (Paper 25) all address different facets of controlling highly dexterous, articulated bodies in real time. The common challenge they expose is the mismatch between the expressive space of desired behaviors (described in natural language or demonstrated by humans) and the constrained dynamics of physical hardware. QuadFM's emotionally expressive quadruped dataset and MIRROR's parallel IK solver approach this from opposite ends: the former expands what robots should express, while the latter ensures they can track what humans demonstrate. Multi-agent coordination also features heavily, with traffic signal control appearing three times (Papers 6, 12, and tangentially 7), suggesting this domain has become a canonical benchmark for scalable multi-agent policy learning, analogous to Atari for single-agent RL.
Bin Yu, Shijie Lian, Xiaopeng Lin, Zhaolong Shen, Yuliang Wei
Core Contributions
VLA models trained overwhelmingly on 2D image-text data exhibit poor spatial depth reasoning; 3D-Mix addresses this by injecting VGGT-derived 3D geometric tokens into the VLA attention stream without retraining the base model.
Unlike prior 3D-aware approaches that require full retraining or dataset augmentation with 3D annotations, the plug-in design retrofits any existing deployed VLA at inference time at modest compute overhead.
Demonstrates that spatial failures in commercial VLAs (e.g., incorrect depth ordering in pick-and-place) are not fundamental architectural limits but fixable through geometry injection, reframing 3D-blindness as an addressable gap rather than a design flaw.
Validates across multiple VLA architectures, showing consistent manipulation improvement in scenarios where 2D ambiguity causes grasp failures, particularly on depth-sensitive tasks like stacking and insertion.
Vision-Language-Action (VLA) models leverage Multimodal Large Language Models (MLLMs) for robotic control, but recent studies reveal that MLLMs exhibit limited spatial intelligence due to training predominantly on 2D data, resulting in inadequate 3D perception for manipulation tasks. While recent approaches to improving 3D spatial understanding in MLLMs exist, applying them directly to VLA models for manipulation remains challenging due to the risk of degrading existing capabilities. We propose 3D-Mix, a plug-and-play module that integrates VGGT-based 3D information into VLA models. By extracting rich 3D geometric features using a pre-trained 3D foundation model (VGGT) and injecting these as additional tokens, 3D-Mix provides enhanced spatial awareness with minimal computational overhead. Our approach preserves the original VLA capabilities while demonstrating consistent improvements across multiple VLA architectures on 3D-aware manipulation benchmarks.
Jiaying Zhou, Zhihao Zhan, Ruifeng Zhai, Qinhan Lyu, Hao Liu
Core Contributions
VLA policies degrade in cluttered scenes not from task misunderstanding but because distractor objects hijack the attention mechanism; TAG injects a target-agnostic spatial heatmap that steers attention toward manipulation-relevant regions without relying on object class labels.
Unlike distractor-robustness methods requiring explicit distractor annotations at training time, TAG's guidance signal is learned without distractor labels and generalizes zero-shot to novel distractor configurations never seen during training.
Achieves consistent improvement across RT-2 and OpenVLA backbone architectures, confirming that distractor sensitivity is an architectural property of VLAs broadly rather than a training-data deficiency specific to one model.
The "target-agnostic" framing is a key insight: guiding attention toward categories of manipulable regions (rather than specific target objects) allows the guidance signal to transfer across tasks and object sets.
Vision-Language-Action (VLA) policies have shown strong progress in mapping language instructions and visual observations to robotic actions, yet their reliability degrades in cluttered scenes with distractors. By analyzing failure cases, we find that many errors do not arise from infeasible motions but from unstable object-centric inference: the policy attends to the wrong objects when distractors are present. We propose TAG, a Target-Agnostic Guidance module that provides a spatial attention bias toward candidate manipulation regions without requiring target object identity. TAG is trained without distractor annotations and applied at inference time as a plug-in to existing VLA models. Experiments across RT-2 and OpenVLA show consistent improvements in cluttered tabletop manipulation, with particularly large gains in multi-distractor settings where baseline policies frequently fail.
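As a rough illustration of what a spatial attention bias can look like, the sketch below adds a heatmap-derived term to attention logits before the softmax. This is an assumption about the general mechanism, not TAG's published implementation: the function names, the additive form, and the `strength` parameter are all hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a flat list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention(logits: List[float], region_bias: List[float],
                     strength: float = 1.0) -> List[float]:
    """Attention weights after adding a spatial bias to the raw logits.

    `region_bias` stands in for a guidance heatmap sampled at each image
    patch (hypothetical); larger values mark candidate manipulation regions.
    """
    if len(logits) != len(region_bias):
        raise ValueError("logits and region_bias must have the same length")
    return softmax([l + strength * b for l, b in zip(logits, region_bias)])


# Two patches with equal raw scores; patch 1 lies in a candidate region.
weights = biased_attention([1.0, 1.0], [0.0, math.log(3.0)])
```

With equal raw scores, a bias of ln 3 on the second patch moves the attention split from one half each to one quarter versus three quarters, and no object class label is needed to do so.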
Li Gao, Fuzhi Yang, Jianhui Chen, Liu Liu, Yao Zheng
Core Contributions
Existing quadruped datasets focus narrowly on locomotion gaits; QuadFM combines diverse locomotion, emotionally expressive behaviors (fear, curiosity, joy), and rich language semantics in a single dataset—a configuration analogous to HumanML3D but for four-legged robots.
The language-paired motion annotations enable text-to-motion generation for quadrupeds via diffusion models conditioned on natural language, opening quadruped motion to the same foundation-model treatment recently applied to humanoid motion.
Emotionally expressive behaviors are framed not as entertainment add-ons but as high-bandwidth human-robot communication channels—a robot expressing curiosity or hesitation conveys intent without verbal explanation.
Pretraining on QuadFM transfers better to downstream locomotion tasks than gait-only datasets, suggesting that expressive behavioral diversity serves as a useful representation learning signal even for purely functional tasks.
Despite significant advances in quadrupedal robotics, a critical gap persists in foundational motion resources that holistically integrate diverse locomotion, emotionally expressive behaviors, and rich language semantics, all of which are essential for agile, intuitive human-robot interaction. Current quadruped motion datasets are either narrowly focused on locomotion gaits or lack the language grounding needed for text-driven control. We present QuadFM, a large-scale dataset pairing quadruped motion sequences with natural language descriptions, covering a spectrum from functional locomotion patterns to emotionally expressive behaviors. QuadFM enables text-to-motion generation via diffusion models and improves downstream transfer for locomotion policy learning, suggesting that behavioral breadth provides useful priors even for functional tasks.
Fengkai Liu, Hao Su, Haozhuang Chi, Rui Geng, Congzhi Ren
Core Contributions
Rather than waiting for explicit human instructions, this system detects task-state change events (object placed, tool picked up) and uses grounded vision-language planning to infer the next helpful action—the distinction between reactive assistance and proactive partnership.
The event-driven architecture decouples perception (detecting state transitions) from planning (inferring intent), enabling fast response to task events without running the full VLP at every frame.
User studies show approximately 40% reduction in instruction-giving load in collaborative tasks while maintaining task success rates, quantifying the concrete ergonomic benefit of proactive versus reactive assistance.
Requires the robot to maintain an implicit model of human goal state—a step toward robots as teammates that anticipate needs rather than tools that await commands, with implications for industrial and household assistance.
Assistance in collaborative manipulation is often initiated by user instructions, making high-level reasoning request-driven. In fluent human teamwork, however, partners often infer the next helpful step from the observed outcome of an action rather than waiting for instructions. Motivated by this, we propose an event-driven proactive assistive manipulation framework. The system detects task-state change events from visual observations and uses grounded vision-language planning to infer and execute the next helpful action without explicit instruction. Evaluated on collaborative household and assembly tasks, our approach reduces the instruction burden on human partners by approximately 40% while maintaining task completion rates comparable to fully instruction-driven baselines.
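The event-driven decoupling described above can be sketched as a loop that invokes the planner only when the observed task state changes. The symbolic state encoding, `run_event_driven`, and `toy_planner` are hypothetical stand-ins; the paper's detector works on visual observations and its planner is a grounded vision-language model.

```python
from typing import Callable, FrozenSet, Iterable, List, Optional

State = FrozenSet[str]  # hypothetical symbolic task state, e.g. {"lid_on_jar"}


def run_event_driven(frames: Iterable[State],
                     plan_next_action: Callable[[State, State], str]) -> List[str]:
    """Call the (expensive) planner only when the task state changes."""
    actions: List[str] = []
    previous: Optional[State] = None
    for current in frames:
        if previous is not None and current != previous:
            # A task-state change event: ask the planner for the next helpful step.
            actions.append(plan_next_action(previous, current))
        previous = current
    return actions


def toy_planner(before: State, after: State) -> str:
    """Stand-in for a vision-language planner; it only reports the state diff."""
    added = sorted(after - before)
    removed = sorted(before - after)
    return f"respond to added={added} removed={removed}"


frames = [
    frozenset({"lid_on_jar"}),
    frozenset({"lid_on_jar"}),
    frozenset({"lid_removed"}),
    frozenset({"lid_removed"}),
]
actions = run_event_driven(frames, toy_planner)
```

Four frames produce a single planner call here, which is the point of the design: per-frame perception stays cheap, and high-level reasoning runs only at events.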
Mihaela-Larisa Clement, Mónika Farsang, Agnes Poks, Johannes Edelmann, Manfred Plöchl
Core Contributions
Approximating NMPC with neural networks typically demands large expert datasets and costly training, a bottleneck that can make learned MPC economically unviable; Sequential-AMPC shares parameters across the prediction horizon, substantially reducing the number of expert rollouts required.
A safety-augmented fallback mechanism wraps the policy in an online feasibility evaluator: when the learned policy produces infeasible candidate sequences, it automatically defers to the original NLP solver, providing hard safety guarantees without manual intervention.
On high-dimensional systems, the sequential horizon-sharing architecture converges faster and more stably than naive feedforward baselines, suggesting that explicit temporal structure in the network matters for control—not just accuracy, but learning dynamics.
Closes a deployment gap: prior learned NMPC approximations required so much data collection and training that cost savings over online NLP-solving were debatable; Sequential-AMPC shifts the trade-off firmly in favor of the learned approach.
The practical deployment of nonlinear model predictive control (NMPC) is often limited by online computation: solving a nonlinear program at high control rates can be expensive on embedded hardware, especially when models are complex or horizons are long. Learning-based NMPC approximations shift this computation offline but typically demand large expert datasets and costly training. We propose Sequential-AMPC, a sequential neural policy that generates MPC candidate control sequences by sharing parameters across the prediction horizon. For deployment, we wrap the policy in a safety-augmented online evaluation and fallback mechanism, yielding Safe Sequential-AMPC. Compared to a naive feedforward policy baseline across several benchmarks, Sequential-AMPC requires substantially fewer expert MPC rollouts and yields candidate sequences with higher feasibility rates and improved closed-loop safety. On high-dimensional systems, it also exhibits better learning dynamics and performance in fewer epochs while maintaining stable validation improvement where the feedforward baseline can stagnate.
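The evaluate-then-fall-back pattern can be sketched on a toy scalar system. Everything concrete here is an assumption for illustration: the dynamics `x + u`, the bounds `X_MAX` and `U_MAX`, the constant `learned_policy`, and the clipped proportional `reference_solver` standing in for an NLP solver. It is not the paper's implementation.

```python
from typing import List, Tuple

X_MAX = 1.0  # hypothetical state bound: |x| <= X_MAX
U_MAX = 0.5  # hypothetical input bound: |u| <= U_MAX


def rollout_is_feasible(x0: float, controls: List[float]) -> bool:
    """Simulate x_{k+1} = x_k + u_k and check input and state bounds."""
    x = x0
    for u in controls:
        if abs(u) > U_MAX:
            return False
        x = x + u
        if abs(x) > X_MAX:
            return False
    return True


def learned_policy(x0: float, horizon: int = 3) -> List[float]:
    """Stand-in for a learned policy: proposes a fixed control sequence."""
    return [0.4] * horizon


def reference_solver(x0: float, horizon: int = 3) -> List[float]:
    """Stand-in for the NLP solver: clipped steps that drive x toward zero."""
    controls, x = [], x0
    for _ in range(horizon):
        u = max(-U_MAX, min(U_MAX, -x))
        controls.append(u)
        x += u
    return controls


def safe_step(x0: float) -> Tuple[float, str]:
    """Apply the learned candidate if it passes the check, else fall back."""
    candidate = learned_policy(x0)
    if rollout_is_feasible(x0, candidate):
        return candidate[0], "policy"
    return reference_solver(x0)[0], "fallback"
```

Starting at `x0 = -0.5` the candidate stays within bounds and is used directly; starting at `x0 = 0.5` its second step leaves the state bound, so the wrapper defers to the solver stand-in.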
Guopeng Li, Matthijs T. J. Spaan, Julian F. P. Kooij
Core Contributions
Off-policy safe RL offers high sample efficiency but suffers from distribution shift between behavior and target policies, causing constraint violations; this work addresses that gap with Constrained Optimistic Exploration (COE), which maintains pessimistic estimates of constraint costs to bound violations under distribution shift.
Unlike on-policy safe RL methods that sacrifice sample efficiency to maintain safety, COE leverages replay buffers while maintaining formal safety guarantees—achieving a previously elusive combination.
Substantially reduces constraint violations during both data collection and deployment compared to SAC-Lagrangian and other off-policy baselines, without significant reward penalty—showing safety and performance are not fundamentally at odds.
The pessimistic cost estimation technique is broadly applicable: any off-policy algorithm that needs safety constraints can incorporate COE's cost bound without changing the reward optimization structure.
When safety is formulated as a limit of cumulative cost, safe reinforcement learning (RL) aims to learn policies that maximize return subject to the cost constraint in data collection and deployment. Off-policy safe RL methods, although offering high sample efficiency, suffer from constraint violations due to distribution shift between the behavior and target policies. We propose Constrained Optimistic Exploration (COE), an off-policy safe RL method that maintains pessimistic estimates of constraint costs under distribution shift, bounding violations during both data collection and deployment. COE achieves competitive reward maximization while substantially reducing constraint violations compared to SAC-Lagrangian and other off-policy baselines, combining the sample efficiency of off-policy learning with safety guarantees previously associated only with on-policy methods.
Hanbyel Cho, Sang-Hun Kim, Jeonguk Kang, Donghan Koo
Core Contributions
Kinematics-only text-to-motion generators produce trajectories that look plausible in simulation but fail on real hardware because actuator dynamics are ignored; SafeFlow conditions the rectified flow generation process on the robot's actual dynamics model, yielding physically trackable trajectories from the first generation attempt.
A selective safety gating mechanism identifies high-risk motion segments at inference time and applies additional dynamics constraints only where needed—avoiding the over-conservatism that plagues uniform safety filtering approaches.
Operates at real-time rates (>10 Hz generation), making it practical for interactive text-driven control rather than offline synthesis—a critical threshold for deployment in responsive human-robot interaction scenarios.
Bridges the text-to-motion and whole-body control literature, which have largely evolved independently; the physics-guidance in the generation model replaces the need for a separate downstream motion tracker to compensate for physical infeasibility.
Recent advances in real-time interactive text-driven motion generation have enabled humanoids to perform diverse behaviors. However, kinematics-only generators often exhibit physical hallucinations, producing motion trajectories that are physically infeasible to track with a downstream motion tracking controller. We propose SafeFlow, a real-time text-driven humanoid whole-body control framework that integrates physics guidance directly into a rectified flow generator. A selective safety gating mechanism applies physics constraints only to identified high-risk segments, preserving expressiveness while ensuring trackability. SafeFlow generates motions at over 10 Hz and demonstrates significantly reduced tracking error and constraint violations on a real humanoid platform compared to kinematics-only baselines.
Huanyu Li, Dewei Wang, Xinmiao Wang, Xinzhe Liu, Peng Liu
Core Contributions
Humanoid RL controllers lock in speed-energy-safety trade-offs at training time, requiring retraining whenever operational requirements change; PCHC conditions a single policy on a continuous preference vector, enabling runtime trade-off adjustment without retraining.
Multi-objective RL training learns a policy that spans the Pareto frontier of competing objectives—demonstrated on a full humanoid with speed, energy, and balance objectives that operators can blend in real time via a slider interface.
A single PCHC policy replaces a family of fixed-objective policies, dramatically reducing deployment complexity for robot fleets that need different behavioral modes (sprint, patrol, energy-saving) on the same hardware.
Shows that preference conditioning does not degrade peak performance on any individual objective compared to single-objective baselines—the multi-objective policy is Pareto-competitive, not a compromise average.
Humanoid robots often need to balance competing objectives, such as maximizing speed while minimizing energy consumption. While current reinforcement learning (RL) methods can master complex skills like fall recovery and perceptive locomotion, they are constrained by fixed weighting strategies that cannot adapt to varying operational requirements. We propose PCHC, a multi-objective RL framework that conditions humanoid locomotion policies on a continuous preference vector at inference time. PCHC learns a single policy spanning the Pareto frontier of speed, energy, and safety objectives, enabling operators to adjust trade-offs in real time. The conditioned policy matches single-objective baselines on individual objectives while providing the flexibility to interpolate between behavioral modes without retraining.
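One common way to condition a policy on preferences is to append a normalized weight vector to the observation and train on the matching weighted sum of objective rewards. The sketch below shows that bookkeeping only. Linear scalarization and all function names are assumptions, since the summary above does not specify how PCHC combines its objectives.

```python
from typing import List, Sequence


def normalize_preference(raw: Sequence[float]) -> List[float]:
    """Map non-negative slider values to weights that sum to one."""
    if any(w < 0 for w in raw):
        raise ValueError("preference weights must be non-negative")
    total = sum(raw)
    if total <= 0:
        raise ValueError("at least one preference weight must be positive")
    return [w / total for w in raw]


def scalarize(rewards: Sequence[float], preference: Sequence[float]) -> float:
    """Weighted sum of per-objective rewards (linear scalarization, assumed)."""
    return sum(w * r for w, r in zip(preference, rewards))


def conditioned_observation(obs: Sequence[float],
                            preference: Sequence[float]) -> List[float]:
    """Policy input: the raw observation with the preference vector appended."""
    return list(obs) + list(preference)


# Hypothetical objective order: speed, energy, safety.
preference = normalize_preference([2.0, 1.0, 1.0])
reward = scalarize([1.0, -2.0, 0.5], preference)
policy_input = conditioned_observation([0.1, 0.2], preference)
```

Because the preference is part of the input, changing the slider at run time changes the policy's behavior without touching its weights.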
Standard differential IK for humanoid retargeting is basin-dependent—small numerical errors converge to distant local minima, producing jerky or unsafe motions; MIRROR runs multiple IK solvers in parallel with different initializations and selects the best at each timestep, avoiding local minima without increasing latency.
Achieves real-time performance (>50 Hz) while satisfying joint limit and self-collision constraints that single-IK approaches routinely violate under kinematic redundancy—critical for hardware-safe teleoperation.
Unlike learning-based retargeting that requires per-motion or per-operator training data, MIRROR works zero-shot for arbitrary human motions by improving the IK algorithm itself, removing the data collection bottleneck for new operators.
The parallel IK architecture naturally produces a distribution of candidate solutions at each step, which could be exploited for downstream uncertainty estimation or graceful degradation under joint failure—an unexplored capability left as future work.
Real-time humanoid teleoperation requires inverse kinematics (IK) solvers that are both responsive and constraint-safe under kinematic redundancy and self-collision constraints. While differential IK enables efficient online retargeting, its locally linearized updates are inherently basin-dependent and prone to converging to poor local solutions during rapid human motions. MIRROR addresses this by running parallel differential IK solvers with diverse initializations simultaneously, selecting the best feasible solution at each control step. Operating above 50 Hz on standard hardware, MIRROR satisfies joint limit and self-collision constraints where single-initialization baselines fail, and requires no operator-specific training data, enabling zero-shot retargeting for arbitrary human demonstrators.
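The multi-initialization idea can be sketched on a planar two-link arm: run damped least-squares differential IK from several seeds and keep the candidate with the lowest task error. The arm, the joint limits, the damping value, and the sequential loop (a real system would run the solvers in parallel) are all assumptions for illustration, not MIRROR's implementation.

```python
import math
from typing import List, Sequence, Tuple

L1, L2 = 1.0, 1.0                               # hypothetical link lengths
LIMITS = [(-math.pi, math.pi), (0.0, math.pi)]  # hypothetical joint limits


def fk(q: Sequence[float]) -> Tuple[float, float]:
    """End-effector position of the planar two-link arm."""
    return (L1 * math.cos(q[0]) + L2 * math.cos(q[0] + q[1]),
            L1 * math.sin(q[0]) + L2 * math.sin(q[0] + q[1]))


def task_error(q: Sequence[float], target: Tuple[float, float]) -> float:
    x, y = fk(q)
    return math.hypot(target[0] - x, target[1] - y)


def dls_step(q: Sequence[float], target: Tuple[float, float],
             damping: float = 0.1) -> List[float]:
    """One damped least-squares update, clamped to the joint limits."""
    x, y = fk(q)
    ex, ey = target[0] - x, target[1] - y
    s1, c1 = math.sin(q[0]), math.cos(q[0])
    s12, c12 = math.sin(q[0] + q[1]), math.cos(q[0] + q[1])
    j11, j12 = -L1 * s1 - L2 * s12, -L2 * s12
    j21, j22 = L1 * c1 + L2 * c12, L2 * c12
    # A = J J^T + damping^2 I is symmetric positive definite, so det > 0.
    a11 = j11 * j11 + j12 * j12 + damping ** 2
    a12 = j11 * j21 + j12 * j22
    a22 = j21 * j21 + j22 * j22 + damping ** 2
    det = a11 * a22 - a12 * a12
    vx = (a22 * ex - a12 * ey) / det   # v = A^{-1} e
    vy = (-a12 * ex + a11 * ey) / det
    dq = (j11 * vx + j21 * vy, j12 * vx + j22 * vy)  # dq = J^T v
    return [min(max(qi + dqi, lo), hi)
            for qi, dqi, (lo, hi) in zip(q, dq, LIMITS)]


def solve_from(seed: Sequence[float], target: Tuple[float, float],
               iters: int = 50) -> List[float]:
    q = list(seed)
    for _ in range(iters):
        q = dls_step(q, target)
    return q


def multi_start_ik(seeds: Sequence[Sequence[float]],
                   target: Tuple[float, float]) -> List[float]:
    """Solve from every seed and keep the candidate with the lowest error."""
    candidates = [solve_from(seed, target) for seed in seeds]
    return min(candidates, key=lambda q: task_error(q, target))
```

Selecting among candidates can never do worse than any single seed on the chosen metric, which is why a poor basin for one initialization does not have to decide the output.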
Davood Soleymanzadeh, Ivan Lopez-Sanchez, Hao Su, Yunzhu Li, Xiao Liang
Core Contributions
State-of-the-art manipulation policies fail in cluttered environments specifically because they decouple task planning from motion planning—the high-level policy cannot reason about physical reachability, causing commitments to impossible actions; this paper provides a systematic analysis of where and why this decoupling fails.
Proposes architectural design principles for neural motion planners tightly integrated with task policies, arguing that reachability-aware planning must be an internal capability, not an external module.
Identifies training data diversity as the primary bottleneck: neural planners overfit to obstacle configurations in their training distribution rather than learning generalizable collision avoidance, unlike classical planners that handle novel environments by construction.
Provides a benchmark comparison exposing the specific failure modes of neural versus classical planners in cluttered manipulation—quantifying the gap the field needs to close rather than reporting average-case improvements that obscure worst-case behavior.
State-of-the-art generalist manipulation policies have enabled the deployment of robotic manipulators in unstructured human environments. However, these frameworks struggle in cluttered environments primarily because they utilize auxiliary modules for low-level motion planning and control. Motion planning is typically treated as a downstream post-processing step, preventing policies from reasoning about reachability during high-level decision making. This paper analyzes the challenges of building generalist neural motion planners, identifies training data diversity as the primary bottleneck, and proposes architectural principles for integrating motion planning capabilities directly within manipulation policies. We provide benchmarks exposing specific failure modes and outline a research roadmap toward neural planners that generalize across manipulators, environments, and task distributions.
Existing world-model driving planners build latent spaces that conflate spatial and temporal information, making them poor predictors of how the environment evolves under control inputs; Latent-WAM explicitly disentangles spatially-aware encoding from dynamics-informed temporal modeling.
The spatially-aware encoder preserves occupancy grid structure in the latent space so the model can reason about distances and collision geometry, while a dynamics-informed module models environment evolution under ego-vehicle actions.
Outperforms prior world-model-based planners on standard driving benchmarks with a more efficient architecture, demonstrating that the performance gain comes from representational structure rather than model scale.
The spatial-temporal disentanglement principle is architecture-agnostic and broadly applicable—any world model for robotics that must reason about both current state geometry and future state evolution stands to benefit.
We introduce Latent-WAM, an efficient end-to-end autonomous driving framework that achieves strong trajectory planning through spatially-aware and dynamics-informed latent world representations. Existing world-model-based planners suffer from inadequately compressed representations, limited spatial reasoning, and poor dynamics modeling. Latent-WAM addresses these issues through a spatially-aware encoder that preserves occupancy grid structure in latent space and a dynamics-informed representation module that models environment evolution under ego-vehicle actions. Latent-WAM achieves competitive performance on standard driving benchmarks with improved computational efficiency over prior world-model planners, validating that structured latent representations outperform generic compression approaches for action-conditioned prediction.
Visual-inertial navigation filters are notoriously inconsistent—they overestimate their own accuracy and diverge; equivariant filters (EqFs) solve this but the choice of symmetry group affects both consistency and computational cost in ways that are not well understood; this paper provides analytical mappings between EqFs under different symmetries.
By establishing systematic conversions between symmetry choices, practitioners can select the symmetry yielding the sparsest Jacobians for their hardware, achieving up to 3× faster evaluation without sacrificing the consistency guarantees of EqFs.
The framework resolves a practical barrier to EqF adoption: previously, choosing a symmetry required deep mathematical expertise; the transformation mappings allow engineers to compare options systematically against hardware constraints.
Demonstrates that consistency design and efficient implementation—previously treated as separate concerns—can be co-optimized through the choice of symmetry group, unifying two threads of VIO research.
This paper presents an equivariant filter (EqF) transformation approach for visual-inertial navigation. By establishing analytical links between EqFs with different symmetries, the proposed approach enables systematic consistency design and efficient implementation. First, we formalize the mapping between EqFs under different symmetry groups, showing that filters with different symmetries are analytically related through coordinate transformations. Second, we exploit these relationships to design EqFs with sparser Jacobian structures, achieving up to 3x faster evaluation while maintaining theoretical consistency guarantees. Experiments on standard VIO benchmarks confirm that the proposed approach matches the accuracy of existing EqF methods with significantly reduced computational cost.
Standard time-optimal motion planning formulates a single large Optimal Control Problem (OCP) that is intractable for complex environments; this paper decomposes it into smaller sequential subproblems along the spline, achieving approximately 10× speedup on mobile robot benchmarks.
Unlike prior approaches that check safety only at discrete waypoints—leaving gaps between checks where collisions can occur—this method uses interval arithmetic to provide continuous safety certificates over the entire trajectory.
The continuous safety guarantee is critical for certification in safety-critical applications: point-wise collision checks are provably insufficient under real kinematic constraints, a gap this work closes rigorously.
Handles non-differentially flat systems—a broader class than most time-optimal planners which restrict to flat systems for computational tractability—extending the approach to systems like underactuated vehicles and manipulators with complex dynamics.
Generating time-optimal, collision-free trajectories for autonomous mobile robots involves a fundamental trade-off between guaranteeing safety and managing computational complexity. State-of-the-art approaches formulate spline-based motion planning as a single Optimal Control Problem (OCP) but often rely on point-wise safety constraints that cannot guarantee continuous collision avoidance. We propose an accelerated decomposition approach that solves sequential subproblems along the spline, achieving approximately 10x speedup over monolithic OCP formulations. Safety is guaranteed continuously through interval arithmetic bounding, providing provably collision-free trajectories under real kinematic constraints. The approach handles non-differentially flat systems and is validated on mobile robot benchmarks requiring navigation through cluttered environments.
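To make the contrast with point-wise checks concrete, the sketch below bounds a polynomial trajectory segment over a whole time interval with interval arithmetic and certifies clearance from a circular obstacle, subdividing when the bound is too loose. The `Interval` class, the polynomial segment, and the circular obstacle are assumptions for illustration. The sketch also ignores floating-point outward rounding, which a rigorous implementation would need.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other: "Interval") -> "Interval":
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other: "Interval") -> "Interval":
        p = (self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi)
        return Interval(min(p), max(p))

    def shift(self, c: float) -> "Interval":
        return Interval(self.lo + c, self.hi + c)

    def square(self) -> "Interval":
        if self.lo >= 0:
            return Interval(self.lo ** 2, self.hi ** 2)
        if self.hi <= 0:
            return Interval(self.hi ** 2, self.lo ** 2)
        return Interval(0.0, max(self.lo ** 2, self.hi ** 2))


def poly_range(coeffs: Sequence[float], t: Interval) -> Interval:
    """Enclose c0 + c1*t + c2*t^2 + ... over the interval t (Horner form)."""
    acc = Interval(coeffs[-1], coeffs[-1])
    for c in reversed(coeffs[:-1]):
        acc = (acc * t).shift(c)
    return acc


def certified_clear(px: Sequence[float], py: Sequence[float],
                    center: Tuple[float, float], radius: float,
                    t0: float, t1: float, depth: int = 12) -> bool:
    """True only if the whole segment on [t0, t1] provably avoids the circle.

    False means "could not certify", not "proven collision".
    """
    t = Interval(t0, t1)
    dx = poly_range(px, t).shift(-center[0])
    dy = poly_range(py, t).shift(-center[1])
    dist_sq = dx.square() + dy.square()
    if dist_sq.lo > radius ** 2:
        return True
    if depth == 0:
        return False
    mid = 0.5 * (t0 + t1)
    return (certified_clear(px, py, center, radius, t0, mid, depth - 1)
            and certified_clear(px, py, center, radius, mid, t1, depth - 1))


# Straight segment x(t) = t, y(t) = 0 for t in [0, 1].
px, py = [0.0, 1.0], [0.0]
far_ok = certified_clear(px, py, (0.5, 1.0), 0.5, 0.0, 1.0)
near_ok = certified_clear(px, py, (0.5, 0.2), 0.5, 0.0, 1.0)
```

Checking only the endpoints `t = 0` and `t = 1` would accept the second obstacle, since both endpoints lie outside the circle even though the segment passes through it. The interval check refuses to certify that case.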
Abhishek Paudel, Abhish Khanal, Raihan I. Arnob, Shahriar Hossain, Gregory J. Stein
Core Contributions
Most LLM-informed robotics uses the LLM as an action oracle or reward designer; this work takes a more conservative approach—the LLM estimates prior probabilities of object locations, which a model-based planner refines through Bayesian updating, preserving formal search guarantees while capturing commonsense knowledge.
A novel prompt selection mechanism chooses the most informative LLM query based on current uncertainty, substantially reducing API calls while maintaining search quality—critical for real-time operation where LLM query latency is a bottleneck.
Outperforms purely model-based search (which lacks commonsense priors) and purely LLM-guided search (which lacks formal completeness) in partially-known environments, demonstrating that the hybrid approach is stronger than either component alone.
The Bayesian framing treats LLM outputs as uncertain priors rather than ground truth—a key epistemological shift that makes the system robust to LLM hallucinations that would cause pure LLM planners to fail.
We present a novel LLM-informed model-based planning framework, and a novel prompt selection method, for object search in partially-known environments. Our approach uses an LLM to estimate statistics about the likelihood of finding the target object when searching various locations throughout the scene. These LLM-estimated statistics are incorporated as prior probabilities into a Bayesian model-based planner that updates beliefs as the robot explores. A prompt selection mechanism identifies which LLM queries are most informative given current uncertainty, reducing API calls while maintaining search quality. Our approach outperforms both purely model-based and purely LLM-guided search on partially-known environment benchmarks, demonstrating that LLMs and formal planners are most powerful in combination rather than in isolation.
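Treating LLM estimates as priors rather than ground truth can be sketched with a textbook Bayesian update after an unsuccessful search. The prior values, the location names, and the `detection_prob` parameter are hypothetical, and the paper's planner is richer than this single update rule.

```python
from typing import Dict


def update_after_miss(belief: Dict[str, float], searched: str,
                      detection_prob: float = 1.0) -> Dict[str, float]:
    """Posterior over object location after searching one place and not finding it.

    `detection_prob` is the chance of seeing the object if it is really there.
    """
    unnormalized = {
        loc: p * ((1.0 - detection_prob) if loc == searched else 1.0)
        for loc, p in belief.items()
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation is impossible under the current belief")
    return {loc: p / total for loc, p in unnormalized.items()}


# Hypothetical LLM-estimated prior for where a mug might be.
llm_prior = {"kitchen": 0.6, "office": 0.3, "garage": 0.1}
posterior = update_after_miss(llm_prior, "kitchen")
```

If the LLM's favorite location turns out to be wrong, the remaining mass is renormalized over the other locations (here roughly 0.75 for the office and 0.25 for the garage), so a poor prior slows the search down but does not derail it.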
Most AV simulation platforms target full-scale vehicles; MonoSIM fills a gap for small-scale Ackermann platforms (e.g., XTENTH-CAR) used widely in education and rapid prototyping, where existing simulators are either built around full-scale vehicles or lack monocular vision support.
Software-in-the-Loop design runs the exact same code in simulation that runs on physical hardware, eliminating the integration gap where simulation-developed code fails on real platforms due to API or timing differences.
Open-source release with documented integration points means robotics education programs gain a no-cost, community-maintained simulation baseline for small-scale vehicle autonomy.
Focuses specifically on monocular vision—the most common and lowest-cost sensing modality for small-scale platforms—making the simulation representative of real hardware constraints rather than assuming richer sensing.
This paper presents an open-source Software-in-the-Loop (SIL) simulation platform designed for autonomous Ackermann vehicle research and education. The proposed framework focuses on simplicity, while making it easy to work with small-scale experimental setups, such as the XTENTH-CAR platform. The system integrates monocular vision with Ackermann steering geometry in a simulation environment that runs the same codebase as the physical platform. The framework is made publicly available to lower the barrier for robotics education and rapid AV prototyping on small-scale platforms, where existing simulation tools are either overly complex or lack appropriate sensing models.
World model-based RL for driving previously required 100-step diffusion sampling per decision, making online planning computationally impractical; DreamerAD compresses this to a single step—an 80× speedup—through distillation of the diffusion model into a latent dynamics model.
Crucially maintains visual interpretability despite aggressive compression: the latent space still produces human-readable predictions of future road states, enabling debugging and trust-building that black-box approaches cannot provide.
Training on real-world driving data rather than simulation directly addresses the sim-to-real gap that is a persistent failure mode for model-based driving approaches trained in synthetic environments.
The distillation technique—compressing a diffusion world model into a single-step latent predictor—is a broadly applicable method that could accelerate any system where diffusion-based world models are the planning bottleneck.
We introduce DreamerAD, the first latent world model framework that enables efficient reinforcement learning for autonomous driving by compressing diffusion sampling from 100 steps to 1, achieving an 80x speedup while maintaining visual interpretability. Training RL policies on real-world driving data using latent world models has been limited by the computational cost of diffusion-based world model sampling. DreamerAD distills the multi-step diffusion process into a single-step latent dynamics model, enabling online RL planning at practical speeds. The compressed latent space preserves visual interpretability, and training on real driving data achieves strong performance without the sim-to-real gap that plagues simulation-trained approaches.
MARL for traffic signals either trains centralized policies (computationally prohibitive at deployment) or fully independent policies (ignoring inter-intersection effects that cause ripple congestion); CoordLight finds a structured middle ground through a decentralized coordination graph.
Agents share compact representations of local state with neighbors rather than full observations, enabling coordination at network scale without the exponential communication overhead of centralized approaches.
Outperforms both fully centralized and fully independent baselines on multi-intersection networks, demonstrating that coordination graph structure is the key variable—not just network capacity.
The coordination architecture is transport-domain-agnostic: the structured local message-passing approach applies equally to any distributed control problem where physical topology constrains agent interactions.
Adaptive traffic signal control (ATSC) is crucial in alleviating congestion, maximizing throughput and promoting sustainable mobility in ever-expanding cities. Multi-Agent Reinforcement Learning (MARL) has recently shown significant potential in addressing complex traffic dynamics, but the intricacies of network-wide coordination remain challenging. We propose CoordLight, a decentralized coordination approach where agents share compact local state representations with neighbors in a coordination graph. CoordLight enables network-wide coordination without centralized computation, outperforming both fully centralized and fully independent baselines on multi-intersection benchmarks, and demonstrating scalability to large traffic networks while maintaining real-time decision making.
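The neighbor-sharing idea can be pictured with a small sketch: each intersection compresses its local observation into a short message, and every agent's policy input is its own observation plus the messages of its neighbors in the coordination graph. The message format here (`compact_summary`, returning total and longest queue) is a hand-written placeholder invented for this example; the summary does not specify how CoordLight actually encodes local state.

```python
from typing import Dict, List, Tuple

Graph = Dict[str, List[str]]  # intersection id -> neighbouring intersection ids


def compact_summary(queues: List[int]) -> Tuple[int, int]:
    """Hypothetical compact message: (total queued vehicles, longest queue)."""
    return sum(queues), max(queues)


def build_agent_inputs(graph: Graph, local_obs: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """Concatenate each agent's own observation with its neighbours' messages."""
    messages = {node: compact_summary(obs) for node, obs in local_obs.items()}
    inputs = {}
    for node, neighbours in graph.items():
        features = list(local_obs[node])   # full local observation
        for nb in sorted(neighbours):      # fixed order keeps the layout stable
            features.extend(messages[nb])  # only the compact message is shared
        inputs[node] = features
    return inputs


if __name__ == "__main__":
    graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
    obs = {"A": [3, 0, 1, 2], "B": [5, 5, 0, 0], "C": [0, 1, 0, 0]}
    for node, features in build_agent_inputs(graph, obs).items():
        print(node, features)
```

The point of the design is that communication cost per agent grows with the number of neighbors times the message size, not with the size of the whole network.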
Han Zheng, Yining Ma, Brandon Araki, Jingkai Chen, Cathy Wu
Core Contributions
Lifelong MAPF (continuous goal assignment vs. one-shot planning) is qualitatively harder than standard MAPF because planning decisions compound over time—bad priority orderings now cause throughput degradation minutes later; this work trains a learned policy specifically to predict high-throughput priority orderings.
The learned priority predictor guides the exact CBS planner rather than replacing it, preserving collision-free guarantees while dramatically improving scalability under heavy load where exhaustive priority search is intractable.
Addresses a real operational pain point in Amazon/Alibaba-style warehouses: system throughput degrades under heavy agent load due to planning bottlenecks, and this approach maintains near-peak throughput at scales where pure CBS would time out.
Demonstrates a general principle: learning can act as a search heuristic for exact planners, combining the correctness of classical methods with the scalability of learned approaches—a hybrid paradigm with broad applicability beyond MAPF.
Lifelong Multi-Agent Path Finding (MAPF) is critical for modern warehouse automation, which requires multiple robots to continuously navigate conflict-free paths to optimize the overall system throughput. However, the complexity of warehouse environments and the long-term dynamics of lifelong MAPF often lead to significant throughput degradation under heavy agent loads due to planning bottlenecks. We propose a learning-guided prioritized planning framework that uses a trained policy to predict high-throughput agent priority orderings for the CBS planner, dramatically reducing the search space while preserving collision-free guarantees. The learning-guided approach maintains near-peak throughput at agent densities where exhaustive priority search is computationally intractable.
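To make the role of a priority ordering concrete, here is a minimal one-shot prioritized-planning sketch: agents plan one at a time in the given order, and each later agent must avoid the cells and edges already reserved by earlier ones. Several things are assumptions for illustration only: the low-level search is a plain space-time breadth-first search rather than the CBS planner the paper uses, the ordering is passed in by hand rather than predicted by a learned policy, and the lifelong goal-reassignment loop is omitted.

```python
from collections import deque


def plan_one(grid, start, goal, reserved, reserved_edges, horizon):
    """Space-time BFS for one agent that avoids higher-priority reservations."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0, [start])])
    seen = {(start, 0)}
    while queue:
        cell, t, path = queue.popleft()
        # Accept the goal only if the agent can rest there until the horizon.
        if cell == goal and all((goal, k) not in reserved for k in range(t, horizon + 1)):
            return path + [goal] * (horizon - t)
        if t == horizon:
            continue
        r, c = cell
        for nxt in ((r, c), (r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if not (0 <= nr < rows and 0 <= nc < cols) or grid[nr][nc] == 1:
                continue
            # Vertex conflict, or swap conflict with a higher-priority agent.
            if (nxt, t + 1) in reserved or (nxt, cell, t) in reserved_edges:
                continue
            if (nxt, t + 1) not in seen:
                seen.add((nxt, t + 1))
                queue.append((nxt, t + 1, path + [nxt]))
    return None


def prioritized_plan(grid, starts, goals, order, horizon=20):
    """Plan agents in priority order; return None if any agent cannot be planned."""
    reserved, reserved_edges, paths = set(), set(), {}
    for agent in order:
        path = plan_one(grid, starts[agent], goals[agent], reserved, reserved_edges, horizon)
        if path is None:
            return None
        paths[agent] = path
        for t, cell in enumerate(path):
            reserved.add((cell, t))
        for t in range(len(path) - 1):
            reserved_edges.add((path[t], path[t + 1], t))
    return paths
```

On an open 3x3 grid, two agents swapping corners are both planned successfully, while in a single-row corridor the same swap returns `None` for any ordering. In less symmetric instances, one ordering may succeed or yield shorter paths where another fails, which is the choice the learned priority predictor is meant to make well.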
Drone light shows require minimizing total trajectory length across thousands of UAVs simultaneously during formation transitions, which is a combinatorial assignment problem; this paper solves the assignment jointly rather than greedily per UAV, reaching solution quality that greedy local methods cannot match.
Joint optimal assignment produces shorter, smoother trajectories with significantly reduced inter-agent conflict and hover time compared to greedy allocation, directly translating to lower battery consumption and more complex achievable formations.
Drone light shows are replacing fireworks as the preferred large-scale entertainment display globally; this optimization enables more complex shows with larger fleets on existing hardware without requiring additional UAV endurance.
The assignment algorithm framework generalizes to any swarm formation transition problem—search and rescue coverage, sensor network reconfiguration, agricultural swarm repositioning—wherever multiple agents must collectively transition between spatial configurations.
Drone light shows (DLShows) represent a rapidly growing application of swarm robotics, creating captivating aerial displays through the synchronized flight of hundreds or thousands of unmanned aerial vehicles (UAVs) as environmentally friendly and reusable alternatives to traditional pyrotechnics. This paper addresses optimal UAV-to-formation-position allocation and trajectory generation for DLShow formation transitions. We formulate joint allocation optimization that minimizes total trajectory length across all agents simultaneously, producing collision-aware trajectories with reduced inter-agent conflicts and hover time compared to greedy assignment. The approach enables more complex formation sequences and larger fleets on existing UAV hardware with improved energy efficiency.
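The gap between greedy and joint allocation shows up even with two drones. The sketch below compares a nearest-free-target greedy rule against an exhaustive search over all assignments. The exhaustive search is only a stand-in that is practical for a handful of agents; the summary does not say which solver the paper uses, and the coordinates are made up for the example.

```python
import itertools
import math


def total_length(starts, targets, assignment):
    """Sum of straight-line distances when drone i flies to targets[assignment[i]]."""
    return sum(math.dist(starts[i], targets[j]) for i, j in enumerate(assignment))


def greedy_assignment(starts, targets):
    """Each drone, in index order, takes the nearest target that is still free."""
    free = set(range(len(targets)))
    assignment = []
    for s in starts:
        j = min(sorted(free), key=lambda k: math.dist(s, targets[k]))
        free.remove(j)
        assignment.append(j)
    return tuple(assignment)


def joint_assignment(starts, targets):
    """Brute-force minimum of total length over all one-to-one assignments."""
    return min(
        itertools.permutations(range(len(targets))),
        key=lambda p: total_length(starts, targets, p),
    )


if __name__ == "__main__":
    starts = [(0.0, 0.0), (2.0, 0.0)]
    targets = [(1.0, 0.0), (-1.2, 0.0)]
    for name, solver in (("greedy", greedy_assignment), ("joint", joint_assignment)):
        a = solver(starts, targets)
        print(name, a, round(total_length(starts, targets, a), 2))
```

Here the first drone greedily grabs the target that the second drone needed, so the greedy total is 4.2 units against 2.2 for the joint optimum. At realistic fleet sizes the brute-force search would be replaced by a polynomial-time assignment solver such as the Hungarian algorithm (available in SciPy as `scipy.optimize.linear_sum_assignment`).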
MARL for traffic control suffers from sparse reward and slow convergence; LATS uses an LLM as a training-time teacher that provides dense reward shaping encoding traffic domain knowledge—without requiring the LLM at deployment where inference latency would be prohibitive.
Unlike prior reward shaping requiring hand-specified heuristics (e.g., "reduce queue length"), the LLM teacher generates contextually appropriate guidance from natural language descriptions of traffic scenarios, making the system adaptable to new city layouts without manual reward engineering.
The teacher-student framework leverages LLM knowledge for exploration without the LLM becoming a deployed component—a principled way to use foundation models in latency-constrained real-time control applications.
Demonstrates that LLMs can accelerate MARL training by orders of magnitude (reaching equivalent performance in far fewer episodes) without being part of the deployed policy, suggesting a general "LLM-as-trainer" paradigm for RL in structured domains.
Adaptive Traffic Signal Control (ATSC) aims to optimize traffic flow and minimize delays by adjusting traffic lights in real time. Recent advances in Multi-agent Reinforcement Learning (MARL) have shown promise for ATSC, yet existing approaches still suffer from limited representational capacity, converge slowly, or require manually specified reward-shaping heuristics. We propose LATS, an LLM-assisted teacher-student framework where an LLM teacher provides dense reward shaping signals encoding traffic domain knowledge during training, while the deployed student policy operates at real-time speeds without LLM involvement. LATS accelerates convergence to high-performance ATSC policies by orders of magnitude compared to baselines, using natural language scenario descriptions to generate contextually appropriate reward guidance without manual heuristic specification.
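One common way to inject a teacher's guidance as a dense signal is potential-based reward shaping, where the shaped reward is r + gamma * phi(s') - phi(s). The sketch below assumes that form purely for illustration; the summary does not state which shaping rule LATS uses. The teacher is replaced by a stub (`stub_teacher_potential`), and the state encoding is hypothetical.

```python
from typing import Callable, Sequence

State = Sequence[int]  # hypothetical encoding: queue length per approach


def shaped_reward(
    env_reward: float,
    state: State,
    next_state: State,
    potential: Callable[[State], float],
    gamma: float = 0.99,
) -> float:
    """Potential-based shaping: r + gamma * phi(s') - phi(s)."""
    return env_reward + gamma * potential(next_state) - potential(state)


def stub_teacher_potential(state: State) -> float:
    """Stand-in for a teacher score; a real teacher would be queried during training only."""
    return -float(sum(state))


if __name__ == "__main__":
    # One vehicle clears the first approach, so the shaped reward is positive
    # even though the sparse environment reward is still zero.
    print(shaped_reward(0.0, [4, 2], [3, 2], stub_teacher_potential, gamma=1.0))
```

Because the shaping term depends only on a potential over states, it can be dropped at deployment without changing what the student policy needs as input, which matches the teacher-at-training-time framing above.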
Cooperative fixed-wing path-following over DEM terrain requires temporal synchronization (all UAVs at the same normalized path position) while each aircraft has limited control authority for speed adjustment—a coupling that prior distributed methods neglect.
Proposes a reference speed adjustment mechanism, applied by each agent in a decentralized manner, that maintains temporal synchronization under perturbations without requiring global replanning or communication with a central coordinator.
Integrates local obstacle replanning that handles sudden terrain features and wind disturbances without disrupting the synchronization of other agents—decoupling local hazard response from formation-level coordination.
DEM-based low-altitude flight planning is an emerging application for autonomous cargo delivery and terrain surveillance; this paper provides one of the first distributed solutions for fixed-wing aircraft operating in this regime.
Multiple fixed-wing unmanned aerial vehicles (multi-UAVs) encounter significant challenges in cooperative path following over complex Digital Elevation Model (DEM) low-altitude airspace, including wind field disturbances, sudden obstacles, and requirements of distributed temporal synchronization during formation flight. We propose a robust distributed cooperative path-following framework with integrated local replanning for multi-UAVs in DEM low-altitude environments. A novel reference speed adjustment mechanism maintains temporal synchronization in a decentralized manner under perturbations, while a local replanning module handles sudden obstacles and wind disturbances without disrupting formation-level coordination. The approach is validated in simulation on complex low-altitude terrain scenarios representing realistic delivery and surveillance missions.
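A simple way to picture decentralized speed adjustment is a proportional consensus law on normalized path progress: each UAV speeds up if its neighbors are ahead and slows down if they are behind, within the narrow speed band a fixed-wing aircraft can tolerate. The law, gain, and speed limits below are assumptions chosen for illustration and are not the paper's mechanism.

```python
def adjusted_speed(i, progress, neighbours, v_nom, gain, v_min, v_max):
    """Reference speed for UAV i from its neighbours' normalized path progress."""
    errors = [progress[j] - progress[i] for j in neighbours[i]]
    correction = gain * sum(errors) / len(errors) if errors else 0.0
    # Fixed-wing aircraft cannot stop or hover, so clamp to a feasible band.
    return min(v_max, max(v_min, v_nom + correction))


def simulate(progress, path_lengths, neighbours, steps, dt=0.1,
             v_nom=20.0, gain=50.0, v_min=15.0, v_max=25.0):
    """Integrate normalized progress (0..1) for all UAVs over a number of steps."""
    progress = list(progress)
    for _ in range(steps):
        speeds = [
            adjusted_speed(i, progress, neighbours, v_nom, gain, v_min, v_max)
            for i in range(len(progress))
        ]
        progress = [
            min(1.0, s + v * dt / length)
            for s, v, length in zip(progress, speeds, path_lengths)
        ]
    return progress


if __name__ == "__main__":
    neighbours = {0: [1], 1: [0, 2], 2: [1]}
    start = [0.0, 0.05, 0.10]
    end = simulate(start, [2000.0, 2000.0, 2000.0], neighbours, steps=100)
    print("spread before:", max(start) - min(start))
    print("spread after: ", max(end) - min(end))
```

Each UAV only reads its neighbors' progress, so no central coordinator is involved. A purely proportional law like this one leaves a residual offset when path lengths differ, which is one reason a real system would need something more elaborate.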
Prior cooperative pursuit methods assume privileged ground-truth positions for all agents, sidestepping the perceptual uncertainty that dominates real deployment; PSTO operates from raw noisy ego-centric observations, modeling future positions of all agents through predictive spatio-temporal representations.
The Predictive Spatio-Temporal Observation module enables each agent to predict teammate movements from observation alone, enabling implicit coordination without explicit communication—critical for RF-denied environments.
Deep RL training in simulation transfers to physical hardware with minimal fine-tuning, suggesting the PSTO representation is sufficiently domain-invariant to handle the gap between simulated and real sensor noise.
The 3D cluttered pursuit setting is significantly harder than the 2D or open-space settings studied in most prior pursuit literature; demonstrating decentralized success in this regime validates the approach for realistic aerial operations.
Decentralized cooperative pursuit in cluttered environments is challenging for autonomous aerial swarms, especially under partial and noisy perception. Existing methods often rely on abstracted geometric features or privileged ground-truth states, and therefore sidestep perceptual uncertainty in realistic deployments. We propose a Predictive Spatio-Temporal Observation (PSTO) module that enables each agent to model future positions of all agents from noisy ego-centric observations, enabling implicit coordination without explicit communication. Deep RL policies trained with PSTO in simulation transfer to physical hardware with minimal fine-tuning, demonstrating decentralized cooperative pursuit in 3D cluttered environments under realistic sensor noise conditions.
Misato Sonoda, Ronan Hinchet, Amirhossein Kazemipour, Yasunori Toshimitsu, Robert K. Katzschmann
Core Contributions
Traditional robotic hands require complex external force sensors and control loops for safe contact; this design achieves inherent compliance through electrohydraulic actuator fluid dynamics, eliminating external sensors entirely while maintaining safe interaction behavior.
The musculoskeletal architecture—tendons pulled by hydraulic actuators mimicking biological muscle-tendon pairs—provides passive compliance that gracefully handles unexpected contact without requiring pre-programmed contact policies.
Demonstrates successful grasping of diverse objects without explicit force control, suggesting the passive compliance alone is sufficient for general manipulation—a significant simplification of the control stack compared to sensor-dependent approaches.
Electrohydraulic actuation remains rare in dexterous manipulation; this paper provides detailed characterization of the trade-offs (bandwidth, force density, compliance range) that practitioners need to evaluate the technology for specific applications.
Robotic manipulation in unstructured environments requires end-effectors that combine high kinematic dexterity with physical compliance. While traditional rigid hands rely on complex external sensors for safe interaction, electrohydraulic actuators offer a promising alternative. This paper presents a sensorless anthropomorphic musculoskeletal robotic hand driven by electrohydraulic actuators. The musculoskeletal architecture provides inherent compliance through passive fluid dynamics, enabling safe interaction without external force sensors. We characterize the hand's actuation properties and demonstrate successful object grasping across diverse shapes without explicit force control, showing that passive compliance alone is sufficient for robust unstructured manipulation.
Soft actuators for endoluminal robotics must fit in millimeter-scale channels while producing surgical-grade forces, a combination that neither macro-scale pneumatic robots nor MEMS actuators achieve; the fibre-reinforced design specifically targets this gap at centimeter scale.
Fibre reinforcement constrains radial expansion so that nearly all pneumatic energy converts to bending, achieving high force-to-size ratios that unreinforced soft actuators at this scale cannot match.
Detailed analytical modeling and experimental characterization produce predictable bending behavior enabling model-based control without per-device calibration—a critical requirement for surgical deployment where calibration time is unacceptable.
Targets Natural Orifice Transluminal Endoscopic Surgery (NOTES) procedures, where flexible instrument channels make traditional rigid-shaft actuators unusable; demonstrating sufficient force and controllability at centimeter scale opens a viable path toward autonomous endoluminal robotics.
Miniaturised soft pneumatic actuators are crucial for robotic intervention within highly constrained anatomical pathways. This work presents the design and validation of a fibre-reinforced soft actuator at the centimetre scale for integration into an endoluminal robotic platform for natural-orifice transluminal endoscopic surgery (NOTES). The fibre reinforcement constrains radial expansion, converting the majority of pneumatic energy to bending and achieving force-to-size ratios that unreinforced actuators at this scale cannot achieve. Analytical modeling produces a predictable bending-pressure relationship validated experimentally, enabling model-based control without per-device calibration. The actuator demonstrates sufficient force and range of motion for endoluminal surgical intervention in representative anatomical phantom testing.
Xiangyi Wei, Fei Wang, Haotian Zhang, Xin An, Haitian Zhu
Core Contributions
Chemical lab automation historically requires bespoke programming per procedure; AgentChemist uses a multi-agent architecture—separate perception, planning, and execution agents coordinating around a shared task representation—that handles long-tail experimental tasks without procedure-specific programming.
The architecture decomposes novel experimental procedures into known primitives that individual agents master independently, enabling generalization to infrequent tasks that rigid workflow automation systems fail on.
Integration of chemical perception (identifying reagents, reading instrument displays, interpreting visual cues) with precise robotic control (dispensing, mixing, transferring) in a single unified platform addresses a key gap where prior systems excelled at one but not the other.
Demonstrates that the same agent architecture generalizes from common procedures to novel ones, suggesting the approach is scalable to new chemistry domains without architectural redesign—unlike previous single-task lab robots.
Chemical laboratory automation has long been constrained by rigid workflows and poor adaptability to the long-tail distribution of experimental tasks. While most automated platforms perform well on a narrow set of standardized procedures, real laboratories involve diverse, infrequent, and evolving experimental tasks that require flexible, adaptive systems. We present AgentChemist, a multi-agent robotic platform integrating chemical perception and precise manipulation control. Separate perception, planning, and execution agents coordinate around a shared task representation, enabling generalization to long-tail experimental tasks by decomposing novel procedures into mastered primitives. AgentChemist demonstrates successful execution of both common and novel experimental tasks in a real chemistry laboratory setting.
Aditya Narendra, Mukhammadrizo Maribjonov, Dmitry Makarov, Dmitry Yudin, Aleksandr Panov
Core Contributions
Multi-task manipulation RL either trains separate per-task policies (not scalable) or monolithic multi-task policies (suffering from gradient interference between tasks with conflicting requirements); KG-M3PO uses a knowledge graph to represent task structure, sharing representations between related tasks while isolating conflicting ones.
Augments ego-centric vision with an online 3D scene graph updated from perception, giving the policy persistent spatial memory of the environment—addressing the partial observability that causes monolithic policies to fail on spatially extended manipulation tasks.
The knowledge graph structure provides debugging transparency: developers can inspect which task representations are shared, diagnose task-interference failures, and add new tasks incrementally without retraining the full policy.
Demonstrates that structured knowledge representation (as opposed to unstructured neural multi-task learning) improves both average and worst-case performance across the task distribution, suggesting that explicit task structure is a useful inductive bias for manipulation.
This paper introduces Knowledge Graph based Massively Multi-task Model-based Policy Optimization (KG-M3PO), a framework for multi-task robotic manipulation in partially observable settings that unifies Perception, Knowledge, and Policy. The method augments egocentric vision with an online 3D scene graph updated from perception, providing persistent spatial memory. A knowledge graph representing task structure guides representation sharing between related tasks while isolating conflicting ones, avoiding the gradient interference that plagues monolithic multi-task policies. KG-M3PO demonstrates improved average and worst-case performance across diverse manipulation tasks compared to both single-task and unstructured multi-task baselines, with interpretable task representations enabling efficient addition of new tasks.
🔬 Perception, Simulation & Special Applications
4 papers
Self-awareness in robots has been philosophically discussed but operationally undefined; this paper proposes a rigorous operationalization: the "self" is the invariant component of the robot's internal representation that persists across diverse experiences—isolatable by seeking what remains constant while the world model updates.
Uses continual learning to let the self-representation emerge from the constraint that it must compress information about the robot's own body accurately across morphological variations and damage scenarios, without being hand-crafted.
The emergent self is functionally useful: deviations between current and expected self-state serve as anomaly detectors for robot damage or miscalibration, turning the abstract concept of self-awareness into a practical fault detection mechanism.
Provides a principled framework for studying meta-cognitive capabilities in robots—a research direction that bridges cognitive science and robotics in ways that benchmark-driven ML research typically does not.
A key challenge to understanding self-awareness has been a principled way of quantifying whether an intelligent system has a concept of a "self," and if so how to differentiate the "self" from other cognitive structures. We propose that the "self" can be isolated by seeking the invariant portion of an agent's internal representation across diverse experiences. Using a continual learning framework, we show that a robot's self-representation emerges from the requirement that it must accurately encode the robot's own body state across varied scenarios and morphological changes. The emergent self is not hand-crafted but arises from compression constraints, and demonstrates functional utility as an anomaly detector for damage and miscalibration, providing a rigorous operationalization of machine self-awareness.
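As a toy reading of "the invariant portion of an internal representation," the sketch below marks latent dimensions whose variance across experiences stays under a tolerance, stores their mean as a template, and scores a new latent by its largest deviation from that template. This variance-threshold rule is an assumption made for illustration; the paper's continual-learning procedure is not described in enough detail here to reproduce, and all names and numbers are hypothetical.

```python
from statistics import mean, pvariance


def invariant_dims(latents, tol=1e-2):
    """Indices of latent dimensions that barely change across experiences."""
    dims = range(len(latents[0]))
    return [d for d in dims if pvariance([z[d] for z in latents]) < tol]


def self_template(latents, dims):
    """Mean value of each invariant dimension, used as the expected self-state."""
    return {d: mean(z[d] for z in latents) for d in dims}


def anomaly_score(latent, template):
    """Largest deviation of the current latent from the expected self-state."""
    return max((abs(latent[d] - v) for d, v in template.items()), default=0.0)


if __name__ == "__main__":
    # Three made-up latents: dimensions 0 and 2 are stable, dimension 1 tracks the world.
    experiences = [[1.0, 0.2, 5.0], [1.0, 0.9, 5.1], [1.0, -0.4, 4.9]]
    dims = invariant_dims(experiences)
    template = self_template(experiences, dims)
    print("invariant dims:", dims)
    print("healthy score:", anomaly_score([1.0, 0.7, 5.0], template))
    print("damaged score:", anomaly_score([1.6, 0.7, 5.0], template))
```

A large score on the invariant dimensions is what would flag damage or miscalibration in this toy version, while changes in the world-tracking dimension are ignored.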
2026-03-25 | cs.CV, cs.GR, cs.RO | Rocktim Jyoti Das (h=9)
Rocktim Jyoti Das, Dinesh Manocha
Core Contributions
Physics-based simulation requires spatially varying material property fields (Young's modulus, density) per asset that are expensive to measure manually; SLAT-Phys predicts them directly from 3D geometry, enabling automated digital twin generation at scale.
Unlike prior vision-based methods that infer only scalar bulk properties (one value per object), SLAT-Phys predicts full spatially varying fields—capturing heterogeneous materials like composite structures where bulk properties are inadequate.
Leverages structured 3D latents from a pretrained 3D encoder, showing that geometric structure encodes material property information that a simple decoder can extract efficiently—establishing a direct geometric-to-physical mapping that bypasses expensive measurement.
Practical impact on robotics simulation: automated material field estimation could enable large-scale digital twin generation where manual property specification is the current bottleneck, directly accelerating sim-to-real transfer research.
Estimating the material property field of 3D assets is critical for physics-based simulation, robotics, and digital twin generation. Existing vision-based approaches are either too expensive and slow or rely on 3D information. We present SLAT-Phys, an end-to-end method that predicts spatially varying material property fields from structured 3D latents extracted by a pretrained 3D foundation model. Unlike prior methods that estimate scalar bulk material properties, SLAT-Phys predicts full spatial property fields capturing material heterogeneity. The structured 3D latents encode geometric information that correlates strongly with material distribution, enabling efficient property field prediction through a lightweight decoder. SLAT-Phys is evaluated on diverse 3D assets and demonstrates accurate material field prediction with significant speedup over existing approaches.
Elaheh Sanoubari, Alicia Pan, Keith Rebello, Neil Fernandes, Andrew Houston
Core Contributions
Social robots in education almost universally serve as information-dispensing tutors; this work explores Robot-Mediated Applied Drama (RMAD) as a fundamentally different paradigm where robots are life-like puppets in interactive dramatic narratives, activating emotional and narrative engagement rather than didactic instruction.
The REMind system targets memory and social engagement for elderly users through improv-style interactive drama, where robots play characters that respond to user participation—an application domain where engagement depth matters more than information transfer accuracy.
User studies show higher emotional engagement and sustained interaction compared to conventional tutoring robots, suggesting that the dramatic/artistic register activates qualitatively different cognitive and social processes than tutoring.
Raises underexplored design questions about robot aesthetics as distinct from robot capability: how should a robot feel to interact with as a dramatic partner, not just what tasks should it perform—a framing that could reshape HRI design methodology.
Social robots are increasingly used in education, but most applications cast them as tutors offering explanation-based instruction. We explore an alternative: Robot-Mediated Applied Drama (RMAD), in which robots function as life-like puppets in interactive dramatic experiences designed to support reflective learning and emotional engagement. We present REMind, an RMAD application for memory support and social engagement with elderly users. Robots play characters in improv-style narratives that respond to user participation, activating emotional and narrative engagement rather than didactic instruction. User studies demonstrate higher emotional engagement and sustained interaction compared to conventional tutoring robot applications, and surface design questions about robot aesthetics — how robots should feel to interact with as dramatic partners, beyond what tasks they perform.
Michael Somma, Markus Großpointner, Paul Zabalegui, Eppu Heilimo, Branka Stojanović
Core Contributions
Robotic systems are networked cyber-physical systems with unique attack surfaces—network topology tied to physical motion, safety-critical real-time loops—that generic pentesting tools miss; grounding security assessment in the robot's operational context is the core contribution.
The multi-agent workflow combines LLM-based planning with environment-specific knowledge graphs representing the robot's network topology, active protocols, and failure modes, enabling identification of vulnerabilities that domain-agnostic security tools systematically overlook.
Addresses a gap where robot operators typically lack security expertise and security experts lack robotics domain knowledge—the system bridges this expertise gap autonomously.
Surfaces a dual-use concern: the same environment-grounded planning capability that finds vulnerabilities for defense could be repurposed for attack, making responsible disclosure protocols and access control for such systems critical design considerations.
The increasing complexity and interconnectivity of digital infrastructures make scalable and reliable security assessment methods essential. Robotic systems represent a particularly important class of operational technology, as modern robots are highly networked cyber-physical systems deployed in domains where security breaches can have physical consequences. We present an environment-grounded multi-agent workflow for autonomous penetration testing of robotic systems. The workflow combines LLM-based planning with environment-specific knowledge graphs representing the robot's network topology, active protocols, and physical failure modes. By grounding security assessment in the robot's operational context, the system identifies vulnerabilities that generic pentesting tools miss, demonstrated on representative robotic platforms in laboratory testing scenarios.