Daily curated robotics research, ranked by author h-index
2026-03-26 · 30 Papers · 7 Research Areas · Generated by Claude
Research Landscape
Today's Research Overview
Two forces dominate March 26's batch: the pervasive adoption of Vision-Language-Action (VLA) models across every robotics domain, and a sophisticated reckoning with their limitations. No fewer than seven papers either propose new VLA architectures or stress-test existing ones. MMaDA-VLA builds a fully diffusion-native backbone that unifies multi-modal understanding and action generation in a single forward pass, directly challenging the hierarchical pipelines that chain separate vision, language, and action modules. Fast-dVLA and LaMP attack complementary inefficiencies: the former makes discrete diffusion policies real-time capable, while the latter inserts 3D scene flow as an explicit geometric prior so policies aren't forced to infer spatial relationships implicitly from 2D features. Meanwhile SABER takes an adversarial lens, showing that the same natural-language instruction channel that makes VLAs flexible is also a stealthy attack surface: minimal text edits can alter robot behavior without triggering human review. Together these papers suggest the field has moved past "do VLAs work?" into a harder engineering phase: how to make them fast, safe, physically grounded, and robust to manipulation.
The second dominant theme is scalable multi-agent coordination in traffic, drone fleets, and smart infrastructure. COIN (multi-agent self-driving), CROSS (city-scale traffic signal control with a Mixture-of-Experts RL framework), CTS-PLL (multi-robot task sequencing with anytime refinement), and IMD-TAPP (joint multi-drone allocation and trajectory planning) collectively represent a shift from single-agent optimization to fleet-level decision-making. A notable cross-paper thread is the rejection of the planning hierarchy: IMD-TAPP argues that solving task allocation, sequencing, and trajectory generation jointly outperforms sequential pipelines; CTS-PLL's lock-detection mechanism similarly addresses failures that arise only when agents interact. CROSS and the R1-style traffic simulation paper (paper 16) share an RL flavor but attack opposite sides of the same problem: CROSS improves real-time signal control, while the simulation paper improves the quality of the synthetic traffic data used to train and evaluate such controllers.
A quieter but important undercurrent is the reliability gap in long-horizon and autoregressive systems. SoftMimicGen extends data-generation pipelines to deformable objects, attacking the data scarcity that limits manipulation learning. The Persistent World Models paper (paper 25) directly addresses error compounding in autoregressive video prediction (the same structural problem that afflicts LLM long-form generation), using RL post-training as the fix, a design pattern borrowed explicitly from language model alignment. The Mahjong system design paper (paper 20) makes the strongest argument: many long-horizon robot failures are not AI failures but systems engineering failures, the absence of cross-module consistency checks rather than weak perception or planning. Taken together, today's papers paint a field that is simultaneously scaling up (larger VLAs, fleet coordination, city-scale control) and engineering down (robustness, reliability, real-time performance).
Research Areas
VLA & Foundation Models
Architecture, efficiency, safety, and grounding of Vision-Language-Action models
First systematic black-box adversarial framework targeting VLAs through the natural-language instruction channel, a security surface that visual perturbation attacks miss entirely; the attack requires no access to model weights or gradients.
Uses GRPO (Group Relative Policy Optimization) to train an attacker agent that generates minimal text edits within a bounded edit budget, producing adversarial instructions that remain semantically plausible to human reviewers and therefore hard to detect via manual content filtering; a toy attacker loop is sketched after the abstract below.
The "stealthy" property distinguishes SABER from prior adversarial NLP work: instructions look like natural rephrasings rather than obvious corruption, meaning compromised commands could pass through safety review undetected.
Raises a concrete deployment security concern: if user instructions can be subtly poisoned (through adversarial UIs, manipulated command histories, or compromised voice input), robots may execute unsafe behaviors without triggering any existing guardrail.
The black-box, agent-centric design generalizes across VLA architectures, demonstrating vulnerabilities in multiple models without model-specific tuning and suggesting a systemic rather than implementation-specific weakness.
Vision-language-action (VLA) models enable robots to follow natural-language instructions grounded in visual observations, but the instruction channel also introduces a critical vulnerability: small textual perturbations can alter downstream robot behavior. Systematic robustness evaluation therefore requires a black-box attacker that can generate minimal yet effective instruction edits across diverse VLA models. To this end, we present SABER, an agent-centric approach for automatically generating instruction-based adversarial attacks on VLA models under bounded edit budgets. SABER uses a GRPO-based training strategy to optimize the attacker for effectiveness and stealthiness, enabling systematic evaluation of VLA robustness without requiring model internals.
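The mechanics of a GRPO-style attacker are easy to sketch. The following is a minimal illustration, not the authors' implementation: `edit_model`, `vla_rollout`, and `semantic_similarity` are hypothetical stand-ins, and the reward shaping (attack effect plus a stealth bonus) is an assumption about how effectiveness and stealthiness could be combined.

```python
# Illustrative GRPO-style attacker update; `edit_model`, `vla_rollout`, and
# `semantic_similarity` are hypothetical stand-ins, not SABER's components.
import numpy as np

def grpo_attacker_step(edit_model, vla_rollout, semantic_similarity,
                       instruction, group_size=8, edit_budget=3, stealth_weight=0.5):
    """Sample a group of candidate instruction edits, score each, and update the
    attacker with group-relative advantages (the core of GRPO)."""
    candidates, rewards = [], []
    for _ in range(group_size):
        # Candidate adversarial instruction with at most `edit_budget` token edits.
        adv_instruction = edit_model.sample(instruction, max_edits=edit_budget)
        # Attack effect: reward is high when the robot fails the commanded task.
        task_success = vla_rollout(adv_instruction)
        effect = 1.0 - float(task_success)
        # Stealthiness: the edit should stay semantically close to the original.
        stealth = semantic_similarity(instruction, adv_instruction)
        candidates.append(adv_instruction)
        rewards.append(effect + stealth_weight * stealth)

    rewards = np.asarray(rewards, dtype=np.float64)
    # GRPO: normalize rewards within the sampled group to get advantages.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    edit_model.update(instruction, candidates, advantages)   # policy-gradient step
    return candidates[int(np.argmax(rewards))]
```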
Johnathan Tucker, Denis Liu, Aiden Swann, Allen Ren, Javier Yu
Core Contributions
Takes π0, a manipulation-pretrained VLA, and transfers it to aerial pick-and-place, a domain with fundamentally incompatible dynamics: fixed-base arms are quasi-static while quadrotors are underactuated and highly dynamic.
Key empirical finding: visual representations from manipulation VLAs transfer surprisingly well to aerial tasks, but the action prediction head fails because it has learned dynamics assumptions (no gravity compensation, quasi-static equilibrium) that don't hold for flight.
Introduces a physics-guided adapter that corrects for control latency and aerodynamic effects absent in arm training, analogous to domain randomization but applied at the dynamics level rather than the visual level.
Demonstrates real-world aerial manipulation experiments, showing that retraining the full VLA from scratch for aerial robots is unnecessary: representations are more transferable than dynamics assumptions, a finding with broad implications for cross-platform VLA reuse.
Directly challenges a common assumption in the field that different robot morphologies require separate foundation models; instead, targeted physics adapters can bridge large kinematic and dynamic gaps.
Vision-Language-Action (VLA) models such as π0 have demonstrated remarkable generalization across diverse fixed-base manipulators. However, transferring these foundation models to aerial platforms remains an open challenge due to the fundamental mismatch between the quasi-static dynamics of fixed-base arms and the underactuated, highly dynamic nature of flight. In this work, we introduce AirVLA, a system that investigates the transferability of manipulation-pretrained VLAs to aerial pick-and-place tasks. We find that while visual representations transfer effectively, the specific control dynamics require physics-guided adaptation. AirVLA demonstrates successful aerial manipulation in real-world experiments using a physics-informed adapter approach.
Identifies a specific finetuning dilemma for discrete diffusion VLAs: standard SFT converges slowly and underperforms, while auxiliary-task objectives improve capability but multiply compute cost through additional losses.
Decouples capability improvement from inference cost by using the auxiliary objectives only during training (distillation-style), producing a policy that has absorbed their benefits at zero additional inference overhead and therefore runs in real time; the training/inference split is sketched after the abstract below.
Bridges the expressiveness of diffusion-based action generation with the latency constraints of reactive robot control, a prerequisite for discrete diffusion VLAs to be deployable on physical hardware.
The decoupling principle extends beyond VLAs: any system where powerful training signals are too expensive to retain at inference can benefit from this train-heavy, deploy-light design pattern.
Complements MMaDA-VLA (paper 8, same first author group) by addressing inference speed while MMaDA addresses architecture unification; together they advance a coherent research program on diffusion-based robot policies.
Pretrained VLA models often converge slowly and underperform when adapted through standard supervised finetuning (SFT). Advanced finetuning methods with auxiliary training objectives can improve performance and reduce the number of convergence steps, but they typically incur significant computational overhead due to the additional losses from the auxiliary tasks. To combine the enhanced capabilities of auxiliary training with the simplicity of standard SFT, we decouple the two objectives, enabling real-time performance for discrete diffusion VLA models without sacrificing the capability gains from auxiliary training.
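The train-heavy, deploy-light pattern the summary describes can be made concrete with a short sketch. This is an assumed structure, not Fast-dVLA's code: `policy.encode`, `policy.action_loss`, `policy.decode_actions`, and the `aux_heads` interface are hypothetical, and the only point is that the auxiliary losses appear in the training step and nowhere in the inference path.

```python
# Assumed train-heavy / deploy-light structure (not Fast-dVLA's code): auxiliary
# losses shape the representation during training and disappear at inference.
import torch

def training_step(policy, aux_heads, batch, optimizer, aux_weight=0.1):
    features = policy.encode(batch["obs"], batch["instruction"])
    action_loss = policy.action_loss(features, batch["actions"])      # standard SFT term
    aux_loss = sum(head.loss(features, batch) for head in aux_heads)  # extra supervision
    loss = action_loss + aux_weight * aux_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)

@torch.no_grad()
def inference_step(policy, obs, instruction):
    # The auxiliary heads are simply not called here, so deployment latency
    # matches a plain SFT policy.
    features = policy.encode(obs, instruction)
    return policy.decode_actions(features)
```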
Yang Liu, Pengxiang Ding, Tengyue Jiang, Xudong Wang, Wenxuan Song
Core Contributions
Unlike hierarchical VLA pipelines that chain separate vision encoders, language models, and action heads (incurring architectural overhead and temporal inconsistency across stages), MMaDA-VLA uses a single native discrete diffusion model that handles all modalities in one unified forward pass.
The discrete diffusion formulation allows iterative refinement of action predictions rather than single-pass autoregressive commitment, reducing error accumulation in multi-step tasks and providing a natural mechanism for uncertainty-aware action generation.
Built on a pre-trained large multimodal diffusion foundation, enabling zero-shot generalization to new task descriptions and visual layouts unseen during robot finetuning; the manipulation-specific training layer is thin relative to the pre-trained backbone.
Eliminates the need for external dynamics or world models: environment state is captured through the diffusion model's multi-modal context window, reducing the number of separately-trained components that can diverge during deployment.
Makes a fundamental architectural argument against the current "glue two pretrained models together" paradigm that dominates VLA design, proposing instead that native pre-training across all modalities is worth the engineering investment.
Vision-Language-Action (VLA) models aim to control robots for manipulation from visual observations and natural-language instructions. However, existing hierarchical and autoregressive paradigms often introduce architectural overhead, suffer from temporal inconsistency and long-horizon error accumulation, and lack a mechanism to capture environment dynamics without extra modules. To this end, we present MMaDA-VLA, a fully native pre-trained large diffusion VLA model that unifies multi-modal understanding and generation in a single framework. Our key idea is a native discrete diffusion formulation that jointly handles vision, language, and action modalities, enabling coherent multi-step robot control.
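To make the iterative-refinement idea concrete, here is a generic mask-and-commit decoding loop of the kind used by discrete diffusion models. It is a sketch under the assumption of MaskGIT-style sampling; `model`, the token shapes, and the commit schedule are illustrative rather than MMaDA-VLA's actual procedure.

```python
# Generic mask-and-commit decoding for discrete action tokens (MaskGIT-style
# sampling assumed); `model` and the commit schedule are illustrative.
import torch

def iterative_action_decode(model, vision_tokens, text_tokens,
                            num_action_tokens=16, steps=8, mask_id=0):
    device = vision_tokens.device
    actions = torch.full((num_action_tokens,), mask_id, dtype=torch.long, device=device)
    unknown = torch.ones(num_action_tokens, dtype=torch.bool, device=device)
    for step in range(steps):
        remaining = int(unknown.sum())
        if remaining == 0:
            break
        # One unified forward pass over vision, language, and (partially masked) actions.
        logits = model(vision_tokens, text_tokens, actions)   # (num_action_tokens, vocab)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        conf = conf.masked_fill(~unknown, -1.0)                # only rank masked positions
        # Commit the most confident fraction of the still-masked tokens this round.
        k = max(1, remaining // (steps - step))
        commit = conf.topk(k).indices
        actions[commit] = pred[commit]
        unknown[commit] = False
    return actions
```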
Addresses a genuine safety blind spot in current VLA frameworks: robots are thermally blind, with no mechanism to detect the hot surfaces, boiling liquids, or overheated machinery that humans instinctively avoid through thermal perception and memory.
Integrates a thermal camera stream alongside RGB-D into a VLA pipeline, using a VLM as a high-level planner that incorporates thermal context when reasoning about task steps; thermal data is not just an add-on sensor stream but a planning-level input.
Unlike simple thresholding (flag anything above X°C), the system reasons about thermal relevance to the current task: a hot stovetop is hazardous during food retrieval but irrelevant during shelf scanning.
Demonstrates concrete behavioral changes from thermal awareness: the system plans alternative approach trajectories around hot objects and avoids contact with thermally anomalous surfaces, behaviors invisible to purely RGB-based policies.
Opens a largely neglected sensing modality for robot manipulation: thermal cameras are cheap and increasingly available, making this a practically deployable safety enhancement for kitchen, lab, and industrial automation.
In recent human-robot collaboration environments, there is a growing focus on integrating diverse sensor data beyond visual information to enable safer and more intelligent task execution. Although thermal data can be crucial for enhancing robot safety and operational efficiency, its integration has been relatively overlooked in prior research. This paper proposes a novel Vision-Language-Action (VLA) framework that incorporates thermal information for robot task execution. The proposed system leverages a Vision-Language Model (VLM) as a high-level planner to interpret complex natural language instructions with thermal context, enabling safer and more context-aware robotic manipulation.
Xinkai Wang, Chenyi Wang, Yifu Xu, Mingzhe Ye, Fu-Cheng Zhang
Core Contributions
Most VLA models regress actions from 2D semantic features, requiring the policy network to implicitly recover 3D spatial relationships from flat image representations; LaMP makes these relationships explicit by generating dense 3D scene flow as a geometric intermediate before action prediction.
Dual-expert architecture: a Motion Expert (flow-matching based) generates dense 3D flow predictions capturing "where each point in the scene will move," while an Action Expert queries this geometric prior via gated cross-attention to predict robot actions.
The gated cross-attention mechanism is critical: the Action Expert can selectively ignore flow information when the task doesn't require precise 3D reasoning, making LaMP robust across both geometrically demanding and simpler tasks.
Outperforms standard VLA baselines on contact-rich and precisely-constrained manipulation tasks (e.g., peg insertion, constrained placement) where 3D spatial accuracy matters most, while maintaining comparable performance on visually-guided reaching tasks.
The flow-as-latent-prior design is architecturally modular: it can be combined orthogonally with other VLA improvements (diffusion heads, chain-of-thought reasoning) and could be pre-trained from large unlabeled video datasets since 3D flow doesn't require robot action labels.
We introduce LaMP, a dual-expert Vision-Language-Action framework that embeds dense 3D scene flow as a latent motion prior for robotic manipulation. Existing VLA models regress actions directly from 2D semantic visual features, forcing them to learn complex 3D physical interactions implicitly. This implicit learning strategy degrades under unfamiliar spatial dynamics. LaMP addresses this limitation by aligning a flow-matching Motion Expert with a policy-predicting Action Expert through gated cross-attention. Specifically, the Motion Expert generates a one-step partially observable 3D scene flow as an intermediate geometric representation, which the Action Expert uses as a structured prior for precise manipulation policy learning.
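A minimal version of the gated cross-attention described above can be written in a few lines of PyTorch. The dimensions, zero-initialized gate, and single-block structure are assumptions for illustration, not LaMP's exact design; the point is only that the Action Expert's tokens attend to flow tokens as strongly or as weakly as the learned gate allows.

```python
# Minimal gated cross-attention block (illustrative; dimensions, gating form,
# and initialization are assumptions, not LaMP's exact design).
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Zero-initialized gate: the Action Expert starts by ignoring flow features
        # and learns how much geometric context each task actually needs.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, action_tokens, flow_tokens):
        # action_tokens: (B, A, dim) queries; flow_tokens: (B, F, dim) keys/values.
        attended, _ = self.attn(query=self.norm(action_tokens),
                                key=flow_tokens, value=flow_tokens)
        return action_tokens + torch.tanh(self.gate) * attended
```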
Motonari Kambara, Koki Seno, Tomoya Kaichi, Yanan Wang, Komei Sugiura
Core Contributions
Uses 2D optical flow as the action representation rather than joint angles or end-effector deltas: predicting "where each object will move" is more transferable across robot morphologies and enables training directly on human and web videos of object manipulation without any robot action labels.
"Object-centric" is the critical design choice: LILAC generates flow only for the specific object referenced in the natural language instruction, sharply reducing the representation space compared to dense whole-scene flow and improving instruction-to-flow alignment.
Open-loop trajectory generation is deliberate: unlike closed-loop policies that correct for errors reactively, LILAC commits to a predicted trajectory, forcing the model to develop robust predictive representations rather than relying on visual feedback correction.
Directly addresses the embodiment gap: optical flow is robot-agnostic, meaning the same model can generate trajectories executable on different manipulator morphologies, and the training corpus can include human hand videos alongside robot demonstrations.
Demonstrates that language-to-flow alignment is achievable with far less robot-specific data than standard imitation learning, pointing toward a training paradigm where internet-scale video provides the primary supervision signal.
We address language-conditioned robotic manipulation using flow-based trajectory generation, which enables training on human and web videos of object manipulation and requires only minimal embodiment-specific data. This task is challenging, as object trajectory generation from pre-manipulation images and natural language instructions requires appropriate instruction-flow alignment. To tackle this challenge, we propose the flow-based Language Instruction-guided open-Loop ACtion generator (LILAC). This flow-based Vision-Language-Action model (VLA) generates object-centric 2D optical flow from an initial image and natural language instruction, which is then converted into robot trajectories for execution.
Yuqian Shao, Xiaosong Jia, Langechuan Liu, Junchi Yan
Core Contributions
Identifies a practical gap in E2E autonomous driving evaluation: current benchmarks optimize for safety and route completion but completely ignore driver personalization, i.e., whether a user can specify desired speed or overtaking behavior and have the system honor it.
Introduces Bench2Drive-Speed with a new Speed-Adherence Score (SAS) metric that quantifies how closely a policy tracks user-specified speed targets, distinct from travel-speed metrics that only measure what the system does, not whether it respects user intent; one plausible form of such a metric is sketched after the abstract below.
Tests multiple SOTA E2E baselines and finds they largely ignore speed conditioning inputs: they achieve competitive overall driving scores while failing to differentiate between a user requesting 30 km/h versus 60 km/h, exposing a fundamental UX gap.
The overtake/follow instruction component treats driver preference as a discrete behavioral mode, not just a speed parameter β acknowledging that personalization spans both continuous (how fast) and categorical (what driving style) dimensions.
Adoption of autonomous vehicles depends partly on whether drivers feel in control of their vehicle's character; a benchmark that measures instruction adherence directly operationalizes this often-ignored user experience requirement.
End-to-end autonomous driving (E2E-AD) has achieved remarkable progress. However, one practical and useful function has been long overlooked: users may wish to customize the desired speed of the policy or specify whether to allow the autonomous vehicle to overtake. To bridge this gap, we present Bench2Drive-Speed, a benchmark with metrics, dataset, and baselines for desired-speed conditioned autonomous driving. We introduce explicit inputs of users' desired target-speed and overtake/follow instructions to driving policy models. We design quantitative metrics, including Speed-Adherence Score and evaluate multiple baseline models, revealing that current methods largely fail to honor user speed preferences.
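One plausible form of a speed-adherence metric is shown below, purely to make the "measure intent, not just behavior" distinction concrete. The paper's exact SAS definition may differ, and the tolerance-based formulation here is an assumption.

```python
# One plausible speed-adherence metric (the paper's exact SAS may differ);
# the relative tolerance is an assumption used only for illustration.
import numpy as np

def speed_adherence_score(actual_speeds, target_speed, tolerance=0.1):
    """Fraction of timesteps where the ego speed stays within a relative tolerance
    of the user-specified target; range [0, 1], higher is better."""
    actual = np.asarray(actual_speeds, dtype=np.float64)
    within = np.abs(actual - target_speed) <= tolerance * target_speed
    return float(within.mean())

# Example: a policy that ignores a 30 km/h request and drives ~50 km/h scores 0.
print(speed_adherence_score([48.0, 50.0, 52.0, 49.0], target_speed=30.0))  # 0.0
```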
Sicheng Zuo, Yuxuan Li, Wenzhao Zheng, Zheng Zhu, Jie Zhou
Core Contributions
Prior language-augmented driving models use language only for scene description or chain-of-thought reasoning; Vega is the first to enable genuine instruction following, where a user can specify behavioral preferences ("drive cautiously near pedestrians," "maintain lane discipline") and the vehicle adapts accordingly.
Introduces InstructScene, a 100K-scene dataset annotated with diverse driving instructions paired with corresponding trajectories, addressing the fundamental data gap that prevents instruction-conditioned driving research.
Vega's Vision-Language-World-Action architecture jointly predicts future world states alongside actions: imagining the consequences of following an instruction before executing improves instruction adherence compared to direct action prediction.
The world model component acts as an instruction-to-consequence reasoner: if the predicted future violates traffic rules or safety constraints, the policy can revise the instruction interpretation before committing to an action.
Points toward autonomous vehicles that function as driving partners adapting to individual preferences rather than applying a fixed optimization objective, a necessary step toward widespread consumer adoption.
Vision-language-action models have reshaped autonomous driving to incorporate languages into the decision-making process. However, most existing pipelines only utilize the language modality for scene descriptions or reasoning and lack the flexibility to follow diverse user instructions for personalized driving. To address this, we first construct a large-scale driving dataset (InstructScene) containing around 100,000 scenes annotated with diverse driving instructions with the corresponding trajectories. We then propose a unified Vision-Language-World-Action model, Vega, for instruction-based generalized autonomous driving that jointly models world states and actions conditioned on natural language driving instructions.
Unlike MARL self-driving methods that treat other vehicles as dynamic obstacles in the environment, COIN explicitly models pairwise interaction intentions: each agent maintains predictions of what it expects other agents to do and plans accordingly, enabling emergent cooperative negotiation.
The interaction-aware representation is especially valuable in dense urban intersections where agents must implicitly coordinate right-of-way, merges, and lane changes without explicit V2V communication protocols.
Demonstrates significant improvement in both safety (collision rate reduction) and efficiency (intersection throughput) over non-interaction-aware multi-agent baselines, with the largest gains in the highest-density scenarios where interaction complexity peaks.
Advances toward fleets of autonomous vehicles that can coordinate safely through learned implicit signaling rather than requiring standardized V2X communication infrastructure, reducing deployment barriers.
The MASD framing (Multi-Agent Self-Driving) positions this work beyond single-vehicle autonomy toward the cooperative fleet intelligence needed for future smart city traffic management.
Multi-Agent Self-Driving (MASD) systems provide an effective solution for coordinating autonomous vehicles to reduce congestion and enhance both safety and operational efficiency in future intelligent transportation systems. Multi-Agent Reinforcement Learning (MARL) has emerged as a promising approach for developing advanced end-to-end MASD systems. However, achieving efficient and safe collaboration in dynamic MASD systems remains a significant challenge in dense scenarios with complex agent interactions. To address this challenge, we propose COIN, a collaborative interaction-aware multi-agent reinforcement learning framework that explicitly models pairwise interaction intentions to enable safer and more efficient autonomous driving in dense urban environments.
Applies the DeepSeek-R1 RL framework (GRPO with outcome-based rewards) to traffic simulation, framing multi-agent vehicle trajectory generation as a reasoning/exploration problem rather than supervised next-token imitation, enabling the model to discover diverse but realistic motion patterns.
Unlike SFT-based simulators that only fit observed trajectories, the RL training objective rewards realism over extended rollouts, forcing the model to maintain plausible traffic dynamics beyond the immediate next step and addressing the distribution shift that makes SFT simulators brittle at long horizons.
Entropy-guided exploration: high-uncertainty motion token regions signal that the model should explore more, directing RL rollouts toward rare but safety-critical scenarios that SFT under-represents due to their low frequency in data.
The tokenized representation connects traffic simulation to the LLM scaling ecosystem: the same training infrastructure, scaling laws, and data mixing strategies that benefit large language models become directly applicable to traffic scenario generation.
Better simulation quality and diversity directly improve AV evaluation: if safety-critical rare events are underrepresented in simulation, evaluation metrics overestimate real-world performance; this paper attacks that gap from the simulator side.
Learning diverse and high-fidelity traffic simulations from human driving demonstrations is crucial for autonomous driving evaluation. The recent next-token prediction (NTP) paradigm has been applied to traffic simulation and achieves iterative improvements via supervised fine-tuning (SFT). However, such methods limit active exploration of potentially valuable motion tokens, particularly in suboptimal regions. Entropy patterns provide a promising perspective for enabling exploration driven by motion token uncertainty. Motivated by this insight, we propose an R1-style tokenized traffic simulation model that uses GRPO-based reinforcement learning to improve simulation realism and diversity beyond what SFT can achieve.
Xibei Chen, Yifeng Zhang, Yuxiang Xiao, Mingfeng Fan, Maonan Wang
Core Contributions
City-scale traffic signal control faces two conflicting requirements: specialization (different intersections have different topology and demand patterns) and generalization (a deployable system must work on intersections it hasn't seen). CROSS resolves this with a Mixture-of-Experts RL architecture where each expert specializes while a shared routing network selects the appropriate expert at each intersection.
Unlike prior ATSC approaches that either train one policy per intersection type (expensive, doesn't generalize) or use a single shared policy (generalizes poorly to rare topologies), MoE achieves both in a single model with controlled parameter growth.
Demonstrates generalization to intersection topologies and demand patterns not seen during training, a critical requirement for real city deployments where new intersections are added and traffic patterns shift seasonally.
The MoE + RL combination is directly applicable beyond traffic signals to any distributed coordination problem with heterogeneous subtask types, including warehouse robot fleets serving different zones and multi-robot inspection across varied environments.
Complements the R1-style traffic simulation paper (paper 16): better simulators enable better ATSC training; better ATSC policies motivate better simulators. These two papers together advance both sides of the AV/smart-city evaluation loop.
Recent advances in robotics, automation, and artificial intelligence have enabled urban traffic systems to operate with increasing autonomy. Adaptive traffic signal control (ATSC) dynamically optimizes signal phases to mitigate congestion. However, achieving effective and generalizable large-scale ATSC remains a significant challenge due to the diverse intersection topologies and highly dynamic traffic demand patterns. We present CROSS, a Mixture-of-Experts reinforcement learning framework that achieves both specialization across intersection types and generalization to unseen configurations, demonstrating improved traffic flow on city-scale networks.
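A Mixture-of-Experts policy head of the kind described can be sketched as follows. The routing features, expert count, and Q-value-style output are illustrative assumptions rather than CROSS's architecture; the sketch only shows how a shared router lets one model specialize per intersection while remaining a single deployable policy.

```python
# Illustrative Mixture-of-Experts policy head for signal control; the routing
# features, expert count, and Q-value output are assumptions, not CROSS's design.
import torch
import torch.nn as nn

class MoESignalPolicy(nn.Module):
    def __init__(self, obs_dim, num_phases, num_experts=4, hidden=128):
        super().__init__()
        self.router = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, num_experts))
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, num_phases))
            for _ in range(num_experts)])

    def forward(self, intersection_obs):
        # The router softly assigns each intersection to specialized experts,
        # so one shared model covers heterogeneous topologies and demand patterns.
        weights = self.router(intersection_obs).softmax(dim=-1)                     # (B, E)
        q_values = torch.stack([e(intersection_obs) for e in self.experts], dim=1)  # (B, E, P)
        return (weights.unsqueeze(-1) * q_values).sum(dim=1)                        # (B, P) phase values
```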
Observes that near-term and far-term trajectory segments have qualitatively different constraint structures: near-term plans are dominated by instantaneous vehicle dynamics (physics-constrained), while far-term plans are dominated by navigational goals (topology-constrained); treating them identically in one diffusion process is a category error.
Introduces a "noise-as-mask" paradigm with independent noise schedules per temporal segment: near-term segments are denoised with high confidence early in the diffusion process while far-term segments retain uncertainty longer, preserving future optionality during planning.
Unlike standard diffusion planners that refine the entire trajectory uniformly across denoising steps, TDDM's decoupled schedule enables rapid commitment to safe near-term actions while continuing to explore far-term alternatives, which is more aligned with human planning intuition; a minimal per-segment schedule is sketched after the abstract below.
Achieves better long-horizon trajectory quality (reaching distant navigational goals) without compromising short-horizon safety, validating that temporal structure in planning objectives should be reflected in the generative model architecture.
The noise-as-mask idea is general: any temporal planning problem where different lookahead horizons have different uncertainty profiles could benefit from this decoupled diffusion formulation, extending beyond driving to legged locomotion planning and manipulation with long action sequences.
Motion planning in dynamic urban environments requires balancing immediate safety with long-term goals. While diffusion models effectively capture multi-modal decision-making, existing approaches treat trajectories as monolithic entities, overlooking heterogeneous temporal dependencies where near-term plans are constrained by instantaneous dynamics and far-term plans by navigational goals. To address this, we propose Temporally Decoupled Diffusion Model (TDDM), which reformulates trajectory generation via a noise-as-mask paradigm. By partitioning trajectories into segments with independent noise schedules, TDDM enables confident near-term planning while preserving long-horizon flexibility.
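The noise-as-mask idea reduces to giving each trajectory segment its own denoising schedule. The sketch below is an assumed, simplified schedule (linear, two segments) rather than TDDM's actual formulation; it only illustrates how near-term waypoints can reach zero noise while far-term waypoints are still being explored.

```python
# Assumed, simplified "noise-as-mask" schedule: near-term waypoints finish
# denoising halfway through the process, far-term waypoints only at the end.
import numpy as np

def segment_noise_levels(horizon, split, step, total_steps):
    """Per-waypoint noise level in [0, 1] at one denoising step.
    Waypoints before `split` are near-term; the rest are far-term."""
    t = step / total_steps                 # global denoising progress in [0, 1]
    near = max(0.0, 1.0 - 2.0 * t)         # committed (zero noise) by t = 0.5
    far = max(0.0, 1.0 - t)                # keeps uncertainty until t = 1.0
    levels = np.empty(horizon)
    levels[:split] = near
    levels[split:] = far
    return levels
```

At step 4 of 8, for example, the first `split` waypoints are already fully denoised while the remaining waypoints still carry half of their noise, which is exactly the "commit near, explore far" behavior described above.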
Masoud Moghani, Mahdi Azizian, Animesh Garg, Yuke Zhu, Sean Huver
Core Contributions
Extends the MimicGen data synthesis paradigm, which successfully scales rigid-object manipulation data, to deformable objects (cloth, foam, flexible tubes), where state representations are high-dimensional and contact dynamics are non-trivial to simulate consistently.
The core challenge isn't just simulation fidelity but demonstration transfer: deformable objects deform differently under each grasp, so naive trajectory replication from seed demonstrations fails; SoftMimicGen develops deformation-aware transfer that accounts for how geometry changes affect execution.
Demonstrates that a small seed dataset of real demonstrations can be amplified into a large synthetic training corpus that substantially reduces real-world data requirements, directly attacking the primary cost bottleneck in robot learning pipelines for soft manipulation.
Opens data synthesis to manipulation domains previously considered too hard to simulate: surgical tool handling (deformable tissue), textile folding (cloth), and cable routing (flexible tubing) all become tractable targets for synthetic data generation.
The broader implication parallels SoftMimicGen's rigid-object predecessor: once synthetic data at scale becomes available for a domain, the sample efficiency and generalization of learned policies improve dramatically, potentially enabling deployment from days rather than months of data collection.
Large-scale robot datasets have facilitated the learning of a wide range of robot manipulation skills, but these datasets remain difficult to collect and scale further, owing to the intractable amount of human time, effort, and cost required. Simulation and synthetic data generation have proven to be an effective alternative, especially with recent work showing that such synthetic datasets can dramatically reduce real-world data requirements and facilitate generalization to novel scenarios. However, this paradigm has been limited to rigid object manipulation. SoftMimicGen extends data synthesis to deformable object manipulation, demonstrating scalable robot learning for cloth, foam, and flexible object manipulation scenarios.
Jai Bardhan, Patrik Drozdik, Josef Sivic, Vladimir Petrik
Core Contributions
Identifies a concrete failure mode in action-conditioned robot world models: models trained with short-horizon prediction objectives degrade rapidly during autoregressive rollout because each frame's prediction errors become the next frame's context, causing compounding visual degradation.
Applies RL post-training to video world models, using a stability reward that penalizes drift from realistic visual statistics over extended multi-step rollouts; the approach is analogous to RLHF for language models but targets visual realism over long action sequences (a rollout-reward sketch follows the abstract below).
The RL approach improves without requiring any ground-truth long-horizon video data: the stability reward is self-supervised, trained against a discriminator that learns what realistic frames look like from short demonstrations.
Demonstrates that RL-trained world models maintain visual quality for 3× longer rollouts than SFT baselines, enabling more reliable data augmentation and planning pipelines that depend on extended model-based rollouts.
Draws an explicit structural parallel to LLM alignment: the instability that causes language models to go off-rails in long generations is the same distribution shift problem that degrades video world models, and RL-based correction is the appropriate tool in both cases.
Action-conditioned robot world models generate future video frames of the manipulated scene given a robot action sequence, offering a promising alternative for simulating tasks that are difficult to model with traditional physics engines. However, these models are optimized for short-term prediction and break down when deployed autoregressively: each predicted clip feeds back as context for the next, causing errors to compound and visual quality to rapidly degrade. We address this through an RL post-training scheme that applies a stability reward over extended multi-step rollouts, improving visual quality persistence by 3Γ compared to SFT baselines.
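A rollout-level stability reward of the kind described might look like the following sketch. `world_model.predict` and `discriminator` are hypothetical interfaces, and scoring every generated frame with a learned realism discriminator is an assumed reward design, not necessarily the paper's exact one.

```python
# Illustrative rollout-level stability reward; `world_model.predict` and
# `discriminator` are hypothetical interfaces, and the reward design is assumed.
import torch

@torch.no_grad()
def stability_reward(world_model, discriminator, context_frames, actions):
    """Roll the world model out autoregressively and reward it for keeping every
    generated frame close to realistic visual statistics."""
    frames, context = [], context_frames          # context: (T, C, H, W)
    for action in actions:
        next_frame = world_model.predict(context, action)
        frames.append(next_frame)
        # The model's own prediction becomes part of the next context window,
        # exactly the feedback loop that makes SFT-only models drift.
        context = torch.cat([context[1:], next_frame.unsqueeze(0)], dim=0)
    realism = torch.stack([discriminator(f) for f in frames])   # scores in [0, 1]
    return realism.mean()                         # scalar reward for RL post-training
```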
Unlike standard robot motor learning that uses simplified joint-torque actuators, MuscleMimic uses physiologically realistic muscle-tendon units (126 muscles for upper body, 416 muscles for full body), capturing energy storage, force-velocity relationships, and muscle co-contraction absent in torque models.
Provides two validated musculoskeletal embodiments and a SMPL-format motion capture retargeting pipeline as open-source infrastructure, filling a critical tooling gap: validated full-body musculoskeletal models for imitation learning are currently unavailable to the community.
Shows that motion imitation learning is tractable with muscle actuation at scale despite the dramatically higher control dimensionality, overturning the common assumption that muscle-driven simulation is too slow or unstable for RL training on complex tasks.
The upper-body model (126 muscles) targets bimanual manipulation: understanding how muscles govern dexterous hand-arm coordination could inspire actuator and control designs for next-generation dexterous robot hands that go beyond current motor-joint paradigms.
Bridges computational neuroscience and robot learning, potentially enabling biomechanical insights (muscle synergies, impedance modulation) to transfer into robot control laws, a direction largely unexplored but with significant potential for energy-efficient and robust locomotion.
Learning motor control for muscle-driven musculoskeletal models is hindered by the computational cost of biomechanically accurate simulation and the scarcity of validated, open full-body models. Here we present MuscleMimic, an open-source framework for scalable motion imitation learning with physiologically realistic, muscle-actuated humanoids. MuscleMimic provides two validated musculoskeletal embodiments, a fixed-root upper-body model (126 muscles) for bimanual manipulation and a full-body model (416 muscles) for locomotion, together with a retargeting pipeline that maps SMPL-format motion capture data to muscle activations.
Rejects the false dichotomy between Bayesian navigation (principled uncertainty but hand-crafted action selection) and deep RL (adaptive policy but implicit uncertainty): the hybrid maintains a live spatial belief map updated via Bayesian inference from calibrated detections, then feeds this probabilistic state directly to a trained RL policy (a minimal belief-update sketch follows the abstract below).
The belief map acts as an explicit uncertainty signal: the RL policy receives structured probability distributions over target locations rather than raw pixels, dramatically reducing the representation complexity the policy must learn to interpret.
Calibrated object detection is key: miscalibrated confidence scores corrupt the Bayesian update and degrade belief map quality; explicit calibration connects the computer vision and probabilistic navigation components properly.
Evaluated in Habitat 3.0 across two indoor environments, the method improves success rate while reducing total search effort (path length, number of actions) compared to pure Bayesian and pure RL baselines.
Addresses a real deployment challenge: indoor object search under sensor noise and occlusion requires both principled uncertainty reasoning (which pure RL performs implicitly and imperfectly) and adaptive action policies (which classical planners cannot provide); the two components are genuinely complementary, not redundant.
Autonomous object search is challenging for mobile robots operating in indoor environments due to partial observability, perceptual uncertainty, and the need to trade off exploration and navigation efficiency. Classical probabilistic approaches explicitly represent uncertainty but typically rely on handcrafted action-selection heuristics, while deep reinforcement learning enables adaptive policies but often suffers from slow convergence and limited interpretability. This paper proposes a hybrid object-search framework that integrates Bayesian inference with deep reinforcement learning. The method maintains a spatial belief map over target locations, updated online through Bayesian inference from calibrated object detections, and trains a reinforcement learning policy to select navigation actions directly from this probabilistic representation. The approach is evaluated in realistic indoor simulation using Habitat 3.0 and compared against developed baseline strategies.
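The Bayesian bookkeeping that feeds the RL policy can be illustrated with a standard log-odds update over a belief grid. This is a generic sensor-model sketch, not the paper's exact formulation: treating the calibrated detection score as a true-positive rate and using a fixed false-positive rate are simplifying assumptions.

```python
# Generic log-odds belief update from a calibrated detection; the sensor model
# below (detection score as true-positive rate, fixed false-positive rate) is a
# simplifying assumption, not the paper's exact formulation.
import numpy as np

def update_belief(log_odds, observed_cells, detection_score, false_positive_rate=0.05):
    """Update target-location log-odds over the grid cells currently in view.
    Calibration matters: a score of 0.8 should mean the target really is there
    about 80% of the time, otherwise this update systematically misleads the policy."""
    p = float(np.clip(detection_score, 1e-3, 1.0 - 1e-3))
    fp = false_positive_rate
    if p > 0.5:   # detector fired: raise belief by the log-likelihood ratio
        update = np.log(p / fp)
    else:         # no confident detection: lower belief in the observed cells
        update = np.log((1.0 - p) / (1.0 - fp))
    log_odds[observed_cells] += update
    return log_odds

def belief_map(log_odds):
    return 1.0 / (1.0 + np.exp(-log_odds))   # the probabilistic state the RL policy reads
```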
Judith Treffler, Vladimír Kubelka, Henrik Andreasson, Martin Magnusson
Core Contributions
Extends Neural Radiance Fields from optical sensors to 3D radar, a sensor with unique all-weather advantages (fog, smoke, dust penetration) but distinct challenges: sparse, noisy point clouds and radar cross-section physics fundamentally different from visible light reflectance.
Jointly reconstructs scene geometry AND view-dependent radar intensities in a single implicit representation, enabling novel-viewpoint radar return prediction, which is useful for simulation augmentation and training data generation for adverse-weather AV evaluation.
The view-dependent radar intensity model captures specular reflectance that depends on material, incidence angle, and polarization; this is significantly more complex than the Lambertian assumptions that often suffice for camera NeRFs and requires physically grounded radar signal modeling.
Uses a memory-efficient implicit representation that scales to the spatial ranges radar operates at (tens to hundreds of meters), making it practical for outdoor mapping scenarios unlike prior radar reconstruction methods that require dense point clouds.
Closes a critical gap for all-weather autonomous robots: current scene reconstruction pipelines degrade when cameras and LiDAR fail in low-visibility conditions, but this work enables NeRF-quality reconstructions from the one sensor that remains reliable in those exact conditions.
Robust scene representation is essential for autonomous systems to safely operate in challenging low-visibility environments. Radar has a clear advantage over cameras and lidars in these conditions due to its resilience to environmental factors such as fog, smoke, or dust. However, radar data is inherently sparse and noisy, making reliable 3D surface reconstruction challenging. To address these challenges, we propose a neural implicit approach for 3D mapping from radar point clouds, which jointly models scene geometry and view-dependent radar intensities. Our method leverages a memory-efficient implicit representation to achieve accurate surface and reflectance modelling from sparse radar data.
Yanmei Jiao, Anpeng Lu, Wenhan Hu, Rong Xiong, Yue Wang
Core Contributions
Identifies a specific failure mode in topological-map navigation: local decisions are made from current egocentric observations alone, ignoring that the topological graph encodes global structure beyond the field of view, so when obstacles block direct progress the agent persists along wrong directions without global correction.
"Topological intent" is a directional signal derived continuously from the goal-directed topology (which graph nodes lead toward the target), injected into the local reactive controller to bias actions toward globally consistent directions even when local perception is ambiguous.
Unlike classical hybrid navigation that triggers a full global replan when stuck, IntentReact propagates the intent signal through the local controller continuously, producing smoother, more purposeful motion without the latency and discontinuity of replanning events; a minimal heading-bias sketch follows the abstract below.
The decoupling of global intent (from topology) and local reactivity (from perception) is computationally efficient: the intent signal is a low-dimensional directional bias, not a full map update, enabling tight integration with fast reactive controllers.
Demonstrates improved success rate on long-horizon object-goal navigation benchmarks, with the largest gains on tasks requiring directional persistence through cluttered environments where local myopia causes the most navigational failures.
Object-goal visual navigation requires robots to reason over semantic structure and act effectively under partial observability. Recent approaches based on object-level topological maps enable long-horizon navigation without dense geometric reconstruction, but their execution remains limited by the gap between global topological guidance and local perception-driven control. In particular, local decisions are made solely from the current egocentric observation, without access to information beyond the robot's field of view. IntentReact addresses this by deriving a topological intent signal from the goal graph and injecting it into the reactive controller, improving navigation success on long-horizon object-goal tasks.
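A minimal version of the heading-bias idea is sketched below: compute the next topological node on a path to the goal, and blend its bearing into the reactive controller's preferred heading. The BFS over an adjacency dictionary and the fixed blending weight are illustrative assumptions, not IntentReact's implementation.

```python
# Illustrative topological-intent bias: find the next graph node on a path to the
# goal and nudge the reactive controller's heading toward it.
import math
from collections import deque

def next_node_toward_goal(adjacency, current, goal):
    """Breadth-first search over the topological graph; return the first hop."""
    if current == goal:
        return goal
    parent, frontier = {current: None}, deque([current])
    while frontier:
        node = frontier.popleft()
        if node == goal:
            while parent[node] != current:    # backtrack to the first hop
                node = parent[node]
            return node
        for neighbor in adjacency[node]:
            if neighbor not in parent:
                parent[neighbor] = node
                frontier.append(neighbor)
    return None                               # goal not reachable in the graph

def biased_heading(robot_xy, node_xy, reactive_heading, intent_weight=0.4):
    """Blend the locally preferred heading with the bearing of the next graph node,
    so the intent nudges (rather than overrides) local obstacle avoidance."""
    intent = math.atan2(node_xy[1] - robot_xy[1], node_xy[0] - robot_xy[0])
    delta = math.atan2(math.sin(intent - reactive_heading),
                       math.cos(intent - reactive_heading))   # wrap-aware difference
    return reactive_heading + intent_weight * delta
```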
Roman Kueble, Marco Hueller, Mrunmai Phatak, Rainer Lienhart, Joerg Haehner
Core Contributions
Reframes active semantic mapping as a task-directed RL problem: instead of maximizing map coverage or observation diversity, the reward directly measures semantic scene graph (SSG) quality (node completeness and relation accuracy) within a finite action budget.
Unlike coverage-based exploration that optimizes a proxy metric (visit many locations), this approach optimizes the actual downstream objective (build a useful semantic graph), closing the gap between exploration strategy and the purpose the map serves.
Integrates modern deep RL components (attention-based policies, large observation spaces) into the established SSG literature, bridging two communities that have developed separately despite addressing complementary aspects of the same problem.
Demonstrates better semantic scene graph quality per unit action budget than generic exploration baselines, validating that task-specific exploration rewards outperform heuristics that ignore what the map will ultimately be used for.
Positioned within Organic Computing, this work supports objective-driven self-adaptation: an agent that builds its own world model based on what it will need to know, rather than passively mapping everything equally, is a prerequisite for resource-efficient autonomy.
Semantic world models enable embodied agents to reason about objects, relations, and spatial context beyond purely geometric representations. Semantic scene graphs (SSGs) provide a structured and compact representation for this purpose. However, constructing them within a finite action horizon requires exploration strategies that trade off coverage against informativeness. We present a modernized RL-based navigation system that uses task-directed rewards targeting SSG quality directly, rather than generic coverage proxies, achieving better semantic graph completeness per action budget than prior exploration baselines.
Identifies that even theoretically optimal fixed-interval smoothers (which have access to the full measurement sequence) leave a persistent position error because they inherit the systematic bias of raw GNSS measurements, a model-based gap that no amount of filter tuning can close.
BLENDS trains a deep network to predict and correct GNSS positional bias using the full trajectory context available during post-processing, exploiting offline access to future measurements in a way that real-time estimators structurally cannot.
Unlike black-box deep learning smoothers, BLENDS integrates a principled Bayesian framework: the learned corrections come with calibrated uncertainty estimates, maintaining the interpretability and consistency guarantees that practitioners depend on for survey-grade positioning.
The deep smoothing formulation uses both past and future measurements simultaneously (fixed-interval, non-causal), giving it access to the bidirectional context that forward-only Kalman filters miss; it directly exploits the offline setting rather than treating it as a computational convenience.
Particularly valuable for UAV surveys, mobile mapping vehicles, and precision agriculture where post-processed navigation quality determines deliverable accuracy and real-time latency is irrelevant.
Accurate post-processing navigation is essential for applications such as survey and mapping, where the full measurement history can be exploited to refine past state estimates. Fixed-interval smoothing algorithms represent the theoretically optimal solution under Gaussian assumptions. However, loosely coupled INS/GNSS systems fundamentally inherit the systematic position bias of raw GNSS measurements, leaving a persistent accuracy gap that model-based smoothers cannot resolve. To address this limitation, we propose BLENDS, which integrates Bayesian learning with deep smoothing to enhance navigation accuracy by learning to correct GNSS position bias using bidirectional trajectory context.
Jointly solves three problems that drone fleet planning pipelines typically treat as sequential stages: task allocation (which drone serves which goal), tour sequencing (in what order), and trajectory generation (how to fly safely), showing that decoupled approaches discard critical cross-problem interaction information.
The joint formulation matters concretely: an allocation that looks optimal in isolation may produce infeasible trajectories in obstacle-dense 3D environments, or require detours that make a different allocation clearly superior; such effects are invisible to sequential planners.
Introduces IMD-TAPP with a principled 3D discretization that makes the joint optimization tractable for multi-drone quadrotor teams while satisfying dynamic feasibility constraints (maximum velocity, acceleration bounds).
Demonstrates that end-to-end joint planning reduces total mission time and collision events compared to sequential allocation+planning pipelines in cluttered 3D scenarios, with gains that increase with environment density.
Directly applicable to urban drone delivery (canyons with buildings), search-and-rescue (rubble and debris fields), and warehouse inventory inspection (racks and infrastructure), where obstacle density makes trajectory feasibility tightly coupled with allocation decisions.
Coordinating teams of aerial robots in cluttered three-dimensional (3D) environments requires a principled integration of discrete mission planning (deciding which robot serves which goals and in what order) with continuous-time trajectory synthesis that enforces collision avoidance and dynamic feasibility. This paper introduces IMD-TAPP (Integrated Multi-Drone Task Allocation and Path Planning), an end-to-end framework that jointly addresses multi-goal allocation, tour sequencing, and safe trajectory generation for quadrotor teams operating in obstacle-rich spaces, outperforming sequential planning approaches on mission efficiency and safety metrics.
Junkai Jiang, Yitao Xu, Ruochen Li, Shaobing Xu, Jianqiang Wang
Core Contributions
Addresses CTS-MAPF (teams of robots completing sequences of tasks while avoiding collisions), where the combinatorial complexity of joint task sequencing and collision-free path planning makes naive approaches computationally intractable for large fleets.
First innovation: a lock-agent detection and release mechanism that identifies when agents enter deadlock configurations (all blocked, none can proceed) and triggers targeted local replanning to break the deadlock without global replanning overhead; a toy wait-for-graph check is sketched after the abstract below.
Second innovation: anytime refinement via Large Neighborhood Search (LNS), where the system immediately returns a feasible plan and continues improving it in the background, so robots can begin moving while the planner optimizes; this is critical for dynamic environments where waiting for the optimal plan is worse than acting with a good one.
The "anytime" property is practically important for deployment: in warehouses or factories where task lists arrive dynamically, agents need a good-enough plan now rather than an optimal plan after a planning delay.
Demonstrates significantly lower makespan (total task completion time) vs. prior CTS-MAPF methods in dense 25-agent scenarios, with the largest improvement in the highest-density configurations where deadlocks are most frequent.
The Collaborative Task Sequencing and Multi-Agent Path Finding (CTS-MAPF) problem requires agents to accomplish sequences of tasks while avoiding collisions, posing significant challenges due to its combinatorial complexity. This work introduces CTS-PLL, a hierarchical framework that extends the configuration-based CTS-MAPF planning paradigm with two key enhancements: a lock-agent detection and release mechanism leveraging a complete planning method for local re-planning, and an anytime refinement procedure based on Large Neighborhood Search (LNS). These additions ensure robustness in dense environments and significantly reduce makespan compared to prior methods.
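Lock-agent detection can be illustrated with a wait-for graph: if the chain of "who is blocking whom" closes into a cycle, those agents can never proceed without intervention. The sketch below is a toy version of that check, not CTS-PLL's actual mechanism.

```python
# Toy wait-for-graph check for lock agents: any agent on a "who blocks whom" cycle
# cannot proceed without intervention. Not CTS-PLL's actual mechanism.
def find_locked_agents(waiting_on):
    """`waiting_on[a]` is the agent currently blocking `a` (omit `a` if it can move).
    Returns the set of agents that sit on a wait cycle."""
    locked = set()
    for start in waiting_on:
        seen, node = [], start
        while node is not None and node not in seen:
            seen.append(node)
            node = waiting_on.get(node)
        if node is not None:                       # the chain closed into a cycle
            locked.update(seen[seen.index(node):])
    return locked

# Agents 1-3 block each other in a ring; agent 4 waits on 1 but is not on the cycle.
print(find_locked_agents({1: 2, 2: 3, 3: 1, 4: 1}))   # {1, 2, 3}
```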
Haruki Kawase, Taiga Sugawara, A. Daniel Carnerero
Core Contributions
Unlike standard coverage control that treats all locations as equally valuable, the dissimilarity map derived from a kriging model weights positions by how much a measurement there would reduce prediction uncertainty, a principled information-theoretic approach to sensor placement; a kriging-variance sketch follows the abstract below.
"Persistent" coverage means robots continuously reposition as irradiance patterns shift with cloud movement and sun angle, rather than converging to a fixed configuration, creating a closed-loop adaptive monitoring system that responds to what has already been observed.
The kriging-based dissimilarity metric is computed from current robot measurements and updated as new data arrives, creating a feedback loop where coverage objectives evolve based on the information gap at each moment.
Demonstrates improved solar irradiance prediction accuracy compared to both static sensor placement and uniform coverage strategies with the same number of robots, directly translating multi-robot coordination quality into quantifiable energy system performance gains.
Establishes a template for adaptive environmental monitoring: any spatial process that can be modeled by kriging (soil moisture, air quality, temperature gradients) could benefit from this information-driven persistent coverage formulation.
Accurate forecasting of future solar irradiance is essential for the effective control of solar thermal power plants. Although various kriging-based methods have been proposed to address the prediction problem, these methods typically do not provide an appropriate sampling strategy to dynamically position mobile sensors for optimizing prediction accuracy in real time. This paper introduces a dissimilarity map derived from a kriging model and proposes a persistent coverage control algorithm that effectively guides a team of robots to continuously reposition for maximum irradiance prediction improvement, outperforming static and uniform coverage baselines.
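The information-driven weighting can be illustrated with the prediction variance of a simple kriging / Gaussian-process model: locations where the variance is high are where a new measurement helps most. The exponential covariance and simple-kriging form below are assumptions for illustration, not the paper's dissimilarity definition.

```python
# Kriging / Gaussian-process prediction variance as an "information gap" map;
# covariance choice and simple-kriging form are illustrative assumptions.
import numpy as np

def kriging_variance(sample_xy, query_xy, sigma2=1.0, length=50.0, nugget=1e-6):
    """Prediction variance at query points given the robots' current sample locations.
    High variance marks where the next measurement is most informative."""
    def cov(a, b):
        dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
        return sigma2 * np.exp(-dist / length)
    K = cov(sample_xy, sample_xy) + nugget * np.eye(len(sample_xy))   # (n, n)
    k = cov(sample_xy, query_xy)                                      # (n, m)
    solved = np.linalg.solve(K, k)
    return sigma2 - np.einsum("ij,ij->j", k, solved)                  # variance per query point
```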
Yifei Li, Ruizhe Fu, Huihang Liu, Guha Manogharan, Feng Ju
Core Contributions
Removes the fundamental constraint of fixed-location 3D printers by enabling Mobile AM Robots (MAMbots) to navigate to the workpiece and fabricate in-situ, opening up structures too large for any fixed printer envelope, a paradigm shift for manufacturing large-scale or spatially distributed components.
Unlike prior mobile AM work that treats navigation and printing as separate sequential phases, this system co-optimizes obstacle-aware movement paths with the print deposition process, ensuring the robot reaches each print location in the correct orientation while maintaining required standoff distances.
The framework couples path planning with the deposition process model: constraints on nozzle angle, travel speed, and print bead geometry are incorporated into the navigation objective, not treated as post-hoc corrections.
Enables manufacturing scenarios previously impossible or impractical: construction-scale concrete printing, aerospace composite repair in-situ on large fuselages, or ship hull maintenance without drydocking.
Demonstrates that robotics navigation algorithms can directly improve manufacturing flexibility, a cross-domain transfer with significant industrial relevance as mass customization demand grows.
As the demand for mass customization increases, manufacturing systems must become more flexible and adaptable. Additive manufacturing (AM) enhances production adaptability by enabling on-demand fabrication of customized components directly from digital models, but its flexibility remains constrained by fixed equipment layouts. Integrating mobile robots addresses this limitation by allowing manufacturing resources to move and adapt to changing production requirements. Mobile AM Robots (MAMbots) combine AM with mobile robotics to produce and transport components. This paper presents a framework for intelligent navigation and obstacle-aware fabrication that co-optimizes robot trajectories and print deposition for in-situ additive manufacturing.
Davide Tebaldi, Niccolò Paradisi, Fabio Pini, Luigi Biagiotti
Core Contributions
Exploits the kinematic redundancy of mobile manipulators (more degrees of freedom than required for the end-effector task) specifically for energy minimization during physical human-robot interaction, unlike prior redundancy resolution methods that minimize joint velocity or manipulability without considering actual electrical energy consumption; the general null-space scheme is sketched after the abstract below.
Formulates energy-optimal posture control as a null-space task: the energy minimization uses only the redundant degrees of freedom that don't affect the end-effector trajectory, so energy savings come without any compromise to interaction compliance or safety.
Energy efficiency in pHRI is not a niche concern: assistive and service robots operate for hours on battery and must complete long-duration tasks; reducing electrical consumption directly extends operational autonomy between charges.
Evaluated in human-robot interaction scenarios where the robot must comply with external contact forces while adjusting posture, precisely the setting where null-space motion is physically meaningful and energy profiling across joints matters most.
Bridges the control theory and physical HRI communities: most pHRI research focuses on impedance/admittance control for safety and transparency without addressing energy budget, a gap this paper directly fills with rigorous energy minimization formulation.
Research on mobile manipulation systems that physically interact with humans has expanded rapidly in recent years. Within this context, developing suitable control methodologies is essential since mobile manipulators introduce additional degrees of freedom, making the design of control approaches more challenging and more prone to performance optimization. This paper proposes a control approach for a mobile manipulator with the objective of minimizing electrical energy consumption during physical human-robot interaction, exploiting kinematic redundancy through null-space control without compromising end-effector task performance or interaction safety.
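The general null-space scheme the bullets describe is the classical redundancy-resolution law: track the end-effector velocity with the Jacobian pseudoinverse, and spend the remaining degrees of freedom descending an energy gradient. The sketch below shows that structure; the gain and the energy model itself are placeholders, not the paper's identified motor model.

```python
# Classical null-space redundancy resolution with an energy-gradient secondary task;
# the gain and the energy model are placeholders, not the paper's identified model.
import numpy as np

def redundant_velocity_command(J, xdot_task, energy_gradient, k_energy=0.5):
    """Primary task: track the commanded end-effector velocity `xdot_task`.
    Secondary task: descend the electrical-energy gradient inside the Jacobian's
    null space, so posture optimization never perturbs the end-effector motion."""
    J_pinv = np.linalg.pinv(J)                        # (n_joints, task_dim)
    null_projector = np.eye(J.shape[1]) - J_pinv @ J  # projects onto the null space of J
    qdot_secondary = -k_energy * energy_gradient      # move toward lower consumption
    return J_pinv @ xdot_task + null_projector @ qdot_secondary
```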
Addresses a fundamental mismatch in VR-based teleoperation: operators control motion intuitively via 6-DOF controllers, but receive no indication of contact forces, making contact-rich tasks (peg insertion, surface wiping, assembly) difficult to perform safely without haptic hardware.
The AR overlay renders the impedance controller's virtual target pose as a semi-transparent ghost arm, with the spatial displacement between the ghost and the actual arm visually encoding contact force magnitude and direction, transforming an invisible internal controller state into an intuitive perceptual cue (the underlying relation is sketched below).
Unlike haptic feedback systems that require a specialized exoskeleton or force-feedback controllers costing thousands of dollars, this approach runs on commodity AR glasses and uses only the controller's existing internal state variables; no additional sensing hardware is needed.
User study demonstrates significant improvement in task completion time and force regulation accuracy for insertion tasks compared to motion-only teleoperation, validating that visual force encoding partially compensates for absent tactile feedback.
Establishes a practical path for deploying capable contact-rich teleoperation systems at commodity cost: if visual force feedback can substitute for haptic feedback on many manipulation tasks, the barrier to deploying remote manipulation in hazardous or inaccessible environments drops substantially.
Teleoperation for contact-rich manipulation remains challenging, especially when using low-cost, motion-only interfaces that provide no haptic feedback. Virtual reality controllers enable intuitive motion control but do not allow operators to directly perceive or regulate contact forces, limiting task performance. To address this, we propose an augmented reality (AR) visualization of the impedance controller's target pose and its displacement from each robot end effector. This visualization conveys the forces generated by the controller, providing operators with intuitive, real-time feedback without requiring specialized haptic hardware. A user evaluation demonstrates significant improvement in contact-rich task performance.
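The displacement-to-force mapping the overlay exploits can be illustrated with a toy stiffness-only impedance law; the stiffness values below are assumptions, not the paper's gains.

```python
# Minimal sketch of the force-from-displacement relation the AR overlay relies on
# (illustrative, not the paper's implementation): for a stiffness-dominated
# impedance controller, the commanded force is roughly proportional to the gap
# between the virtual target pose and the measured end-effector pose, so
# rendering that gap as a "ghost" arm makes the contact force visible.
import numpy as np

K_TRANS = np.diag([400.0, 400.0, 400.0])   # translational stiffness [N/m] (assumed)

def ghost_displacement(target_pos, actual_pos):
    """Displacement at which the AR ghost is drawn, relative to the real arm."""
    return target_pos - actual_pos

def implied_contact_force(target_pos, actual_pos):
    """Force implied by the impedance law F = K (x_target - x_actual)."""
    return K_TRANS @ ghost_displacement(target_pos, actual_pos)

# Example: a 2 cm gap along -z implies roughly 8 N of downward commanded force.
print(implied_contact_force(np.array([0.5, 0.0, 0.30]),
                            np.array([0.5, 0.0, 0.32])))
```

For a stiffness-dominated controller, the ghost's offset from the real arm is therefore a linear visual proxy for the commanded force.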
Addresses the "dark factory" problem: as manufacturers automate production lines and remove human presence, existing safety systems (stationary cameras, fixed sensors) cannot respond dynamically to novel hazards or cover the full spatial extent of the facility; humanoid robots fill this gap.
The agentic ReAct (Reasoning + Acting) framework lets the humanoid decide when to investigate further, when to report, and when to call for intervention, moving beyond simple detection classifiers to context-aware hazard assessment and response selection (a minimal loop of this kind is sketched below).
Multi-modal perception (RGB-D plus thermal imaging) on the Unitree G1 platform characterizes hazard type and severity rather than merely flagging anomalies: fire, smoke, gas leaks, and thermal anomalies each trigger different response sequences, reducing false alarms and enabling appropriate interventions.
Humanoid morphology is a deliberate choice: it enables access to the same physical spaces humans previously occupied, from operating manual shutoff valves and checking elevated equipment to navigating irregular terrain that wheeled platforms cannot traverse.
Demonstrates three critical hazard scenarios (fire/smoke detection, abnormal temperature monitoring, gas leak response), establishing a benchmark for autonomous industrial safety systems that goes beyond the visual anomaly detection literature.
The rise of unmanned "dark factories" operating without human presence demands autonomous safety systems capable of detecting and responding to multiple hazard types. We present SafeGuard ASF (Agentic Security Fleet), a comprehensive framework deploying humanoid robots for autonomous hazard detection in industrial environments. Our system integrates multi-modal perception (RGB-D imaging and thermal sensing), a ReAct-based agentic reasoning framework, and learned locomotion policies on the Unitree G1 humanoid platform. We address three critical hazard scenarios: fire and smoke detection, abnormal temperature monitoring, and gas leak response, demonstrating autonomous industrial safety inspection in dark factory environments.
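A ReAct-style loop of the kind described above can be sketched in a few lines; the decision function and action set below are placeholders, not the paper's agent or prompts.

```python
# Minimal sketch of a ReAct-style hazard loop (illustrative only; the paper's
# prompts, tools, and perception stack are not reproduced here). The agent
# alternates between a reasoning step and one of a small set of actions until
# it decides the situation is handled. `llm_decide` and the action callables
# are assumed placeholders.
def react_hazard_loop(observation, llm_decide, actions, max_steps=10):
    """actions: dict mapping names like 'investigate', 'report', or
    'request_intervention' to callables that return a new observation."""
    history = []
    for _ in range(max_steps):
        # Reason: the model sees the running history and the latest observation.
        thought, action_name, action_arg = llm_decide(history, observation)
        history.append(("thought", thought))
        if action_name == "done":
            break
        # Act: execute the chosen tool (e.g. move closer, raise an alarm).
        observation = actions[action_name](action_arg)
        history.append(("action", action_name, observation))
    return history
```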
Giulio Pisaneschi, Pierpaolo Serio, Estelle Gerbier, Andrea Dan Ryals, Lorenzo Pollini
Core Contributions
Prior HRI studies conflate robot appearance, behavior, and explanation style, making it impossible to determine whether humans attribute mental states to robots because of how they look, what they do, or how their actions are described. This platform isolates explanation style by holding all other variables constant.
The same robot behavior is narrated in three distinct frames: mentalistic ("the robot believes..."), teleological ("the robot is trying to..."), and mechanistic ("the robot executes..."), using LLM-generated narration to ensure linguistic quality without researcher-introduced confounds (a prompt-level sketch follows below).
The LLM integration is both practical and methodologically important: it enables scalable generation of multiple explanatory framings for any behavioral sequence without scripting separate conditions, allowing the study to extend to diverse scenarios.
If mentalistic framing independently increases anthropomorphization, designers gain a powerful lever: the same robot can be made to feel more or less agentic through narration alone, with significant implications for trust calibration in human-robot teaming.
Contributes to a growing debate in HRI: as robots become more capable and language-fluent, whether we should encourage or discourage the intentional stance in users is an ethical design question this platform helps make empirically tractable.
This paper presents an experimental platform for studying intentional-state attribution toward a non-humanoid robot. The system combines a simulated robot, realistic task environments, and large language model-based explanatory layers that can express the same behavior in mentalistic, teleological, or mechanistic terms. By holding behavior constant while varying the explanatory frame, the platform provides a controlled way to investigate how language and framing shape the adoption of the intentional stance in robotics.
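A hypothetical prompt-level sketch of how one behavior trace could be narrated under the three frames is shown below; the prompt wording and the call_llm client are assumptions, not the paper's implementation.

```python
# Minimal sketch of narrating the same behavior trace under three explanatory
# frames (illustrative; the paper's actual prompts are not given in this
# digest). `call_llm` is an assumed placeholder for an LLM client.
FRAMES = {
    "mentalistic": "Describe what the robot believes, wants, and intends at each step.",
    "teleological": "Describe what the robot is trying to achieve at each step, without mental-state words.",
    "mechanistic": "Describe the procedures and control routines the robot executes at each step.",
}

def narrate(behavior_trace, frame, call_llm):
    """Generate one narration of the same behavior under the chosen frame."""
    prompt = (
        f"{FRAMES[frame]}\n\n"
        "Behavior trace (identical across conditions):\n"
        + "\n".join(behavior_trace)
    )
    return call_llm(prompt)

# The key experimental control: only `frame` varies between conditions.
# narrations = {f: narrate(trace, f, call_llm) for f in FRAMES}
```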
Guangyu Zhao, Ceyao Zhang, Chengdong Ma, Tao Wu, Yiyang Song
Core Contributions
Uses Mahjong as a rigorous long-horizon robotics test bed: a multi-hour game with a 136-tile state space, multi-player interaction, turn-based sequencing, and catastrophic failure modes where a single perception error invalidates hours of accumulated task state.
Argues, and demonstrates, that many long-horizon robotic failures are not caused by weak perception or planning but by the absence of cross-module consistency checks: when perceptual state, execution state, and interaction state are maintained in isolated silos, small errors cascade silently until the entire task state is corrupted.
Introduces a state consistency monitor that detects when perception and execution representations disagree about the current game state, triggering targeted recovery behaviors before errors propagate through the decision-making pipeline (a minimal version is sketched below).
The architecture explicitly partitions state into three layers (perceptual, execution, interaction) with defined interfaces and consistency invariants, a systems engineering discipline largely absent from current robotic manipulation research.
The Mahjong domain generalizes: the state accumulation problem this paper addresses applies directly to service robots completing multi-step household tasks and assistive robots supporting activities of daily living, where error recovery is as important as task execution.
Long-horizon tabletop games pose a distinct systems challenge for robotics: small perceptual or execution errors can invalidate accumulated task state, propagate across decision-making modules, and ultimately derail interaction. This paper studies how to maintain internal state consistency in turn-based, multi-human robotic tabletop games through deliberate system design rather than isolated component improvement. Using Mahjong as a representative long-horizon setting, we present an integrated architecture that explicitly maintains perceptual, execution, and interaction state with consistency monitors, demonstrating robust long-horizon robot game play with improved error recovery compared to component-only approaches.
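A minimal sketch of such a consistency monitor is shown below; the three-layer split matches the architecture described above, but the specific invariants and field names are illustrative assumptions, not the paper's design.

```python
# Minimal sketch of a cross-module consistency monitor over three state layers
# (illustrative; the layer contents and invariants are assumptions).
from dataclasses import dataclass, field

@dataclass
class TaskState:
    perceptual: dict = field(default_factory=dict)   # e.g. tiles seen by the camera
    execution: dict = field(default_factory=dict)    # e.g. tiles the arm believes it moved
    interaction: dict = field(default_factory=dict)  # e.g. whose turn the game logic thinks it is

def check_invariants(state: TaskState):
    """Return a list of violated invariants; an empty list means the layers agree."""
    violations = []
    # Invariant 1 (assumed): perception and execution agree on the robot's hand size.
    if state.perceptual.get("hand_count") != state.execution.get("hand_count"):
        violations.append("hand_count mismatch between perception and execution")
    # Invariant 2 (assumed): the robot only acts when the interaction layer says it is its turn.
    if state.execution.get("acting") and state.interaction.get("turn") != "robot":
        violations.append("robot acting outside its turn")
    return violations

def monitor_step(state: TaskState, recover):
    """Run after every module update; trigger recovery before errors cascade."""
    violations = check_invariants(state)
    if violations:
        recover(state, violations)   # targeted recovery, e.g. re-scan the table
    return violations
```

The point of the sketch is the discipline, not the specific checks: each module update is followed by an explicit cross-layer comparison, so a silently drifting perception or execution estimate is caught before it corrupts hours of accumulated game state.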