🤖 Robotics arXiv Digest

Thursday, June 18, 2026

📄 30 papers 📂 7 research areas Generated by Claude

🔭 Research Landscape

The defining tension in today's 30 papers is memory, frequency, and consistency in action generation — the community is converging on the realization that flow-matching and VLA policies, for all their expressivity, are fragile in time. MemoryWAM (#1) confronts the non-Markovian failure of bounded-window world models with a hybrid memory of recent frames, event anchors, and compressed gist tokens. Frequency-Aware Flow Matching (#25) and VFILC (#28) both attack the same enemy from the signal-processing side: discretized action chunks break under heterogeneous control frequencies, so FAFM moves flow matching into the DCT/cosine domain while VFILC adds iterative learning control to extrapolate motion speeds, cutting frequency error by up to 81%. MirrorDuo (#29) and Pose6DAug (#27) round out the augmentation story, manufacturing reflection-symmetric and 6D-pose-swapped demonstrations to stretch scarce data — a clear sign that data efficiency, not raw model scale, is the operative constraint.

A second theme is the aggressive pursuit of efficiency and compression across the embodied stack. Finetuning VLAs Requires Fewer Layers Than You Think (#18) shows pi_0 and GR00T-N1.5 carry severe layer-wise redundancy — a single forward pass with Centered Kernel Alignment lets you delete up to 50% of layers and still match the base model while cutting training time 40–50%. GazeLNN (#5) achieves state-of-the-art scanpath prediction at 0.61 GFLOPs (a 99.4% compute reduction), and the neuromorphic RMFS pathfinder (#30) reports an astonishing 11,281x energy saving by distilling an ANN policy into a spiking network on a neuromorphic chip. These papers share a thesis: the heavyweight models the field has built are far larger than the tasks require, and the next gains come from ruthless pruning, lightweight recurrent engines, and event-driven hardware.

The third current is structure and guarantees re-entering learned robotics. The Token Is a Group Element (#3) puts attention tokens directly on matrix Lie groups so the pairwise score becomes a closed-form algebra norm with tautological equivariance — reaching affine groups that representation-theoretic methods exclude. Stable Transformer-Actor-Critic MPC (#21) proves Transformers can satisfy incremental input-to-state stability and uses contraction theory as a training regularizer for certifiable robustness, while priority-ordered STL planning (#14) and the POSG target-search formulation (#19) inject formal specifications and game-theoretic reasoning under uncertainty. Alongside a notable hardware-design cluster — generating robot hands from 4M frames of human motion (#2), monolithic 3D-printed continuum platforms (#12), and the soft Belt-Finger gripper (#22) — the batch suggests a field simultaneously compressing its models and re-grounding them in geometry, control theory, and physical embodiment.

VLA & World-Action Models

Persistent memory, dual-arm coordination, layer compression, and pose-swap augmentation for VLAs.

#1 MemoryWAM: Efficient World Action Modeling with Persisten...
#16 Co-VLA: Coordination-Aware Structured Action Modeling for...
#18 Finetuning Vision-Language-Action Models Requires Fewer L...
#27 Pose6DAug: Physically Plausible Multi-view Object Swappin...

Flow Matching & Imitation Learning

Frequency-aware and temporally consistent action generation, reflection symmetry, and object-dynamics modeling.

#20 FlowMaps: Modeling Long-Term Multimodal Object Dynamics w...
#25 Frequency-Aware Flow Matching for Continuous and Consiste...
#28 VFILC: Accurate Frequency Extrapolations in Imitation Lea...
#29 MirrorDuo: Reflection-Consistent Visuomotor Learning from...

Navigation, Active Perception & VLN

Attention-guided perception, failure anticipation, latency-resilient VLM planning, and neuromorphic pathfinding.

#5 Fast Human Attention Prediction for Fixation-guided Activ...
#6 GroundControl: Anticipating Navigation Failures in Vision...
#7 Slow Brain, Fast Planner: Latency-Resilient VLM-Augmented...
#30 A Neuromorphic Reinforcement Learning Framework for Effic...

SLAM, Mapping & State Estimation

Robust joint estimation, thermal Gaussian splatting, underwater reconstruction, and LiDAR pretraining.

#8 ARC: Adaptive Robust Joint State and Covariance Estimation
#10 LIT-GS: LiDAR-Inertial-Thermal Gaussian Splatting for Ill...
#15 Towards 3D karst underwater scene reconstruction from rot...
#23 HilDA: Hierarchical Distillation with Diffusion for Advan...

Continuum, Soft & Co-Designed Hardware

Generating robot embodiments, resilient continuum planning, reproducible platforms, and soft grippers.

#2 Generating Robot Hands from Human Demonstrations
#4 Increasing Resilience of Continuum Robots via Motion Plan...
#12 CoLI: A Reproducible Platform for Continuum Robot Learnin...
#22 Belt-Finger: An Affordable Soft Belt-Driven Gripper for D...

Tactile, Sim-to-Real & Robotic Automation

FEM tactile simulation, synthetic data linking, HRC assembly tracking, and LLM-driven lab automation.

#9 TaCauchy: An Extensible FEM Framework for Vision-Based Ta...
#17 Efficiently Linking Real Scenes with Synthetic Data Gener...
#24 Robust Assembly State Reasoning from Action Recognition f...
#26 Dual-Agent Framework for Cross-Model Verified Translation...

Control, Formal Methods & Multi-Robot Systems

Lie-group attention, auditable research agents, decentralized localization, STL planning, search games, and stable MPC.

#3 The Token Is a Group Element: On Lie-Algebra Attention ov...
#11 Agentic AutoResearch for Space Autonomy: An Auditable, LLM...
#13 An Infrastructure-less, Control-Independent Solution to R...
#14 Autonomous Driving with Priority-Ordered STL Specificatio...
#19 Mobile Target Search with Imperfect Perception: A Partial...
#21 Stable Transformer-Actor-Critic Model Predictive Control:...

VLA & World-Action Models

#1 h=n/a

MemoryWAM: Efficient World Action Modeling with Persistent Memory

2026-06-18 cs.RO

Sizhe Yang, Juncheng Mu, Tianming Wei, Chenhao Lu, Xiaofan Li

Core Contributions

Resolves the core trade-off in world-action models — efficient methods condition on a short recent window and fail in non-Markovian tasks, while long-history methods blow up in time and space cost — with a hybrid memory design.
Combines three memory types: recent frames for detail, event-boundary anchor frames for salient moments, and compact gist tokens that summarize long-range history, so memory cost stays bounded as horizons grow.
A tailored attention mechanism retrieves both detailed short-term and compressed long-term context, enabling memory-dependent decisions with reduced latency and GPU memory versus full-history WAMs.
On long-horizon, memory-dependent manipulation in sim and the real world it beats strong VLA and WAM baselines while keeping inference efficient — showing memory structure, not just window size, is what matters.
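The three-part memory described above can be illustrated with a small bounded buffer. This is a minimal sketch, not the authors' implementation: the class name `HybridMemory`, the size parameters, the mean-pooling "compressor", and the rule of dropping the oldest gist once a cap is reached are all assumptions made for illustration. The digest does not describe how MemoryWAM builds or retires gist tokens.

```python
from collections import deque


class HybridMemory:
    """Toy bounded memory: recent frames, event anchors, and gist summaries."""

    def __init__(self, recent_size=4, max_anchors=8, max_gists=16, gist_span=4):
        self.recent = deque(maxlen=recent_size)    # detailed short-term context
        self.anchors = deque(maxlen=max_anchors)   # frames kept at event boundaries
        self.gists = deque(maxlen=max_gists)       # compressed long-range history
        self.gist_span = gist_span                 # evicted frames pooled per gist
        self._pending = []                         # evicted frames awaiting pooling

    def add(self, frame, is_event_boundary=False):
        if is_event_boundary:
            self.anchors.append(frame)
        if len(self.recent) == self.recent.maxlen:
            # The oldest recent frame is about to be evicted; queue it for pooling.
            self._pending.append(self.recent[0])
        self.recent.append(frame)
        if len(self._pending) == self.gist_span:
            dim = len(self._pending[0])
            # Mean pooling stands in for a learned compressor (assumption).
            gist = [sum(f[i] for f in self._pending) / self.gist_span
                    for i in range(dim)]
            self.gists.append(gist)
            self._pending = []

    def context(self):
        """Entries a policy would attend over: gists, then anchors, then recent."""
        return list(self.gists) + list(self.anchors) + list(self.recent)


memory = HybridMemory()
for t in range(1000):
    memory.add([float(t), 0.0], is_event_boundary=(t % 50 == 0))
print(len(memory.context()))
```

With the default sizes the context never exceeds 16 + 8 + 4 = 28 entries, so the run above prints 28 even after 1,000 frames. That is the bounded-cost property the summary highlights.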

Abstract

Robust robotic manipulation in the real world requires not only an understanding of the current observation, but also memory and dynamics modeling. World action models (WAMs) possess these capabilities by jointly modeling visual foresight and actions conditioned on both current and historical observations, making them a promising paradigm for robotic manipulation. However, existing WAMs face a fundamental trade-off: methods with efficient inference typically condition only on a bounded window of recent observations and therefore struggle in non-Markovian environments, whereas methods that preserve long histories incur time and space costs that grow substantially with sequence length. To address this challenge, we introduce MemoryWAM, a world action model with efficient persistent memory. MemoryWAM uses a hybrid memory design that combines recent frames, event-boundary anchor frames, and compact gist tokens that summarize long-range history. A tailored attention mechanism enables retrieval of both detailed short-term context and compressed long-term context, supporting memory-dependent decision-making with reduced inference latency and GPU memory usage. Across long-horizon, memory-dependent manipulation tasks in both simulation and the real world, MemoryWAM outperforms strong vision-language-action (VLA) and WAM baselines while maintaining favorable computational efficiency.

#16 h=n/a

Co-VLA: Coordination-Aware Structured Action Modeling for Dual-Arm Vision-Language-Action Systems

2026-06-18 cs.RO

Yandong Wang, Jiaqian Yu, Xiongfeng Peng, Lu Xu, Yamin Mao

Core Contributions

Argues implicit coordination from end-to-end learning is insufficient for tightly-coupled bimanual tasks, and injects explicit structural priors into VLA models for dual-arm manipulation.
Replaces the monolithic action head with a Structured Action Expert that splits a shared latent (task-level coordination intent) from residual latents (per-arm execution adjustments), shaped by a coordination-aware loss.
A Latent-Aware Controller interprets these latents at deployment to modulate synchronization strength, execution asymmetry, smoothness, and safety in real time — at the joint-command level, with no force/impedance control needed.
Delivers a 27% success-rate gain on tight-coordination tasks and more than doubles out-of-distribution real-world performance (13% to 27%) while cutting completion time up to 25%.

Abstract

Vision-language-action (VLA) models show strong capabilities in single and dual-arm robotic manipulation. Prior works show coordinated bimanual behaviors can emerge from end-to-end learning, leveraging large vision-language backbones with continuous action prediction. However, as bimanual tasks become tightly coupled and execution constraints become critical, implicit coordination alone is insufficient to ensure reliable, interpretable, and stable behavior. In this work, we propose Co-VLA, a coordination-aware bimanual manipulation framework introducing explicit structural priors into VLA models. We instantiate our method on a state-of-the-art vision-language backbone by replacing its monolithic action head with a Structured Action Expert (SAE) designed for bimanual coordination. Specifically, we introduce explicit structure at the action generation level with a modular coordination-aware loss that shapes shared and residual latents according to task-specific structures. The shared latent encodes task-level coordination intent, while residual latents capture execution adjustments for each arm. At deployment, a Latent-Aware Controller (LAC) interprets the learned representations to modulate synchronization strength, execution asymmetry, smoothness, and safety constraints in real time. LAC operates at the joint-command level and remains compatible with standard control pipelines without requiring force or impedance control. Experiments across simulation and real-world benchmarks show Co-VLA significantly outperforms monolithic baselines, achieving a 27% success rate gain in tight-coordination tasks, more than doubling performance in OOD real-world scenarios (from 13% to 27%), and reducing task completion time by up to 25%.

#18 h=n/a

Finetuning Vision-Language-Action Models Requires Fewer Layers Than You Think

2026-06-18 cs.RO, cs.AI

Gia-Binh Nguyen, Trong-Bao Ho, Thien-Loc Ha, Khoa Vo, Philip Lund Møller

Core Contributions

Reveals that VLA foundation policies like pi_0 and GR00T-N1.5, despite training on diverse trajectories, carry severe layer-wise representational redundancy — a non-obvious architectural finding.
Introduces a fully training-free compression pipeline: a single forward pass with Centered Kernel Alignment identifies redundant 'twin' layers, which are removed to cut model depth by up to 50% across both backbone and control head.
Unlike methods that must load full models to learn token reductions or dynamic layer selectors, this needs no learning of the compression itself, making it cheap to apply.
Validated across LIBERO, RoboCasa, SimplerEnv and 10 real-world tasks on 4 embodiments, it cuts training time 40–50% and inference up to 30% while matching or exceeding full-scale performance.
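To make the redundancy test concrete, here is a minimal sketch of linear Centered Kernel Alignment (CKA) between two activation matrices, followed by a greedy pass that flags near-duplicate layers. The CKA formula is the standard linear variant. The greedy rule in `find_twin_layers` (compare each layer to the last kept one) and the 0.99 threshold are hypothetical stand-ins, since the digest does not state the paper's exact selection criterion.

```python
import math


def center(X):
    """Subtract the per-feature mean from an (examples x features) matrix."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in X]


def cross_frob_sq(A, B):
    """Squared Frobenius norm of A^T B for matrices sharing the example axis."""
    total = 0.0
    for i in range(len(A[0])):
        for j in range(len(B[0])):
            s = sum(a[i] * b[j] for a, b in zip(A, B))
            total += s * s
    return total


def linear_cka(X, Y):
    """Linear CKA in [0, 1]; 1 means the representations match up to scale/rotation."""
    Xc, Yc = center(X), center(Y)
    numerator = cross_frob_sq(Xc, Yc)
    denominator = math.sqrt(cross_frob_sq(Xc, Xc) * cross_frob_sq(Yc, Yc))
    return numerator / denominator


def find_twin_layers(activations, threshold=0.99):
    """Flag layers whose output is nearly identical to the last kept layer.

    Hypothetical greedy rule, used here only for illustration.
    """
    keep, drop = [0], []
    for i in range(1, len(activations)):
        if linear_cka(activations[keep[-1]], activations[i]) >= threshold:
            drop.append(i)
        else:
            keep.append(i)
    return drop


# Toy activations for three "layers" on four examples.
X = [[1.0, 2.0], [3.0, 1.0], [0.0, 5.0], [2.0, 2.0]]
X_scaled = [[2.0 * v for v in row] for row in X]   # same representation, rescaled
Z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
print(find_twin_layers([X, X_scaled, Z]))
```

In this toy case the rescaled layer is flagged as redundant and the third layer is kept. In practice the activations would come from one forward pass over a calibration batch.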

Abstract

Vision-Language-Action (VLA) models pre-trained on massive video-robot datasets have revolutionized robotic manipulation, yet their multi-billion parameter architectures impose prohibitive computational burdens during downstream fine-tuning and real-time inference. In this work, we reveal a highly non-trivial architectural characteristic of these continuous control foundation policies (e.g., pi_0, GR00T-N1.5): despite being trained on diverse physical trajectories, they exhibit severe layer-wise representational redundancy. To exploit this, we introduce a structural compression pipeline that is entirely training-free, bypassing the need of existing methods to load full-scale models to learn optimized token reductions or dynamic layer selectors. Instead, using only a single forward pass via Centered Kernel Alignment to identify redundant layer features, we remove twin layers to permanently compress the model depth by up to 50% across both the VLM backbone and the continuous control policy head. Downstream fine-tuning of this streamlined architecture yields a dual acceleration benefit: a 40-50% reduction in training time and up to 30% faster real-time inference, while matching or exceeding full-scale base model performance. We comprehensively validate our method across three simulation benchmarks (LIBERO, RoboCasa, SimplerEnv) and 10 diverse real-world manipulation tasks across 4 unique robotic embodiments. These results prove that advanced VLAs require significantly fewer layers than previously assumed, offering a highly compute-efficient paradigm for scalable robot learning.

#27 h=n/a

Pose6DAug: Physically Plausible Multi-view Object Swapping for Robot Data Augmentation

2026-06-18 cs.RO, cs.LG

Jonghoon Lee, Seong Hyeon Park, Byungwoo Jeon, Minha Lee, Jinwoo Shin

Core Contributions

Addresses VLA failures on novel out-of-distribution objects without collecting new teleoperation data, by turning a policy's own successful episodes into targeted demonstrations for its failure modes.
Key insight: every successful episode already encodes a physically valid trajectory plus calibrated multi-view observations, so swapping only the manipulated object yields new, physically grounded demonstrations.
Operates in 3D rather than 2D video editing — anchoring the target object with an explicit mesh driven by a temporally coherent 6D pose trajectory — to preserve multi-view consistency under occlusion and egocentric views.
Finetuning a VLA on this augmented data improves success on novel objects by 16.5% relative to state of the art while preserving in-distribution performance.

Abstract

Vision-language-action (VLA) policies have shown strong potential for general-purpose manipulation, yet they often fail on novel, out-of-distribution objects whose appearance or geometry deviates from the training distribution. The standard remedy is to collect multi-view teleoperation data for every failure case, but this scales poorly in both cost and time. We introduce Pose6DAug, a failure-driven data augmentation framework that turns a policy's own successful episodes into targeted demonstrations for its failure modes, without any new data collection. Our key insight is that each successful episode already encodes a physically valid action trajectory together with calibrated multi-view observations. By swapping only the manipulated object while preserving this trajectory, we obtain new and physically grounded demonstrations. However, naive 2D video editing breaks multi-view consistency and physical plausibility, particularly under heavy occlusion and egocentric viewpoints. Our method instead operates directly in 3D, anchoring the target object with an explicit mesh driven by a temporally coherent 6D pose trajectory, ensuring geometrically consistent renderings across all camera views. Fine-tuning a VLA on data augmented by our method improves success rates by 16.5% relative to the state-of-the-art baseline on novel objects, while preserving in-distribution performance. These results show that multi-view and physically consistent augmentation is a practical path to scalable VLA generalization.

Flow Matching & Imitation Learning

#20 h=n/a

FlowMaps: Modeling Long-Term Multimodal Object Dynamics with Flow Matching

2026-06-18 cs.RO, cs.AI

Francesco Argenziano, Miguel Saavedra-Ruiz, Sacha Morin, Charlie Gauthier, Daniele Nardi

Core Contributions

Models long-term, multimodal object dynamics in 3D — where objects move around a home over time due to human routines — so robots can re-find objects rather than assuming static scenes.
FlowMaps is a latent flow-matching model that estimates multimodal distributions over future object locations, learning implicit dependencies among objects conditioned on past human interactions.
Because human habits induce spatio-temporally consistent patterns, the model generalizes across unseen environments that share similar object routines, rather than memorizing one layout.
Across 600+ episodes of downstream dynamic Object Navigation in sim and the real world, it outperforms state-of-the-art approaches, showing continuous multimodal dynamics modeling improves robotic search.

Abstract

Joint spatial and temporal understanding of 3D scenes is a crucial requirement for robots deployed in everyday household environments. Such agents must not only comprehend and navigate spatial layouts, but also reason about how these spaces evolve over time. In particular, humans interact with objects daily, causing them to change position throughout the environment and making it difficult for robots to reliably associate current observations with previously seen objects. However, these interactions are not random: human habits and routines induce spatio-temporally consistent patterns in object locations, which robotic agents can potentially learn and then exploit for downstream tasks such as navigation. To this end, we introduce FlowMaps, a latent flow matching model for estimating multimodal distributions over the future locations of dynamic objects in a continuous 3D space. By learning the implicit dependencies among objects and their temporal evolution, FlowMaps predicts likely changes in object locations conditioned on past human interactions, while supporting generalization across previously unseen environments that share similar object routines. To demonstrate the utility of this method, we deploy FlowMaps in a downstream dynamic Object Navigation task in both simulated and real-world environments. Across more than 600 episodes, FlowMaps outperforms state-of-the-art approaches, showing that modeling object dynamics through continuous, multimodal spatio-temporal distributions improves robotic search and navigation in changing household environments. Code and additional material is available at https://fra-tsuna.github.io/flowmaps/.

#25 h=n/a

Frequency-Aware Flow Matching for Continuous and Consistent Robotic Action Generation

2026-06-18 cs.RO, cs.AI

Jianing Guo, Fangzheng Chen, Zihao Mao, Wong Lik Hang Kenny, Zhenhong Wu

Core Contributions

Fixes two flaws of chunk-based flow-matching and diffusion policies: brittleness to demonstrations at heterogeneous control frequencies and temporally inconsistent actions that hurt control stability.
Transforms discrete action sequences into the frequency domain via the discrete cosine transform, performs flow matching over DCT coefficients, then reconstructs continuous actions through cosine-basis expansion.
Regularizes the first-order temporal derivative — a Sobolev-type constraint that suppresses high-frequency error and discourages abrupt action changes — to enforce smooth, consistent motion.
Adds no network parameters and plugs into standalone flow policies and VLAs, improving success, smoothness, convergence speed, and robustness to mixed-frequency input across LIBERO, LapGym, and a real Franka.
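The frequency-domain idea can be sketched for a single action dimension: take the DCT of a chunk, evaluate the cosine expansion at any normalized time to resample at a different control rate, and measure smoothness directly from the coefficients. This is an illustrative sketch under assumptions, not FAFM itself. It uses an unnormalized DCT-II, omits the flow-matching model entirely, and the closed-form `smoothness_penalty` is the integral of the squared first derivative of this particular expansion, which may differ from the paper's regularizer.

```python
import math


def dct_coeffs(chunk):
    """Unnormalized DCT-II of a 1-D action chunk."""
    n = len(chunk)
    return [sum(x * math.cos(math.pi * k * (i + 0.5) / n)
                for i, x in enumerate(chunk))
            for k in range(n)]


def eval_action(coeffs, s):
    """Evaluate the cosine-basis expansion at normalized time s in [0, 1]."""
    n = len(coeffs)
    total = coeffs[0] + 2.0 * sum(c * math.cos(math.pi * k * s)
                                  for k, c in enumerate(coeffs) if k > 0)
    return total / n


def smoothness_penalty(coeffs):
    """Integral over [0, 1] of the squared time derivative of the expansion.

    Each coefficient is weighted by k^2, so high-frequency content costs more.
    """
    n = len(coeffs)
    return 2.0 * math.pi ** 2 / n ** 2 * sum((k * c) ** 2
                                             for k, c in enumerate(coeffs))


chunk = [0.0, 1.0, 2.0, 3.0]                 # 4 samples of one action dimension
coeffs = dct_coeffs(chunk)
# Resample the same motion at twice the control rate (8 samples).
resampled = [eval_action(coeffs, (j + 0.5) / 8) for j in range(8)]
print([round(a, 3) for a in resampled])
```

Evaluating at the original sample times `(i + 0.5) / n` recovers the chunk exactly, while other values of `s` give a continuous interpolation. This is what lets one set of coefficients serve demonstrations recorded at different frequencies.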

Abstract

Flow matching has emerged as a standard paradigm for robotic manipulation owing to its strong expressive power for modelling complex, multimodal action distributions, alongside similar approaches like diffusion policy. However, existing methods rely on discretized action chunks, making them brittle to demonstrations collected at heterogeneous control frequencies and prone to temporally inconsistent actions that degrade control stability. In this paper, we propose Frequency-Aware Flow Matching (FAFM), which outputs continuous, temporally consistent actions. To handle heterogeneous frequency input, we transform discrete action sequences into the frequency domain with the discrete cosine transform (DCT), perform flow matching over the resulting coefficients, and reconstruct continuous actions via cosine basis expansion. To generate temporally consistent actions, we regularize the first-order temporal derivative to promote smooth actions. This corresponds to a Sobolev-type constraint that suppresses high-frequency errors and discourages abrupt action changes. Our FAFM is simple, introduces no additional network parameters and applies to standalone flow-matching policies and vision-language action models. Across synthetic toy benchmark, obstacle avoidance, LapGym, and LIBERO, FAFM improves success rates, multimodal expressivity, motion smoothness, convergence speed, robustness to mechanical bias and mixed-frequency input. These gains are consistent when deployed on a real-world Franka robot. Code available at https://anonymous.4open.science/r/FAFM.

#28 h=n/a

VFILC: Accurate Frequency Extrapolations in Imitation Learning via Sampling Frequency ILC

2026-06-18 cs.RO

Nozomu Masuya, Toshiaki Tsuji, Sho Sakaino

Core Contributions

Targets variable-speed imitation learning, where prior NN methods either only interpolated trained speeds or produced unpredictable motion when extrapolating beyond the training velocity range.
Builds on Variable-Frequency Imitation Learning (which links sampling frequency to motion frequency) but adds iterative learning control with both feedforward and feedback parts to correct the frequency errors VFIL leaves open-loop.
The feedback component is the key advance: it reduced frequency error by a striking 81% in a wiping task and 50% in a shaking task when extrapolating to double the average training speed.
Also improved accuracy by 27% over VFIL even at an interpolated frequency on a contact-rich mixing task with complex friction — showing benefits beyond just extrapolation.

Abstract

Conventional neural network (NN)-based imitation learning methods for variable-speed motion either restricted their scope to interpolated speeds, or generated unpredictable motions when extrapolating beyond trained velocity ranges. Variable-frequency imitation learning (VFIL) enabled extrapolations of speeds by linking the NN model's sampling frequency to the motion frequency, whereas its open-loop configuration caused frequency errors, especially in the extrapolated high-frequency settings. This study proposes variable-frequency imitation learning with iterative learning control (VFILC) based on a combination of VFIL and iterative learning control (ILC) with both feedforward and feedback parts, the former taking advantage of VFIL and the latter adjusting the frequency errors. The experimental results showed that the proposed method successfully and accurately extrapolated motion speeds and reduced frequency errors in all three tasks, and that the feedback especially reduced the frequency errors by a remarkable 81% in the wiping task and 50% in the shaking task, both compared to simple feedforward VFIL, when extrapolating at double the average speed in the training data. The proposed method also improved accuracy by 27% compared with VFIL even at an interpolated frequency for a contact-rich mixing task affected by complex friction traits.

#29 h=n/a

MirrorDuo: Reflection-Consistent Visuomotor Learning from Mirrored Demonstration Pairs

2026-06-18 cs.RO

Zheyu Zhuang, Ruiyu Wang, Giovanni Luca Marchetti, Florian T. Pokorny, Danica Kragic

Core Contributions

Exploits reflection symmetry to halve demonstration cost — 'collect one, get one for free' — by generating a mirrored counterpart for each demo over image, proprioception, and full 6-DoF end-effector action tuples.
Can be used two ways: as a plain data-augmentation strategy for behavior cloning or diffusion policy, or as a structural prior baked into reflection-equivariant policy networks.
By leveraging overlap between original and mirrored domains, it improves performance under the same data budget when demos are spread across both sides of the workspace.
When demos are confined to one side, MirrorDuo transfers skills to the mirrored workspace with as few as zero or five target-side demos — a concrete answer to workspace-variation generalization.
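A mirrored demonstration tuple can be sketched as follows, assuming the mirror plane is the world plane x = 0 and the camera is placed so that a horizontal image flip corresponds to that reflection. Both are assumptions for illustration, as the digest does not give the paper's frame conventions. The point of the sketch is that the end-effector rotation must be conjugated by the reflection (R' = M R M) so that the result is still a proper rotation.

```python
import math

MIRROR = (-1.0, 1.0, 1.0)  # diagonal of M: reflection across the plane x = 0 (assumed)


def mirror_position(p):
    """Reflect a 3-D position: p' = M p."""
    return [m * v for m, v in zip(MIRROR, p)]


def mirror_rotation(R):
    """Reflect a 3x3 rotation: R' = M R M, which keeps det(R') = +1."""
    return [[MIRROR[i] * R[i][j] * MIRROR[j] for j in range(3)] for i in range(3)]


def mirror_image(image):
    """Horizontally flip an image stored as a list of pixel rows."""
    return [list(reversed(row)) for row in image]


def det3(R):
    """Determinant of a 3x3 matrix, used to check that R' is a proper rotation."""
    return (R[0][0] * (R[1][1] * R[2][2] - R[1][2] * R[2][1])
            - R[0][1] * (R[1][0] * R[2][2] - R[1][2] * R[2][0])
            + R[0][2] * (R[1][0] * R[2][1] - R[1][1] * R[2][0]))


# Example: end effector at x = 0.3 m, yawed +30 degrees about the z axis.
c, s = math.cos(math.radians(30)), math.sin(math.radians(30))
position = [0.3, 0.1, 0.2]
rotation = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

mirrored_position = mirror_position(position)
mirrored_rotation = mirror_rotation(rotation)   # a yaw of -30 degrees
mirrored_image = mirror_image([[1, 2, 3], [4, 5, 6]])
```

Multiplying the rotation by M on one side only would produce a matrix with determinant -1, which is not a valid end-effector orientation. The two-sided form avoids that.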

Abstract

Image-based behaviour cloning leverages demonstrations captured from ubiquitous RGB cameras. However, it remains constrained by the cost of collecting diverse demos, especially for generalizing across workspace variations. We propose MirrorDuo, a reflection-based formulation that operates on image, proprioception, and full 6-DoF end-effector action tuples, generating a mirrored counterpart for each original demonstration, effectively achieving "collect one, get one for free". It can be applied as a data augmentation strategy for existing learning pipelines, such as standard behaviour cloning or diffusion policy, or as a structural prior for reflection-equivariant policy networks. By leveraging the overlap between the original and mirrored domains, MirrorDuo achieves significantly improved performance under the same data budget when demonstrations are evenly distributed across both sides of the workspace. When demonstrations are confined to one side, MirrorDuo enables efficient skill transfer to the mirrored workspace with as few as zero or five demos in the target arrangement.

Navigation, Active Perception & VLN

#5 h=n/a

Fast Human Attention Prediction for Fixation-guided Active Perception in Autonomous Navigation

2026-06-18 cs.RO, cs.CV

Fatma Youssef Mohammed, Grzegorz Malczyk, Kostas Alexis

Core Contributions

Brings human-like structured scanpaths into robot autonomy, overcoming the high compute cost that has kept predictive attention models out of real-time robotics.
GazeLNN uses Liquid Neural Networks as its recurrent engine with MobileNetV3 features, predicting sequential fixation heatmaps auto-regressively from the current stimulus and fixation history.
At just 0.61 GFLOPs it reaches a 0.47 ScanMatch score (state of the art on MIT Low Resolution) while cutting compute by 99.4% and running up to six times faster than recurrent baselines.
Integrated into an RL-trained active camera-control policy, it enables human-fixation-guided perception during autonomous navigation, validated on a real aerial robot.

Abstract

Human visual attention relies on structured scanpaths to efficiently process scenes, yet instilling this behavior into robot autonomy is in its infancy and hindered by the high computational costs of existing predictive models. To address this, we introduce GazeLNN, a computationally lightweight scanpath prediction model that leverages Liquid Neural Networks as its recurrent engine and employs MobileNetV3 for feature extraction. Operating auto-regressively, the architecture predicts sequential fixation heatmaps conditioned on the current visual stimulus and fixation history. Despite requiring only 0.61 GFLOPs, GazeLNN achieves state-of-the-art performance on the MIT Low Resolution dataset, with a ScanMatch score of 0.47. It outperforms existing recurrent baselines across diverse evaluation metrics, while reducing computational costs by 99.40% and accelerating inference by up to six times. To investigate the role of human attention modeling in robot autonomy and demonstrate the practical utility of this highly efficient architecture, we integrate GazeLNN into an active camera-robot control policy trained via Reinforcement Learning. This integration enables human-fixation-guided perception during autonomous navigation, validated through successful real-world deployments on an aerial robot.

#6 h=n/a

GroundControl: Anticipating Navigation Failures in Vision-Language Agents via Trajectory-Consistent Uncertainty Estimates

2026-06-18 cs.RO

Nastaran Darabi, Divake Kumar, Sina Tayebati, Devashri Naik, Amit Ranjan Trivedi

Core Contributions

Anticipates vision-language navigation failures before they fully unfold — oscillation, stagnation, detours — rather than relying on instantaneous action entropy that reacts too late.
GroundControl models distance-to-goal evolution with a constant-velocity Kalman filter and combines normalized innovation statistics with trajectory features (progress, monotonicity, path efficiency, oscillation) into one uncertainty score.
Introduces Selective Risk-Coverage Navigation (SRCN) to evaluate uncertainty quality independent of task success, using risk-coverage curves and AURC/E-AURC summaries.
Across five EB-Navigation splits it achieves near-oracle failure ordering (weighted E-AURC of 0.0024 for GPT-4o), substantially beating entropy-, conformal-, and heuristic baselines.
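The Kalman-filter ingredient can be sketched on a scalar distance-to-goal signal. The filter below tracks distance and its rate of change under a constant-velocity model and records the normalized innovation squared (NIS) at each step; large values mean the agent is deviating from steady goal-directed progress. This is a minimal sketch: the noise parameters `q` and `r`, the initial covariance, and the use of mean NIS alone as a score are assumptions, and the additional trajectory features that GroundControl combines with it are not modeled.

```python
def nis_trace(distances, dt=1.0, q=0.01, r=0.1):
    """Normalized innovation squared per step for a constant-velocity Kalman filter.

    State is [distance, velocity]; only distance is observed.
    q and r are illustrative process and measurement noise levels (assumptions).
    """
    d, v = distances[0], 0.0
    p00, p01, p10, p11 = 1.0, 0.0, 0.0, 1.0   # initial covariance (identity)
    trace = []
    for z in distances[1:]:
        # Predict with F = [[1, dt], [0, 1]] and Q = q * I.
        d_pred = d + dt * v
        a00 = p00 + dt * (p01 + p10) + dt * dt * p11 + q
        a01 = p01 + dt * p11
        a10 = p10 + dt * p11
        a11 = p11 + q
        # Innovation and its variance for H = [1, 0].
        y = z - d_pred
        s = a00 + r
        trace.append(y * y / s)
        # Kalman gain and update.
        k0, k1 = a00 / s, a10 / s
        d, v = d_pred + k0 * y, v + k1 * y
        p00, p01 = (1.0 - k0) * a00, (1.0 - k0) * a01
        p10, p11 = a10 - k1 * a00, a11 - k1 * a01
    return trace


def mean_nis(distances):
    trace = nis_trace(distances)
    return sum(trace) / len(trace)


steady = [10.0 - i for i in range(10)]             # monotone progress to the goal
oscillating = [10.0 - (i % 2) for i in range(10)]  # back-and-forth, no net progress
print(mean_nis(steady) < mean_nis(oscillating))
```

Steady progress is quickly explained by the learned velocity, so its innovations shrink. The oscillating trace keeps violating the constant-velocity prediction and accumulates a much higher score.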

Abstract

Vision-language navigation agents achieve competitive average success on benchmark tasks, yet failures often arise through predictable trajectory-level breakdowns such as oscillation, stagnation, or inefficient detours. Reliable deployment, therefore, requires uncertainty signals that anticipate emerging failure dynamics during execution rather than reflect only instantaneous action entropy. We introduce GroundControl, a trajectory-consistent uncertainty estimator defined as statistical deviation from nominal goal-directed distance-to-goal dynamics aggregated over an episode. GroundControl models distance evolution using a constant-velocity Kalman filter and combines normalized innovation statistics with complementary trajectory features capturing progress, monotonicity, path efficiency, and oscillatory behavior. The resulting uncertainty score reflects geometric and temporal inconsistency in navigation behavior rather than local prediction dispersion. To evaluate uncertainty quality independently of task success, we formalize Selective Risk-Coverage Navigation (SRCN), a protocol that measures how effectively an uncertainty score ranks episodes by failure or inefficiency using risk-coverage curves and AURC / E-AURC summaries. Across five EB-Navigation splits (N=300 episodes), trajectory-consistent uncertainty achieves near-oracle ordering under success-based selective risk, with weighted-average E-AURC_SR = 0.0024 for the GPT-4o model, substantially outperforming entropy-, conformal-, and heuristic baselines. Under SPL-based selective evaluation, GroundControl consistently achieves the lowest AURC and E-AURC across models and navigation splits. These results show that modeling deviation from goal-directed dynamics provides an interpretable and robust signal for anticipating navigation failures in vision-language agents.

#7 h=n/a

Slow Brain, Fast Planner: Latency-Resilient VLM-Augmented Urban Navigation

2026-06-18 cs.RO

Zhenghao "Mark" Peng, Honglin He, Quanyi Li, Yukai Ma, Bolei Zhou

Core Contributions

Diagnoses a 'trajectory scoring gap' in sidewalk navigation: learned planners generate good candidate trajectories but their scoring functions often pick bad ones (onto grass, toward pedestrians) even when better candidates exist.
Rather than replacing the planner with an end-to-end VLA, it adds a VLM-Planner interface where a VLM selects a candidate index and fuses it with the planner's output.
Because VLMs take 1–3s and can't drive a 5–20Hz control loop, it contributes a training-free, latency-resilient fusion layer that turns a stale VLM selection into real-time scoring via geometric similarity with exponential decay.
On ~2,000 challenging real scenarios, VLM selection cuts ADE by 30% versus the planner's own top selection; in simulation, Score Fusion keeps success above 80% with delays up to 5s, and the full system is demonstrated on a real campus robot.

Abstract

Learning-based planners for sidewalk navigation can generate diverse candidate trajectories in real time, yet their scoring functions often fail to select the best trajectory in challenging situations, outputting trajectories that make the mobile robot drive onto grass, toward pedestrians, or in the wrong direction, even when better candidates exist in the same set. We call this the trajectory scoring gap: in real-world sidewalk navigation, the gap between an anchor-based planner's top choice and the best possible candidate is substantial, likely due to limited high-level scene understanding capability of the planner. Rather than replacing the planner with an end-to-end Vision-Language-Action model, we propose a VLM-Planner interface that uses a VLM to select a candidate index from the planner's proposal set and then fuse it with the planner's initial output. However, VLMs take 1--3s per query and so cannot directly drive a 5--20Hz control loop. We contribute a training-free, latency-resilient trajectory-level fusion layer that turns a stale VLM selection into real-time planner scoring via geometric similarity with exponential decay. On $\sim$2,000 challenging real-world scenarios (e.g., junctions, pedestrian encounters), VLM selection achieves 30% ADE reduction versus the planner's best selection, while the planner remains competitive in routine situations. In simulation, Score Fusion maintains >80% success rate with delays up to 5s. We demonstrate the full system on a mobile robot navigating challenging campus sidewalks with varied network latency.
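
The fusion idea (a stale VLM pick re-scored against fresh candidates through geometric similarity, with its influence decaying over time) can be sketched as follows. This is one plausible form under stated assumptions, not the paper's formula: the similarity measure, the additive fusion rule, and the parameters `tau_s`, `gain`, and `scale` are all hypothetical.

```python
import math


def similarity(traj_a, traj_b, scale=1.0):
    """Geometric similarity in (0, 1]: exp of the negative mean pointwise distance."""
    mean_dist = sum(math.dist(p, q) for p, q in zip(traj_a, traj_b)) / len(traj_a)
    return math.exp(-mean_dist / scale)


def fused_scores(candidates, planner_scores, vlm_choice, vlm_age_s,
                 tau_s=2.0, gain=1.0):
    """Add a VLM bonus to each fresh candidate's planner score.

    vlm_choice is the trajectory the VLM selected vlm_age_s seconds ago;
    its influence decays exponentially with that age.
    """
    decay = math.exp(-vlm_age_s / tau_s)
    return [score + gain * decay * similarity(cand, vlm_choice)
            for cand, score in zip(candidates, planner_scores)]
```

Because the bonus is computed per control tick against whatever candidates the planner currently proposes, the loop never blocks on the VLM; an old selection simply fades until a newer one arrives.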

#30 h=n/a

A Neuromorphic Reinforcement Learning Framework for Efficient Pathfinding in Robotic Mobile Fulfillment Systems

2026-06-18 cs.RO, cs.AI

Junzhe Xu, Zecui Zeng, Lusong Li, Yuetong Fang, Renjing Xu

Core Contributions

Targets pathfinding in Robotic Mobile Fulfillment Systems where search- and rule-based methods suffer high complexity and latency, and deploying RL policies with extreme energy efficiency on edge hardware is unsolved.
SDQN-RMFS is a full-stack pipeline: train an ANN policy with a collision-allowing strategy to densify informative trajectories, then convert it to a spiking neural network via hard-label knowledge distillation.
The distillation specifically addresses ANN-to-SNN output distribution mismatch, preserving policy capability while cutting inference latency through event-driven, compute-only-when-triggered operation.
Hardware experiments show up to 11,281x energy savings and nearly 2x lower latency versus a high-performance GPU baseline, with decision quality on par with the original policy.

Abstract

Dynamic environmental changes, confined workspaces, and stringent real-time constraints make pathfinding in Robotic Mobile Fulfillment Systems (RMFS) a challenging problem for conventional search- and rule-based methods, which typically suffer from high computational complexity and long decision latency. While reinforcement learning (RL) has emerged as a powerful alternative, deploying learned policies with extreme energy efficiency on resource-constrained hardware remains an open challenge. We present SDQN-RMFS, an end-to-end framework that achieves high-fidelity deployment of an RL-trained policy from a full-precision artificial neural network (ANN) through to a neuromorphic chip. By computing only when triggered by sparse events, this framework unlocks ultra-low-power RMFS pathfinding. Our full-stack pipeline operates as follows: an ANN policy is first efficiently trained via a collision-allowing strategy to densify informative trajectories, and then converted into a spiking neural network (SNN) via a hard-label knowledge distillation approach. This effectively addresses the output distribution mismatch, preserving policy capability across the ANN-to-SNN pipeline while substantially reducing inference latency. Hardware experiments demonstrate up to 11,281$\times$ energy savings and a nearly two-fold reduction in latency compared to a high-performance GPU baseline, while maintaining decision quality on par with the original trained policy. These results establish physical neuromorphic inference as a practical and energy-sustainable pathway for large-scale RMFS operations.
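
A rough sketch of what hard-label distillation means here, assuming the teacher is a Q-network and the student outputs per-action scores such as spike counts. The teacher's Q-values are reduced to their argmax action, and the student is trained with plain cross-entropy on that label, so the loss depends only on which action the teacher picks and not on the scale of its outputs. Function names and the use of cross-entropy are assumptions; the paper's training details are not given in the abstract.

```python
import math


def hard_labels(teacher_q_values):
    """Reduce each teacher Q-vector to the index of its greedy action."""
    return [max(range(len(q)), key=q.__getitem__) for q in teacher_q_values]


def cross_entropy(student_scores, labels):
    """Mean softmax cross-entropy of student scores against hard labels."""
    total = 0.0
    for scores, label in zip(student_scores, labels):
        peak = max(scores)
        # Log-sum-exp with the maximum subtracted for numerical stability.
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[label]
    return total / len(labels)
```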

SLAM, Mapping & State Estimation

#8 h=n/a

ARC: Adaptive Robust Joint State and Covariance Estimation

2026-06-18 cs.RO

Alexandre Hadji-Thomas, Andrew Stirling, James R. Forbes

Core Contributions

Unifies two normally separate capabilities — outlier rejection and measurement-covariance estimation — into one self-tuning estimator robust to non-Gaussian noise.
Robust estimators downweight outliers but don't estimate covariance, while joint state-covariance estimators assume Gaussian residuals and fixed loss shapes; ARC removes both limitations.
Uses a Block-Coordinate Descent framework combining a norm-aware adaptive robust loss, an IRLS state update, and a Minimum Weighted Covariance Determinant covariance estimator.
On Monte-Carlo simulation and real ultra-wideband localization in cluttered non-line-of-sight settings, it recovers the true inlier covariance and matches or beats baselines with no manual tuning.

Abstract

Sensor measurements are frequently corrupted by outliers and non-Gaussian noise. These imperfections in the sensor data can cause classical state estimators to generate biased and unreliable state and uncertainty estimates. Robust estimators reject or downweight outliers but do not perform measurement covariance estimation, whereas joint state and covariance estimators assume Gaussian residuals and fixed loss shape parameters. Integrating these two capabilities into a single framework is an opportunity to simultaneously estimate both state and covariance in the presence of outliers. This paper proposes a unified Block-Coordinate Descent framework that combines a norm-aware adaptive robust loss, an Iteratively Reweighted Least-Squares state update, and a Minimum Weighted Covariance Determinant covariance estimator, yielding a self-tuning joint state and covariance estimator. The framework is evaluated in a Monte-Carlo simulation and on real-world ultra-wideband localization experiments in cluttered non-line-of-sight environments. Results show that the proposed estimator consistently recovers the true inlier measurement covariance and matches or exceeds the state estimation accuracy of all baselines, without requiring any manual parameter tuning.
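
The alternating structure described above can be illustrated on a scalar toy problem. The sketch below is a deliberately simplified stand-in, not ARC itself: it uses a fixed Cauchy loss instead of the adaptive norm-aware loss, and a hard inlier gate in place of the Minimum Weighted Covariance Determinant estimator. The gate width `c` and the iteration count are arbitrary choices.

```python
import statistics


def robust_mean_and_sigma(z, c=3.0, iters=10):
    """Block-coordinate descent on a scalar state x and noise scale sigma."""
    # Robust initialisation: median and scaled median absolute deviation.
    x = statistics.median(z)
    sigma = max(1.4826 * statistics.median([abs(v - x) for v in z]), 1e-9)
    for _ in range(iters):
        # State block: one IRLS step with Cauchy weights at the current sigma.
        w = [1.0 / (1.0 + ((v - x) / (c * sigma)) ** 2) for v in z]
        x = sum(wi * v for wi, v in zip(w, z)) / sum(w)
        # Covariance block: RMS residual over gated inliers
        # (a crude substitute for MWCD, labelled as such).
        inliers = [v for v in z if abs(v - x) <= c * sigma]
        if inliers:
            rms = (sum((v - x) ** 2 for v in inliers) / len(inliers)) ** 0.5
            sigma = max(rms, 1e-9)
    return x, sigma
```

With eight readings near 10 and two gross outliers at 25 and 30, the plain mean lands at 13.5 while this loop stays close to 10 and reports a scale driven by the inliers only.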

#10 h=n/a

LIT-GS: LiDAR-Inertial-Thermal Gaussian Splatting for Illumination-Robust Mapping

2026-06-18 cs.RO

Shikuan Shi, Chunran Zheng, Jiaming Xu, Tianyong Ye, Tao Yu

Core Contributions

Makes Gaussian Splatting mapping robust to illumination changes and texture-poor scenes by fusing LiDAR, inertial, and thermal sensing — escaping the fragility of RGB photometric cues.
Injects LiDAR-derived plane geometry as an explicit constraint in both pose/structure refinement and Gaussian optimization, using LIV visual map points as confidence-aware cross-modal anchors for thermal-LiDAR association.
Incorporates weighted point-to-plane residuals into bundle adjustment to jointly refine camera poses and 3D points under weak thermal supervision, then adds a LiDAR-plane-regularized splatting objective.
The plane regularization mitigates surface thickening and structural drift in low-contrast thermal imagery, improving geometric accuracy and rendering quality over LIV-based baselines, especially in challenging lighting.

Abstract

Gaussian Splatting has enabled real-time neural rendering, yet existing LiDAR-inertial-visual (LIV) Gaussian mapping pipelines remain fragile under illumination changes and texture-deficient scenes due to their reliance on RGB photometric cues. We present LIT-GS, a LiDAR-inertial-thermal Gaussian Splatting framework that injects LiDAR-derived plane geometry as an explicit constraint in both pose/structure refinement and Gaussian optimization. Specifically, we exploit LIV visual map points as confidence-aware cross-modal anchors to establish reliable thermal-LiDAR associations, and incorporate weighted LiDAR point-to-plane residuals into bundle adjustment to jointly refine camera poses and 3D points under weak thermal supervision. Building on the refined structure, we further introduce a LiDAR-plane-regularized differentiable splatting objective that constrains rendered 3D points to align with locally observed planes, mitigating surface thickening and structural drift in low-contrast thermal imagery. Experiments on proprietary sequences and public datasets demonstrate that LIT-GS consistently improves geometric accuracy and rendering quality over state-of-the-art LIV-based Gaussian Splatting baselines, particularly in challenging lighting conditions.
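
For readers unfamiliar with the term, a weighted point-to-plane residual is the signed distance from a 3D point to a locally fitted plane, scaled by a confidence weight. The sketch below shows that residual and the summed cost it would contribute to a least-squares problem; how LIT-GS derives its weights and planes is not specified in the abstract, so the inputs here are placeholders.

```python
import math


def point_to_plane_residual(point, plane_point, plane_normal):
    """Signed distance from point to the plane through plane_point."""
    norm = math.sqrt(sum(n * n for n in plane_normal))
    return sum((p - q) * n
               for p, q, n in zip(point, plane_point, plane_normal)) / norm


def plane_cost(points, plane_point, plane_normal, weights):
    """Weighted sum of squared point-to-plane residuals."""
    return sum(w * point_to_plane_residual(p, plane_point, plane_normal) ** 2
               for p, w in zip(points, weights))
```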

#15 h=n/a

Towards 3D karst underwater scene reconstruction from rotating sonar data

2026-06-18 cs.RO

Georgios Evangelos Margaritis, Lionel Lapierre, Simon Rohou, Zhi Yan, Andreas Nüchter

Core Contributions

Tackles a hazardous, poorly-mapped domain — karst underwater conduits — where sonar data is sparse and noisy and navigation drifts, defeating standard 3D reconstruction.
Combines a continuous-time SLAM approach to correct trajectory drift with a novel two-stage deep-learning surface-reconstruction method tailored to rotating sonar profiler data.
Produces an immersive, navigable 3D mesh of the conduit suitable for hydrogeological analysis, rather than just a raw point cloud.
Addresses a practical freshwater-resource problem, demonstrating a full pipeline from drift-prone sonar exploration to usable geometry.

Abstract

Karst aquifers provide critical freshwater resources but pose significant hazards due to their complex and poorly understood subsurface geometry. Mapping these environments is challenging because sonar data from underwater exploration is sparse and noisy, while navigation estimates suffer from drift limiting standard 3D reconstruction methods. We present a pipeline for reconstructing underwater karst conduits from a sonar profiler. We combine a continuous-time SLAM approach to correct trajectory drift with a novel two-stage deep learning method for surface reconstruction, producing an immersive and navigable 3D mesh for hydrogeological analysis.

#23 h=n/a

HilDA: Hierarchical Distillation with Diffusion for Advancing Self-Supervised LiDAR Pre-training

2026-06-18 cs.CV, cs.AI, cs.RO

Maciej Wozniak, Jesper Ericsson, Hariprasath Govindarajan, Truls Nyberg, Thomas Gustafsson

Core Contributions

Improves camera-to-LiDAR knowledge distillation for self-supervised autonomous-driving (AD) pretraining, where current methods treat vision foundation models as black-box teachers and rely only on frame-wise feature similarity.
HilDA captures both the 'what' and 'where' via hierarchical distillation: multi-layer distillation for progressive semantic alignment plus global context distillation for scene-level semantics.
Adds a temporal occupancy diffusion objective to promote spatiotemporal consistency across LiDAR sequences — exploiting information prior methods ignore.
Achieves state-of-the-art on cross-modal distillation benchmarks and outperforms prior distillation on 3D object detection, scene flow, and semantic occupancy prediction.

Abstract

Leveraging Vision Foundation Models (VFMs) for camera-to-LiDAR knowledge distillation offers a promising solution to the scarcity of annotated data needed to represent the immense geometric and kinematic diversity of real-world autonomous driving (AD). However, current approaches typically treat VFMs as black-box teachers, relying exclusively on frame-wise feature similarity. Consequently, they do not fully exploit the teacher's layer-wise semantic structure and global context, as well as the rich spatiotemporal information inherent in LiDAR sequences. We propose HilDA, a self-supervised pretraining framework for LiDAR backbones that better captures the semantic what and geometric where needed for driving tasks. HilDA combines hierarchical distillation comprising multi-layer distillation for progressive semantic alignment and global context distillation for scene-level semantics, with a temporal occupancy diffusion objective promoting spatiotemporal consistency. Models pre-trained with HilDA achieve state-of-the-art results on cross-modal distillation benchmarks and outperform models trained via prior distillation approaches on 3D object detection, scene flow, and semantic occupancy prediction. Code available at: https://maxiuw.github.io/hilda.

Continuum, Soft & Co-Designed Hardware

#2 h=n/a

Generating Robot Hands from Human Demonstrations

2026-06-18 cs.RO

Sha Yi, Nicklas Hansen, Xueqian Bai, Carmelo Sferrazza, Michael T. Tolley

Core Contributions

Tackles the hard co-design problem — jointly searching robot body and controller is combinatorial — by generating robot hand designs that use the same simple control policy intended for the fabricated hand.
Instead of learning a controller per candidate design, it optimizes tree-structured hands to reproduce target motions via fingertip-position matching with inverse kinematics, using over 4 million frames of human fingertip motion.
An RL actor proposes good designs and joint angles, slashing design search from hours to minutes, and hands are fabricated as one-piece articulated structures with print-in-place joints.
The resulting 6-DoF hand achieved teleoperated fingertip tracking better than commercial robot hands, while lower-DoF task-specific hands reproduced trajectories with reduced mechanical complexity — large-scale human data optimizing the body, not just the brain.

Abstract

Robot learning has advanced rapidly in learning control, but learning the physical body of a robot remains much more difficult because jointly searching over design and control creates a very large combinatorial problem. Here, we present a data-driven framework for generating robot hands from human demonstrations. Instead of learning a complex controller together with each candidate design, we generate robot hand designs using the same simple control policy used after fabrication: matching fingertip positions through inverse kinematics. Using more than 4 million frames of human fingertip motion from everyday manipulation, our algorithm optimizes tree-structured robot hands to reproduce desired target motions. The framework produced both a 6-degree-of-freedom (DoF) general-purpose hand and lower-DoF task-specific hands with spatial four-bar mimic joints. To accelerate the search over designs, we trained a reinforcement-learning (RL) actor to propose good hand designs and joint angles, reducing search time from hours to minutes. We fabricated the mechanisms directly as one-piece articulated structures with print-in-place joints. In real-world experiments, the 6-DoF hand achieved highly accurate teleoperated fingertip tracking better than available commercial robot hands, whereas the specialized 3-DoF hands reproduced structured human and synthetic trajectories with reduced mechanical complexity. These results showed that large-scale human motion data can be used not only to train robot controllers but also as a reference for optimizing and generating the physical embodiment of robots.
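
The core loop (score a candidate design by how well inverse kinematics can place its fingertip on recorded targets) can be shown on a toy planar two-link finger. This is only an illustration of fingertip-matching as a design objective; the paper optimizes tree-structured hands over millions of frames with an RL actor, none of which is modelled here. Link lengths, units, and function names are hypothetical.

```python
import math


def ik_2link(target, l1, l2):
    """Analytic IK for a planar 2-link finger rooted at the origin.
    Returns joint angles (q1, q2), or None if the target is unreachable."""
    x, y = target
    c2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if abs(c2) > 1.0:
        return None
    q2 = math.acos(c2)
    q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2), l1 + l2 * math.cos(q2))
    return q1, q2


def fingertip_error(target, l1, l2):
    """Distance from the target to the nearest reachable fingertip position
    (the reachable set is an annulus with radii |l1 - l2| and l1 + l2)."""
    r = math.hypot(*target)
    return max(0.0, r - (l1 + l2), abs(l1 - l2) - r)


def best_design(targets, candidates):
    """Pick the (l1, l2) pair with the lowest mean fingertip error."""
    return min(candidates,
               key=lambda d: sum(fingertip_error(t, *d) for t in targets) / len(targets))
```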

#4 h=n/a

Increasing Resilience of Continuum Robots via Motion Planning Algorithms

2026-06-18 cs.RO

Oxana Shamilyan, Ievgen Kabin, Zoya Dyka, Oleksandr Sudakov, Peter Langendoerfer

Core Contributions

Studies motion planning specifically for resilience of continuum robots, optimizing not just for shortest path but for properties that extend time between maintenance operations.
Modifies Genetic and A* algorithms with the Analytical Hierarchy Process to evaluate path quality across four criteria — distance, motor damage, mechanical arm damage, and accuracy.
Multi-criteria decision-making lets the planner trade raw path efficiency for reduced wear, directly targeting the robot's long-term resilience.
Experiments in two simulated environments show the Genetic algorithm's runtime, unlike A*, is independent of environment cardinality and generates more diverse paths, increasing resilience.

Abstract

This paper presents an experimental study of motion planning for resilient continuum robots. In this study we mainly focused on multi-criteria decision-making, its application to path-planning algorithms, and its impact on the generated path and execution time. To do this, we used two well-known path-planning algorithms, namely the Genetic algorithm and the A* algorithm, and modified them by adding the Analytical Hierarchy Process to evaluate the quality of the generated paths. In our experiment the Analytical Hierarchy Process considers four criteria, i.e. distance, motor damage, mechanical damage of the robot's arm, and accuracy, each considered to contribute to the resilience of a continuum robot. Using multiple criteria is necessary to increase the time between maintenance operations of the continuum robot. We conducted the experiments in two different simulated environments of the robot. Although we significantly simplified the robot's model and its environment, we still implemented some features of the environment based on the real robot prototype. In particular, one of the environments has single- as well as multi-path points, and the other consists of multi-path points only. The results show that, in contrast to A*, the execution time of the Genetic algorithm does not depend on the environment's cardinality. It also generates more diverse paths, which increases the robot's resilience.
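
As a reminder of how the Analytical Hierarchy Process turns pairwise judgments into criterion weights, the sketch below extracts the priority vector by power iteration on the pairwise-comparison matrix and then scores a path as a weighted sum of its per-criterion costs. The criterion order (distance, motor damage, arm damage, accuracy error), the cost normalization to [0, 1], and the example numbers are assumptions; the paper's actual comparison matrix is not given in the abstract.

```python
def ahp_weights(pairwise, iters=50):
    """Priority vector of an AHP pairwise-comparison matrix via power iteration.
    pairwise[i][j] states how much more important criterion i is than j."""
    n = len(pairwise)
    w = [1.0 / n] * n
    for _ in range(iters):
        w = [sum(pairwise[i][j] * w[j] for j in range(n)) for i in range(n)]
        total = sum(w)
        w = [v / total for v in w]
    return w


def path_cost(criteria_costs, weights):
    """Weighted sum of normalized per-criterion costs (lower is better)."""
    return sum(c * w for c, w in zip(criteria_costs, weights))
```

With weights that favor low wear, a longer but gentler path can beat the shortest one, which is the trade-off the planner is meant to expose.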

#12 h=n/a

CoLI: A Reproducible Platform for Continuum Robot Learning via Monolithic 3D Printing and Isomorphic Teleoperation

2026-06-18 cs.RO

Ziyuan Tang, Chenxi Xiao

Core Contributions

Addresses reproducibility barriers in continuum robotics — complex fabrication, hard kinematic modeling, and non-intuitive control — that have slowed research and adoption.
Fabricates the arm as a monolithic compliant structure via multi-material 3D printing with minimal assembly, simplifying the build pipeline dramatically.
Controls it through an isomorphic teleoperation interface with direct actuator-level mapping, eliminating explicit kinematic modeling and providing a singularity-free, intuitive mapping.
Supports imitation-learning-based autonomy on top of the hardware, delivering an open-source, reproducible, learning-ready platform for community benchmarking.

Abstract

Continuum robots offer strong potential for manipulation tasks due to their high degrees of freedom, compliant structures, and operational safety. However, their adoption in both research and practical applications has been hindered by reproducibility issues arising from complex fabrication and assembly processes, challenging kinematic modeling, and a lack of intuitive control interfaces. To address these challenges, we present a novel open-source continuum robot design. The platform features a simplified fabrication pipeline enabled by multi-material 3D printing, allowing the arm to be fabricated as a monolithic compliant structure with minimal assembly. Control is achieved through an isomorphic teleoperation interface that establishes a direct actuator-level mapping, eliminating the need for explicit kinematic modeling and providing a singularity-free mapping. Building on this hardware design, the platform further supports imitation-learning-based autonomous control. The proposed system is evaluated through hardware characterization and a set of manipulation tasks. Experimental results demonstrate that the platform provides a reproducible, learning-ready continuum robot system, accelerating algorithmic development and systematic benchmarking for the continuum robotics community.

#22 h=n/a

Belt-Finger: An Affordable Soft Belt-Driven Gripper for Dexterous In-Hand Manipulation

2026-06-18 cs.RO

Boya Zhang, Andreas Zell, Georg Martius

Core Contributions

Upgrades the ubiquitous parallel-jaw gripper — simple and cheap but limited in in-hand mobility — with a double-soft-belt finger module that preserves standard open/close while adding three in-hand DoFs.
The added translation, pitch, and roll let the gripper reposition objects in-hand without large arm motions, broadening dexterity in confined workspaces while keeping manufacturing inexpensive.
Demonstrated in two control pipelines: an MPC for in-hand manipulation of known objects, and a lightweight teleoperation interface controlling arm and gripper (10 DoFs) with minimal hardware.
Across teleoperation, MPC, and trained-policy tasks it consistently improves dexterity and task feasibility over a conventional parallel gripper — a low-cost path to greater manipulation range.

Abstract

Parallel-jaw grippers are the default manipulator choice in robotics because they are simple, robust, and inexpensive. Their limited in-hand mobility, however, often forces large arm motions and restricts dexterous manipulation in confined workspaces. We present a parallel-gripper upgrade: a double-soft-belt-based finger module that preserves standard opening/closing while adding three in-hand degrees of freedom (DoF): translation, pitch, and roll. The mechanism is deliberately kept simple and engineered for inexpensive manufacturing and straightforward integration, preserving the reliability and precise control of traditional parallel grippers while greatly broadening the range of manipulation capabilities. To demonstrate the utility of the added DoFs, we integrate the gripper in two control pipelines. First, we adapt a model predictive controller for in-hand manipulation of known objects. Second, we introduce a lightweight teleoperation interface that enables simultaneous control of the robot arm and gripper (10 DoFs total) with minimal hardware. Across a suite of challenging manipulation tasks executed via teleoperation, MPC, and trained policies, the proposed gripper consistently improves dexterity and task feasibility compared to a conventional parallel gripper.

Tactile, Sim-to-Real & Robotic Automation

#9 h=n/a

TaCauchy: An Extensible FEM Framework for Vision-Based Tactile Simulation

2026-06-18 cs.RO

Hengfei Zhao, Yifan Xie, Junhao Gong, Yue Sun, Kai Zhu

Core Contributions

Provides high-fidelity tactile simulation for RL by integrating rigorous physics-based force computation into Isaac Sim — addressing existing approaches' inability to give accurate mechanical stress fields on GPU platforms.
Built on the Unified Incremental Potential Contact solver, TaCauchy computes Cauchy stress tensors directly from hyperelastic laws and projects them onto contact surfaces for traction and pressure from first principles, not empirical estimation.
Features automatic mesh generation with geometry-aware adaptive refinement and a modular interface for rapid integration of GelSight Mini, DIGIT, and 9DTact sensors.
Hits 33.4 FPS single-environment and 555 FPS aggregate across 60 parallel environments with under 1ms stress overhead, and matches real tactile responses (SSIM above 0.93) from 1.26N to 4.73N.

Abstract

Vision-based tactile sensors require high-fidelity simulation for reinforcement learning, yet existing approaches struggle to provide accurate mechanical stress fields within GPU-accelerated robotics platforms. We present TaCauchy, an extensible Finite Element Method (FEM) framework that integrates rigorous physics-based force computation into Isaac Sim. Built on the Unified Incremental Potential Contact (UIPC) solver, TaCauchy directly computes Cauchy stress tensors from hyperelastic constitutive laws and projects them onto contact surfaces to obtain traction forces and pressure distributions, providing mechanical ground truth from first principles rather than empirical estimation. Our framework features automatic mesh generation with geometry-aware adaptive refinement and a modular sensor interface enabling rapid integration of diverse sensors (GelSight Mini, DIGIT, 9DTact) with minimal configuration. Performance benchmarks demonstrate 33.40 FPS for single environments and 555 FPS aggregate throughput across 60 parallel environments, with stress extraction overhead under 1 ms. Physical validation experiments show strong agreement between simulated and real tactile responses across force ranges from 1.2556 N to 4.7332 N, achieving SSIM above 0.93, confirming the framework's capability to provide accurate, physically-grounded force supervision for downstream robotic manipulation tasks.
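
The chain from constitutive law to contact pressure can be written out in a few lines. The sketch below assumes a compressible Neo-Hookean material, which is a common hyperelastic choice but is not confirmed as the law TaCauchy uses; the Lamé parameters `mu` and `lam` are placeholder values. It evaluates the Cauchy stress sigma = (mu / J)(F F^T - I) + (lam ln J / J) I, then projects it onto a surface normal to get traction and normal pressure.

```python
import math


def cauchy_stress(F, mu=1.0e5, lam=4.0e5):
    """Cauchy stress (Pa) of a compressible Neo-Hookean solid.
    F is a 3x3 deformation gradient given as nested lists."""
    J = (F[0][0] * (F[1][1] * F[2][2] - F[1][2] * F[2][1])
         - F[0][1] * (F[1][0] * F[2][2] - F[1][2] * F[2][0])
         + F[0][2] * (F[1][0] * F[2][1] - F[1][1] * F[2][0]))
    # Left Cauchy-Green tensor b = F F^T.
    b = [[sum(F[i][k] * F[j][k] for k in range(3)) for j in range(3)]
         for i in range(3)]
    volumetric = lam * math.log(J) / J
    return [[mu / J * (b[i][j] - (i == j)) + volumetric * (i == j)
             for j in range(3)] for i in range(3)]


def traction(sigma, normal):
    """Traction vector t = sigma n on a surface with unit normal n."""
    return [sum(sigma[i][j] * normal[j] for j in range(3)) for i in range(3)]


def normal_pressure(sigma, normal):
    """Contact pressure, positive in compression: p = -n . (sigma n)."""
    t = traction(sigma, normal)
    return -sum(ti * ni for ti, ni in zip(t, normal))
```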

#17 h=n/a

Efficiently Linking Real Scenes with Synthetic Data Generation for AI-based Cognitive Robotics and Computer Vision Applications

2026-06-18 cs.RO, cs.CV

Paul Koch, Vivek Chavan, André Sers, Adem Karakurt, Paul Hofmann

Core Contributions

Addresses the persistent sim-to-real domain gap for AI vision in cognitive robotics — semantic analysis, 6D and grasp pose estimation — where training data and architectures must scale beyond domain gaps.
Surveys current limits and trends in the state of the art that challenge precision and scalability, framing the work as bridging real and synthetic data.
Presents work-in-progress on efficiently linking real scenes with synthetic data generation in the training-data pipeline, rather than treating the two as separate sources.
Aims at industrial and household cognitive-robotics use cases where robust, scalable perception across the domain gap is the bottleneck.

Abstract

AI vision models are a driving factor for the potential use cases of cognitive robotics in industry and household applications. A large array of methods, from semantic environment analysis to 6D and grasping pose estimation, has been proposed based on the latest AI achievements. However, such advancements require further strong and efficient methods w.r.t. training data and AI architectures, which together can tackle current challenges, precision limits, and scalability beyond domain gaps. In this paper, we discuss these current limits and the trends in the related state of the art that challenge them. Further, we discuss our current work in progress on bridging the domain gap between simulation and real-world applications by linking the two in training data generation.

#24 h=n/a

Robust Assembly State Reasoning from Action Recognition for Human-Robot Collaboration

2026-06-18 cs.RO

James Fant-Male, Roel Pieters

Core Contributions

Investigates the under-studied problem of robustly tracking assembly state from Human Action Recognition in human-robot collaboration, which is non-trivial in realistic, noisy scenarios.
Systematically compares five state-tracking approaches — logic-based, Hidden Markov Model, and neural network — across two diverse datasets, rather than advocating a single method.
Finds no universally best approach: NN and HMM methods excel in low-variability tasks, while logic-based methods are more robust elsewhere, and methods modeling expected action duration matter for repeated actions.
Tests with both simulated inputs at varying noise levels and realistic HAR-model inputs, giving practical guidance on which tracker fits which collaborative task.

Abstract

Human Action Recognition (HAR) is frequently investigated in Human-Robot Collaboration (HRC) research to understand what actions have been performed and hence the state of a collaborative task. Accurately tracking an assembly state from HAR is however not fully investigated, and in realistic scenarios is not a trivial task. This research systematically investigates and compares methods for tracking assembly state using action recognition inputs. Investigations using two diverse datasets and five state tracking approaches, including logic-based, Hidden Markov Model (HMM), and neural network (NN) methods, show that optimal approaches are not uniform across different tasks and that different methods fail under different circumstances. Testing is performed using both simulated inputs with varying noise levels and realistic inputs from a HAR model. Results show NN and HMM methods can perform well in tasks with limited variability, but for other scenarios logic-based approaches can be more robust. Methods which model expected action duration are also important for tasks with repeated actions where no additional sensing is provided.
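
Of the tracker families compared, the HMM variant is the easiest to show compactly. The sketch below runs the standard forward (filtering) recursion over a hypothetical three-state assembly, where each recognized action label is a noisy observation of the current state. The transition and emission tables in the usage example are invented for illustration and do not come from the paper's datasets.

```python
def forward_filter(prior, transition, emission, observations):
    """HMM forward filtering.

    prior[i]          : initial probability of assembly state i
    transition[i][j]  : probability of moving from state i to state j
    emission[j][o]    : probability of observing action label o in state j
    Returns the posterior belief over states after all observations.
    """
    n = len(prior)
    belief = list(prior)
    for obs in observations:
        # Predict the next state, then weight by the observation likelihood.
        predicted = [sum(belief[i] * transition[i][j] for i in range(n))
                     for j in range(n)]
        updated = [predicted[j] * emission[j][obs] for j in range(n)]
        total = sum(updated)
        belief = [u / total for u in updated]
    return belief
```

A left-to-right transition table (states can only stay or advance) lets the filter shrug off an isolated misrecognized action, since returning to an earlier assembly state has zero probability.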

#26 h=n/a

Dual-Agent Framework for Cross-Model Verified Translation of Natural-Language Protocols into Robotic Laboratory Platform

2026-06-18 cs.RO, cs.AI

Hyeonna Choi, Jung Yup Kim, Hyuneui Lim, Seunggyu Jeon

Core Contributions

Bridges the semantic gap between natural-language biological protocols and predefined automation commands, focusing on microplate experiments that require coordinating well mapping, sample-reagent combos, replicates, and parallel dispensing.
Uses a dual-agent design: a Parser Agent formalizes the protocol into a structured representation, and a rule-based mapping engine deterministically applies platform constraints to generate device-level commands.
A heterogeneous LLM Validation Agent checks completeness, parameter accuracy, and execution order, triggering a self-correction loop with structured feedback when errors appear — cross-model verification.
A sweep of 7 Parsers and 3 Validators on ELISA protocols quantifies how model scale and validator type affect accuracy, and a real Bradford protein-quantification assay validates end-to-end autonomous execution.

Abstract

Biological experiment protocols are written in natural language, whereas automation systems rely on predefined control commands, creating a semantic gap that limits autonomous execution. Microplate-based automatic experiments are particularly challenging due to the need to simultaneously control well mapping, sample-reagent combinations, replicate placement, and parallel dispensing. This study proposes an agent-based protocol translation framework that converts natural-language microplate-based protocols into executable control commands for a robotic laboratory platform. A Parser Agent formalizes the natural-language protocol into a structured representation, and a rule-based mapping engine deterministically incorporates the operational constraints of the robotic laboratory platform to generate device-level control commands. A heterogeneous LLM Validation Agent verifies completeness, parameter accuracy, and execution order, and triggers a self-correction loop with structured feedback when errors are detected. A sweep involving 7 Parsers and 3 Validators on randomly selected ELISA protocols evaluates how model scale and Validator type affect translation accuracy and pass rates under cross-model verification. The accuracy-latency trade-off is further verified by comparing the rule-based mapping of the proposed framework with LLM end-to-end direct mapping. Finally, Bradford assay-based protein quantification using a microplate was demonstrated on a robotic laboratory platform, validating end-to-end autonomous execution from natural-language protocols to real-world experiments. The proposed framework provides a flexible approach to narrowing the semantic gap between natural-language protocols and microplate-based self-driving laboratories.
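
The split between an LLM parser and a deterministic mapping engine is easier to picture with a toy mapper. In the sketch below, the structured step format, the command fields, the replicate layout, and the validation checks are all hypothetical; they only illustrate how rule-based well mapping and a completeness check could work on a 96-well plate, not the schema used by the authors.

```python
ROWS = "ABCDEFGH"


def well_id(index, n_cols=12):
    """Row-major well name on a 96-well plate: 0 -> A1, 12 -> B1."""
    row, col = divmod(index, n_cols)
    return f"{ROWS[row]}{col + 1}"


def map_to_commands(steps, replicates=3):
    """Deterministically expand parsed steps into per-well dispense commands.
    Each sample gets `replicates` adjacent wells, assigned on first use."""
    commands, wells_by_sample, next_well = [], {}, 0
    for step in steps:
        sample = step["sample"]
        if sample not in wells_by_sample:
            wells_by_sample[sample] = [well_id(next_well + r)
                                       for r in range(replicates)]
            next_well += replicates
        for well in wells_by_sample[sample]:
            commands.append({"op": "dispense", "reagent": step["reagent"],
                             "volume_ul": step["volume_ul"], "well": well})
    return commands


def validate(steps, commands, replicates=3):
    """Return a list of error strings; an empty list means the check passed."""
    errors = []
    if len(commands) != len(steps) * replicates:
        errors.append("command count does not match steps x replicates")
    errors += [f"non-positive volume in well {c['well']}"
               for c in commands if c["volume_ul"] <= 0]
    return errors
```

In a self-correction loop, a non-empty error list would be fed back to the parser as structured feedback before any command reaches the hardware.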

Control, Formal Methods & Multi-Robot Systems

#3 h=n/a

The Token Is a Group Element: On Lie-Algebra Attention over Matrix Lie Groups

2026-06-18 cs.LG, cs.CV, cs.GR, cs.RO, math.DG

Przemyslaw Musialski

Core Contributions

Proposes a radically minimal attention formulation: a token IS an element of a matrix Lie group — a bare transformation with no feature payload — claimed to be the first attention whose tokens are bare group elements.
The pairwise score is the negative squared algebra norm of the relative pose, -‖log(g_i^-1 g_j)‖²/τ, rather than a learned kernel; equivariance under the group action becomes tautological and the cocycle condition holds automatically.
Crucially it reaches non-compact, non-abelian affine groups with scale and shear — Aff(2) — that irrep-based and surjective-exp methods must exclude, with no spherical harmonics or Clebsch-Gordan machinery.
On SE(2), SO(3), and Aff(2) sequence completion, the closed-form score matches a learned MLP kernel on the same invariant and outperforms it on SE(2) while using 50–80x fewer score parameters; a vector-token baseline breaks invariance by five to twelve orders of magnitude.

Abstract

We place the attention token on the group: a token is an element $g_i$ of a matrix Lie group $G$ -- a bare transformation, with no feature payload and no external action $\rho(g)$ carrying it. To our knowledge this is the first attention construction whose tokens are bare matrix Lie group elements: their score is the closed-form algebra norm of the relative pose rather than a learned kernel, and it reaches the affine full-frame groups that every irrep- or surjective-exp-based method must exclude. We call it Lie-Algebra Attention. Once tokens are group elements, the rest follows with none of the usual representation-theoretic machinery. The relative geometry of a pair is canonical, $g_i^{-1} g_j$, so the pairwise invariant $w_{ij} = \log(g_i^{-1} g_j)$ is intrinsic rather than designed; equivariance under the diagonal $G$-action is tautological, and the cocycle condition holds automatically. The attention score is the negative squared algebra norm, $s_{ij} = -\|\log(g_i^{-1} g_j)\|_\lambda^2/\tau$: the canonical proximity kernel under a block-weighted Frobenius inner product, with no irreducible representations, spherical harmonics, Clebsch-Gordan products, or learned kernel. The construction applies to any matrix Lie group on a chosen logarithm chart containing the relative poses, including the non-compact non-abelian affine groups with scale and shear that no vector-token attention method reaches: neither the irrep tradition nor surjective-exp methods. Three sequence-completion experiments, on SE(2), SO(3), and Aff(2), bear this out: the closed-form score matches a learned MLP kernel on the same invariant and outperforms it on SE(2), using 50 to 80x fewer score parameters, while a vector-token baseline breaks invariance by five to twelve orders of magnitude.
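
To make the score formula concrete, here is a minimal sketch on SE(2), where the logarithm has a closed form. It is not the paper's implementation: tokens are stored as hypothetical `(theta, x, y)` tuples, and the two weights `lam_rot` and `lam_trans` are an assumed stand-in for the block-weighted norm, whose exact normalisation the abstract does not give.

```python
import math

def inverse(g):
    """Inverse of an SE(2) element g = (theta, x, y)."""
    theta, x, y = g
    c, s = math.cos(theta), math.sin(theta)
    return (-theta, -(c * x + s * y), s * x - c * y)

def compose(a, b):
    """Group product a * b on SE(2)."""
    ta, xa, ya = a
    tb, xb, yb = b
    c, s = math.cos(ta), math.sin(ta)
    return (ta + tb, xa + c * xb - s * yb, ya + s * xb + c * yb)

def log_se2(g):
    """Closed-form logarithm, returned as algebra coordinates (theta, ux, uy)."""
    theta, x, y = g
    theta = math.atan2(math.sin(theta), math.cos(theta))  # wrap to (-pi, pi]
    if abs(theta) < 1e-9:
        return (theta, x, y)  # V is the identity to first order
    a = math.sin(theta) / theta
    b = (1.0 - math.cos(theta)) / theta
    det = a * a + b * b
    # Solve V u = p with V = [[a, -b], [b, a]].
    return (theta, (a * x + b * y) / det, (-b * x + a * y) / det)

def scores(tokens, lam_rot=1.0, lam_trans=1.0, tau=1.0):
    """s_ij = -||log(g_i^{-1} g_j)||^2 / tau with an assumed two-block weighting."""
    out = []
    for gi in tokens:
        gi_inv = inverse(gi)
        row = []
        for gj in tokens:
            th, ux, uy = log_se2(compose(gi_inv, gj))
            row.append(-(lam_rot * th * th + lam_trans * (ux * ux + uy * uy)) / tau)
        out.append(row)
    return out

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    e = [math.exp(v - m) for v in row]
    z = sum(e)
    return [v / z for v in e]
```

Left-multiplying every token by the same group element h leaves each relative pose unchanged, since (h g_i)^-1 (h g_j) = g_i^-1 g_j, so the score matrix is identical up to floating-point error. That is the sense in which invariance needs no extra machinery here.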

#11 h=n/a

Agentic AutoResearch for Space Autonomy: An Auditable, LLM-Driven Research Agent for Aerospace Control Problems

2026-06-18 cs.RO, math.OC

Amit Jain, Richard Linares

Core Contributions

Builds an auditable, LLM-driven research agent for aerospace control that automates the experimental loop — architecture/hyperparameter choice, runs, and judging whether improvements are real or seed noise.
Embeds a credibility layer that certifies every reported result against the problem's own measured seed noise, so no result is credited until it passes three checks: seed-noise measurement, reseeded verification, and leave-one-out pruning of edits.
Carefully scopes the LLM's role: it is only the offline research agent producing the policy; the trained policy is deployed onboard and the model itself never operates the spacecraft.
Applied unchanged to Clohessy-Wiltshire rendezvous and keep-out-zone docking, the audited policy clears measured seed noise by many standard deviations while undirected search yields no feasible docking policy.

Abstract

Spacecraft guidance, navigation, and control functions are increasingly realized as learned policies distilled from expert solvers. Developing such a policy is itself a research process: an investigator selects an architecture and hyperparameters, runs experiments, and must determine whether an apparent improvement is genuine or merely seed noise. This paper presents AutoResearch, a framework in which a large language model autonomously drives that loop for aerospace control problems, coupled with a credibility layer, built into the loop, that certifies each reported result against the problem's own measured seed noise. The language model serves only as the offline research agent that develops the control policy; the trained policy it produces is then deployed onboard the spacecraft, while the model itself never operates the vehicle. At each iteration the agent reads a plain-language problem description and the run history, proposes a single edit to the training script, executes it, and logs the outcome. No reported result is credited until it passes the same three checks: measured per-problem seed noise, reseeded verification of the best configuration, and leave-one-out pruning of the agent's edits. The same loop is applied, unchanged, to two aerospace control problems: a Clohessy-Wiltshire relative rendezvous and a safety-constrained collision-avoidance docking past a keep-out zone, each calibrated against a known optimal control benchmark. In both, the audited policy clears the measured seed noise by many standard deviations; an undirected search over the same parameters does not. On the docking problem the gap becomes categorical: undirected search yields no feasible policy, while the learned policy stays outside the keep-out zone on every seed.
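
One way to picture the three checks is as a pair of small statistical gates. The sketch below is an assumption-laden illustration, not the paper's procedure: the threshold `k`, the "higher is better" score convention, and the names `clears_noise` and `prune_edits` are all hypothetical.

```python
import statistics

def seed_noise(scores):
    """Sample standard deviation of one configuration's score across seeds."""
    return statistics.stdev(scores)

def clears_noise(baseline_scores, candidate_scores, k=3.0):
    """True if the mean gain exceeds k times the measured seed noise.

    Calling this again with scores from fresh seeds plays the role of the
    reseeded verification step. Assumes higher scores are better.
    """
    sigma = seed_noise(baseline_scores)
    gain = statistics.mean(candidate_scores) - statistics.mean(baseline_scores)
    return gain > k * sigma

def prune_edits(edits, score_fn, sigma, k=3.0):
    """Leave-one-out pruning: keep an edit only if removing it costs > k * sigma."""
    full = score_fn(edits)
    kept = []
    for edit in edits:
        without = [e for e in edits if e != edit]
        if full - score_fn(without) > k * sigma:
            kept.append(edit)
    return kept
```

In this toy form, an edit whose removal changes the score by less than the noise floor is treated as unproven and dropped, so only edits with a measurable effect survive into the reported configuration.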

#13 h=n/a

An Infrastructure-less, Control-Independent Solution to Relative Localisation of a Team of Mobile Robots using Ranging Measurements

2026-06-18 cs.RO, cs.MA

Paolo Golinelli, Tommaso Faraci, Daniele Fontanelli

Core Contributions

Solves relative localization for robot teams without fixed infrastructure, anchors, or motion control to ensure observability — relying only on local odometry, sparse inter-agent ranging, and short-range communication.
Unlike most existing approaches, it needs no control over the robots' motion to keep the team observable, which suits deployments that must be fast, flexible, and minimal in requirements.
Adopts a multi-hypothesis Bayesian framework that maintains the full set of feasible solutions, ensuring robustness under transient unobservable conditions.
Through information sharing, each agent benefits from the whole group's estimates even under partial connectivity — a fully decentralized, anchor-less, control-independent solution.

Abstract

The ability to localise teams of robots is essential for applications ranging from robotic fleets in unstructured environments to cooperative control and navigation tasks. In such contexts, fixed infrastructure is often unavailable, deployments must be fast and flexible, and system requirements must be minimal. We present a decentralised cooperative localisation algorithm that addresses all these challenges at once. The method is anchor-less, fully decentralised, and, unlike most existing approaches, does not require controlling the robots' motion to ensure team observability. It relies only on local odometry, sparse inter-agent ranging measurements, and short-range communication, all of which are widely available in practice. The algorithm adopts a multi-hypothesis Bayesian framework that maintains the entire set of feasible solutions, ensuring robustness under transient unobservable conditions. Moreover, through information sharing, each agent benefits from the estimates of the entire group, even in partially connected conditions.
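
A single range measurement constrains a neighbour to a circle, so several relative positions can remain equally plausible. The sketch below shows a generic multi-hypothesis Bayesian update that keeps all of them; it is not the authors' algorithm. The Gaussian range model, the `(x, y, weight)` representation, and the parameters `sigma` and `min_weight` are assumptions.

```python
import math

def update_hypotheses(hyps, measured_range, sigma=0.1, min_weight=1e-3):
    """Reweight candidate relative positions against one range measurement.

    hyps is a list of (x, y, weight) tuples in the observer's frame.
    Hypotheses that remain consistent with the range are all kept.
    """
    weighted = []
    for x, y, w in hyps:
        predicted = math.hypot(x, y)
        lik = math.exp(-0.5 * ((measured_range - predicted) / sigma) ** 2)
        weighted.append((x, y, w * lik))
    total = sum(w for _, _, w in weighted)
    if total == 0.0:
        return list(hyps)  # measurement rules out everything: keep the prior
    kept = [(x, y, w / total) for x, y, w in weighted if w / total >= min_weight]
    norm = sum(w for _, _, w in kept)
    return [(x, y, w / norm) for x, y, w in kept]
```

With candidates at (3, 4) and (-3, 4) and a measured range of 5, both mirror-image hypotheses survive with equal weight. Only later odometry and further ranges can separate them, which is why retaining the whole feasible set helps through transiently unobservable stretches.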

#14 h=n/a

Autonomous Driving with Priority-Ordered STL Specifications Under Multimodal Uncertainty

2026-06-18 cs.RO

Taha Bouzid, Shuhao Qi, Mircea Lazar, Sofie Haesaert

Core Contributions

Plans autonomous-driving trajectories against many requirements (safety, comfort, traffic rules) and prioritizes among them in safety-critical scenarios where not all can be satisfied simultaneously.
Incorporates a predefined lexicographic ordering over Signal Temporal Logic specifications that stays valid under uncertainty — encoding which rules to sacrifice first.
Explicitly accounts for multimodal uncertainty in surrounding-traffic predictions (other vehicles, pedestrians) rather than assuming deterministic forecasts.
Implements the formulation with Model Predictive Path Integral control, demonstrating efficient handling of conflicting objectives under realistic multimodal uncertainty in simulation.

Abstract

Autonomous vehicles must plan trajectories that satisfy a multitude of requirements on safety, passenger comfort, and compliance with traffic rules. However, in safety-critical scenarios, it is not always possible to satisfy all requirements simultaneously, necessitating their prioritization based on importance. At the same time, in these safety-critical scenarios, the uncertainty in trajectory predictions of the surrounding traffic, such as other vehicles and pedestrians, should be explicitly accounted for. In this work, we propose an uncertainty-aware trajectory planning framework that incorporates a predefined lexicographic ordering over Signal Temporal Logic (STL) specifications that stays valid under uncertainty. We implement this formulation with Model Predictive Path Integral (MPPI) control and we demonstrate the effectiveness of our method on simulation scenarios, showing that our framework efficiently handles conflicting objectives under realistic multi-modal uncertainty.
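
The idea of a lexicographic ordering under multimodal predictions can be illustrated by ranking candidate trajectories on a tuple of expected violations, highest-priority specification first. This is a simplified selection rule, not the paper's MPPI formulation; the robustness values, the candidate names, and the use of an expectation over modes are assumptions.

```python
def expected_violations(robustness_by_mode, mode_probs):
    """Expected violation per specification, ordered by priority.

    robustness_by_mode[m][k] is the STL robustness of spec k under
    prediction mode m; negative robustness means the spec is violated.
    """
    n_specs = len(robustness_by_mode[0])
    out = []
    for k in range(n_specs):
        v = sum(p * max(0.0, -rho[k])
                for rho, p in zip(robustness_by_mode, mode_probs))
        out.append(round(v, 6))  # rounding keeps float noise from breaking ties
    return tuple(out)

def pick_lexicographic(candidates, mode_probs):
    """Choose the candidate whose violation tuple is lexicographically smallest."""
    return min(candidates,
               key=lambda name: expected_violations(candidates[name], mode_probs))
```

Python compares tuples element by element, so a candidate that risks any violation of the first specification loses to one that does not, however well it does on lower-priority ones.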

#19 h=n/a

Mobile Target Search with Imperfect Perception: A Partially Observable Stochastic Game Theoretical Approach

2026-06-18 cs.RO, cs.GT

Hanzheng Zhang, Shu Liang, Shuyu Liu

Core Contributions

Formulates mobile target search under imperfect perception (sensor limits, jamming, communication noise) as a partially observable stochastic game, generalizing POMDPs to include target intelligence and evasion.
Introduces a novel detectability concept to determine whether a search strategy guarantees eventual detection despite false alarms and missed detections, with sufficient criteria from stochastic recurrence analysis.
Develops a server-assisted distributed algorithm exploiting an aggregative potential-game structure for searchers and a KL-divergence-based reduction for predicting the target.
Numerical simulations validate both the algorithm's effectiveness and the detectability analysis — adversarial, perception-aware search rather than static coverage.

Abstract

This paper investigates mobile target search under imperfect perceptions caused by sensor limitations, malicious jamming, or communication noise. Searchers and targets operate in a grid-shaped area with bounded mobility, leading to a dynamic interplay between search and evasion. To capture this adversarial interaction under imperfect perceptions, we adopt the partially observable stochastic game (POSG) approach, which generalizes partially observable Markov decision processes (POMDPs) by incorporating target intelligence. To handle false alarms and missed detections caused by perceptual uncertainties, we propose a novel detectability concept to determine whether a search strategy guarantees eventual detection, and provide sufficient detectability criteria based on stochastic recurrence analysis. We further develop a server-assisted distributed algorithm that utilizes the aggregative potential game structure for searchers and a KL-divergence-based reduction for target prediction. Numerical simulations validate the effectiveness of the proposed algorithm and support the detectability analysis.
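
False alarms and missed detections enter a search problem through the sensor likelihood. The sketch below is a textbook Bayes update of a belief over grid cells for a stationary target. It is not the paper's POSG algorithm and ignores target motion and evasion; the detection probability `p_d` and false-alarm probability `p_fa` are assumed parameters.

```python
def update_belief(belief, searched_cell, detected, p_d=0.8, p_fa=0.1):
    """Posterior over target cells after one look at searched_cell.

    belief maps cell -> prior probability. p_d is the chance of a detection
    when the target is in the searched cell; p_fa is the chance of a false
    alarm when it is elsewhere.
    """
    posterior = {}
    for cell, prior in belief.items():
        if cell == searched_cell:
            lik = p_d if detected else 1.0 - p_d
        else:
            lik = p_fa if detected else 1.0 - p_fa
        posterior[cell] = prior * lik
    total = sum(posterior.values())
    return {cell: v / total for cell, v in posterior.items()}
```

A missed detection lowers the belief in the searched cell without zeroing it, and a detection raises it without making it certain, so a single look never settles the question under imperfect perception.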

#21 h=n/a

Stable Transformer-Actor-Critic Model Predictive Control: A Contraction Analysis Approach

2026-06-18 cs.RO

Antonio Marino, Valerio Modugno, Marco Cognetti

Core Contributions

Provides formal closed-loop stability guarantees for Actor-Critic MPC that uses sequence-based learning models, which normally lack such guarantees despite handling non-convex control well.
Proves Transformer networks can satisfy global incremental Input-to-State Stability — a foundational result enabling certified use of Transformers inside predictive control.
Applies Riemannian contraction theory to analyze the coupled dynamics of the physical plant and the predictive neural network, deriving robustness bounds.
Integrates those bounds as a training regularizer to yield a certifiably robust policy, validated on a nonlinear 3D drone doing target-reaching and obstacle avoidance.

Abstract

Actor-Critic Model Predictive Control (MPC) effectively addresses complex, non-convex control problems, but guaranteeing the closed-loop stability of sequence-based learning models within these pipelines remains challenging. This paper introduces a novel Transformer-Actor-Critic MPC architecture with formal robustness guarantees. First, we prove that Transformer networks can satisfy global incremental Input-to-State Stability ($\delta$ISS). We then leverage Riemannian contraction theory to analyze the interconnected dynamics between the physical plant and the predictive neural network. Finally, we integrate these theoretical bounds as a training regularizer to yield a certifiably robust policy. The framework is validated on a nonlinear 3D drone model executing target-reaching and obstacle-avoidance maneuvers.
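
For readers unfamiliar with the two notions, the standard textbook statements are given here in discrete time. The discrete-time setting and the notation are assumptions, and the paper's exact formulation may differ. A system $x_{k+1} = f(x_k, u_k)$ is incrementally input-to-state stable if there exist $\beta \in \mathcal{KL}$ and $\gamma \in \mathcal{K}_\infty$ such that, for any two initial states $\xi_1, \xi_2$ and input sequences $u_1, u_2$,

$$
\|x_k(\xi_1, u_1) - x_k(\xi_2, u_2)\| \le \beta(\|\xi_1 - \xi_2\|, k) + \gamma\Big(\sup_{0 \le j < k} \|u_1(j) - u_2(j)\|\Big).
$$

A sufficient contraction-type condition for the unforced map, with a uniformly bounded positive-definite metric $M(x)$ and rate $\alpha \in (0, 1)$, is

$$
\Big(\frac{\partial f}{\partial x}\Big)^{\top} M\big(f(x)\big)\, \frac{\partial f}{\partial x} \preceq \alpha\, M(x).
$$

Both conditions say that any two trajectories forget their initial separation and stay close when their inputs are close, which is the kind of bound that can be penalised during training.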