🤖 Robotics arXiv Digest

Thursday, April 16, 2026

📄 28 papers 📂 7 research areas ✨ Generated by Claude

🔭 Research Landscape

Today's batch reveals a robotics community converging on policy learning architectures that reason beyond single-step action prediction. The World-Value-Action (WAV) model addresses the exponential decay in the probability of feasible trajectories as the planning horizon grows by moving planning out of action space and into a learned latent space scored by a trajectory value function. HiST-AT introduces hierarchical spatiotemporal action tokenization for in-context imitation learning, and R3D diagnoses why 3D policy learning has historically underperformed, attributing it to the omission of 3D data augmentation and the adverse effects of Batch Normalization rather than to fundamental architectural limitations. Together with ADAPT's affordance-aware planning and DockAnywhere's viewpoint-invariant demonstration generation, these papers collectively argue that the next leap in robot manipulation requires structured reasoning about what will happen and whether it should happen, not just faster imitation of demonstrations.

A second prominent thread is SLAM and localization in extreme or degraded environments. The CAVERS dataset provides a multimodal SLAM benchmark recorded inside a natural karstic cave with motion-capture ground truth, while CAL2M tackles kilometer-scale SLAM using Visual Geometry Foundation Models without any temporal or spatial pre-calibration. Meanwhile, two 4D radar papers (Graph Theoretical Outlier Rejection for open-pit mines and 4D Radar Gaussian Modeling with RCS) suggest that radar is maturing as a primary sensing modality for GPS-denied, visually degraded settings. The Dual Pose-Graph system for drone racing achieves a 56–74% ATE reduction over standalone VIO by fusing semantic gate detections with odometry, showing that domain structure can compensate for sensor limitations at extreme speeds.

A cross-cutting observation is the growing investment in infrastructure and datasets as first-class research contributions. DigiForest deploys heterogeneous robots (aerial, legged, marsupial) for precision forestry across multiple European sites; HRDexDB provides 1.4K grasping trials with synchronized tactile, visual, and kinematic data across human and robotic hands; and the multi-platform LiDAR forestry dataset links point clouds with decades of ecological flux measurements. The DEX-Mouse open-source teleoperation interface (under $150) further lowers the barrier to collecting dexterous manipulation data. This infrastructure turn suggests the community recognizes that scaling robot capabilities requires not just better algorithms but better data pipelines and benchmarks.

🧠 VLA & Policy Learning

Latent-space planning, hierarchical action tokenization, 3D policy architectures, affordance reasoning, and viewpoint-invariant imitation.

#6 HiST-AT — Hierarchical spatiotemporal tokenizer
#10 WAV — World-Value-Action implicit planning
#12 ADAPT — Affordance-aware commonsense planning
#17 DockAnywhere — View-generalized mobile manipulation
#21 R3D — Revisiting 3D policy learning

📍 SLAM & Localization

Visual-inertial-ranging fusion, calibration-free large-scale SLAM, cave datasets, and semantic pose graphs for drone racing.

#2 Sylvester Pose — Efficient closed-form solvers
#3 Dual Pose-Graph — Semantic drone racing loc.
#16 CAVERS — Cave multimodal SLAM dataset
#27 CAL2M — Calibration-free km-scale SLAM
#28 CT-VIR — Continuous-time VIR fusion

🌲 Radar, LiDAR & Field Sensing

Precision forestry with heterogeneous robots, multi-platform LiDAR datasets, and 4D radar scan matching advances.

#1 DigiForest — Digital forestry with autonomous robots
#7 Multi-platform LiDAR — Forest inventory dataset
#8 Graph PCM — 4D radar outlier rejection
#24 4D Radar Gaussian — RCS-aware scan matching

🤲 Manipulation & Grasping

Differentiable regrasp planning, large-scale dexterous grasping datasets, low-cost teleoperation, and POMDP-based object search.

#4 Differentiable Regrasp — EBM pose connectivity
#14 HRDexDB — Human & robotic dexterous grasps
#18 GNPF-kCT — POMDP object search in 3D
#20 DEX-Mouse — $150 teleoperation interface

🗺️ Navigation & Path Planning

Bio-inspired path planning, coverage planning benchmarks, multi-UAV trajectory optimization, and assistive trajectory frameworks.

#13 NEAT-NC — Neuro-evolution navigation cells
#19 Enhanced Tube-RRT* — Multi-UAV cascaded transport
#22 MHHTOF — Assistive trajectory optimization
#26 Hex Coverage — Maritime CPP benchmark

🦿 Humanoid & Legged Locomotion

Multi-skill switching for humanoids and passive body dynamics for energy-efficient biped walking and running.

#9 Switch — Agile humanoid skill transitions
#15 Passive Biped — Body dynamics exploit for RL

🛡️ Sim2Real, Control & Safety

Abstract sim2real transfer, energy-regularized neural MPC, conformal-prediction HRC safety, and robotic waste management.

#5 Abstract Sim2Real — Coarse simulator transfer
#11 Safe HRC — Conformal prediction guarantees
#23 Smart Waste — Robotic bio-digestor framework
#25 Energy MPC — Regularized neural MPC for UAVs

🧠 VLA & Policy Learning

6 h=21

A Hierarchical Spatiotemporal Action Tokenizer for In-Context Imitation Learning in Robotics

2026-04-16 cs.RO Quoc-Huy Tran · h=21

Fawad Javed Fateh, Ali Shah Ali, Murad Popattia, Usman Nizamani, Andrey Konin

Core Contributions

Introduces a two-level vector quantization hierarchy where a lower level assigns actions to fine-grained subclusters and an upper level groups these into coarser clusters — unlike flat VQ approaches, this preserves both local action precision and global structure
Extends the spatial tokenizer with temporal cues by jointly recovering input actions and their timestamps, enabling the model to capture motion dynamics rather than treating actions as unordered sets
Achieves new state-of-the-art on multiple simulation and real-robot manipulation benchmarks for in-context imitation learning, where the agent must generalize from a handful of demonstrations at test time
Demonstrates that the hierarchical design consistently outperforms its non-hierarchical counterpart, suggesting that multi-resolution action representations are key to efficient few-shot policy transfer
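
The two-level quantization described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the codebooks are fixed toy arrays rather than learned, the nearest-neighbor rule and the function names `nearest_code` and `hierarchical_tokenize` are assumptions, and the temporal (timestamp recovery) part of HiST-AT is omitted.

```python
import numpy as np

def nearest_code(x, codebook):
    """Index of the closest codebook row (Euclidean) for each row of x."""
    d = np.linalg.norm(x[:, None, :] - codebook[None, :, :], axis=-1)
    return d.argmin(axis=1)

def hierarchical_tokenize(actions, sub_codebook, cluster_codebook):
    """Two successive quantization levels: actions -> subclusters -> clusters."""
    # Lower level: assign each action to a fine-grained subcluster.
    sub_ids = nearest_code(actions, sub_codebook)
    # Upper level: map each subcluster center to a coarser cluster.
    sub_to_cluster = nearest_code(sub_codebook, cluster_codebook)
    cluster_ids = sub_to_cluster[sub_ids]
    # Reconstruction uses the fine-grained code, preserving local precision.
    recon = sub_codebook[sub_ids]
    return sub_ids, cluster_ids, recon

# Toy 2-D "actions" with hand-picked (hypothetical) codebooks.
sub_codebook = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.2, 1.0]])
cluster_codebook = np.array([[0.1, 0.0], [1.1, 1.0]])
actions = np.array([[0.05, 0.0], [0.19, 0.01], [1.15, 1.0]])

sub_ids, cluster_ids, recon = hierarchical_tokenize(
    actions, sub_codebook, cluster_codebook
)
```

Each action ends up with a fine token and a coarse token; the first two toy actions fall into different subclusters but share one cluster, which is the multi-resolution structure the bullets describe.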

Abstract

We present a novel hierarchical spatiotemporal action tokenizer for in-context imitation learning. We first propose a hierarchical approach, which consists of two successive levels of vector quantization. In particular, the lower level assigns input actions to fine-grained subclusters, while the higher level further maps fine-grained subclusters to clusters. Our hierarchical approach outperforms the non-hierarchical counterpart, while mainly exploiting spatial information by reconstructing input actions. Furthermore, we extend our approach by utilizing both spatial and temporal cues, forming a hierarchical spatiotemporal action tokenizer, namely HiST-AT. Specifically, our hierarchical spatiotemporal approach conducts multi-level clustering, while simultaneously recovering input actions and their associated timestamps. Finally, extensive evaluations on multiple simulation and real robotic manipulation benchmarks show that our approach establishes a new state-of-the-art performance in in-context imitation learning.

10 h=10

World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems

2026-04-16 cs.RO cs.LG Hongyin Zhang · h=10

Runze Li, Hongyin Zhang, Junxi Jin, Qixin Zeng, Zifeng Zhuang

Core Contributions

Provides a theoretical analysis showing that planning directly in action space suffers from exponential probability decay of feasible trajectories with increasing horizon — motivating the shift to latent-space inference
Unifies world model, trajectory value function, and action generation into a single framework where the model progressively concentrates probability mass on high-value, dynamically feasible trajectories
Rather than running an explicit trajectory optimizer, WAV plans implicitly: action generation is cast as inference over a structured latent representation of future trajectories, conditioned on visual observations and language instructions
Demonstrates significant improvements in task success rate, generalization, and robustness over state-of-the-art VLA methods, with particularly strong gains in long-horizon and compositional scenarios
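
The exponential-decay argument can be made concrete with a toy calculation. The assumption here is for illustration only and is not the paper's formal statement: each step of a sequence sampled directly in action space is feasible independently with probability p < 1.

```latex
P(\text{feasible over horizon } H) \;=\; \prod_{t=1}^{H} p \;=\; p^{H} \;=\; e^{-H \ln(1/p)}
```

Under this assumption, p = 0.9 and H = 50 give 0.9^50 ≈ 0.005, so only about one sampled trajectory in two hundred remains feasible, which is the kind of collapse that motivates reshaping the search distribution in latent space.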

Abstract

Vision-Language-Action (VLA) models have emerged as a promising paradigm for building embodied agents that ground perception and language into action. However, most existing approaches rely on direct action prediction, lacking the ability to reason over long-horizon trajectories and evaluate their consequences, which limits performance in complex decision-making tasks. In this work, we introduce the World-Value-Action (WAV) model, a unified framework that enables implicit planning in VLA systems. Rather than performing explicit trajectory optimization, the WAV model learns a structured latent representation of future trajectories conditioned on visual observations and language instructions. A learned world model predicts future states, while a trajectory value function evaluates their long-horizon utility. Action generation is then formulated as inference in this latent space, where the model progressively concentrates probability mass on high-value and dynamically feasible trajectories. We provide a theoretical perspective showing that planning directly in action space suffers from an exponential decay in the probability of feasible trajectories as the horizon increases. In contrast, latent-space inference reshapes the search distribution toward feasible regions, enabling efficient long-horizon decision making. Extensive simulations and real-world experiments demonstrate that the WAV model consistently outperforms state-of-the-art methods, achieving significant improvements in task success rate, generalization ability, and robustness, especially in long-horizon and compositional scenarios.

12 h=9

ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints

2026-04-16 cs.AI cs.CL cs.CV cs.RO Jia-Fong Yeh · h=9

Pei-An Chen, Yong-Ching Liang, Jia-Fong Yeh, Hung-Ting Su, Yi-Ting Chen

Core Contributions

Introduces DynAfford, a benchmark where object affordances change dynamically over time and are never specified in the instruction — forcing agents to perceive states, infer preconditions, and adapt on the fly rather than blindly following commands
ADAPT is a plug-and-play module that augments any existing task planner with explicit affordance reasoning, checking whether target objects can actually be manipulated before committing to actions
A domain-adapted, LoRA-finetuned VLM used as the affordance inference backend outperforms GPT-4o, highlighting the importance of task-aligned affordance grounding
Shows significant robustness improvements across both seen and unseen environments, highlighting that affordance awareness is a missing piece in current embodied AI pipelines
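
The plug-and-play idea can be sketched as a wrapper around an arbitrary planner. Everything here is hypothetical: the function names, the `(verb, object)` action format, and the rule-based affordance check stand in for the paper's VLM backend, whose interface the abstract does not specify.

```python
from typing import Callable, Dict, List, Tuple

Action = Tuple[str, str]  # (verb, target object), a hypothetical format

def plan_with_affordance_check(
    planner: Callable[[str], List[Action]],
    is_afforded: Callable[[Action, Dict[str, str]], bool],
    recover: Callable[[Action, Dict[str, str]], List[Action]],
    instruction: str,
    object_states: Dict[str, str],
) -> List[Action]:
    """Wrap any planner: check each step's affordance before committing to it."""
    checked: List[Action] = []
    for action in planner(instruction):
        if not is_afforded(action, object_states):
            # Insert steps that restore the missing precondition first.
            checked.extend(recover(action, object_states))
        checked.append(action)
    return checked

# Toy stand-ins for the planner and the affordance backend.
def toy_planner(instruction: str) -> List[Action]:
    return [("pick", "mug"), ("pour", "mug")]

def toy_is_afforded(action: Action, states: Dict[str, str]) -> bool:
    verb, obj = action
    # Pouring into a mug is only afforded when the mug is clean.
    return not (verb == "pour" and states.get(obj) != "clean")

def toy_recover(action: Action, states: Dict[str, str]) -> List[Action]:
    return [("wash", action[1])]

result = plan_with_affordance_check(
    toy_planner, toy_is_afforded, toy_recover,
    "pour coffee into the mug", {"mug": "dirty"},
)
```

The instruction never mentions that the mug is dirty; the wrapper infers the unmet precondition from the perceived state and inserts a wash step, leaving the underlying planner untouched.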

Abstract

Intelligent embodied agents should not simply follow instructions, as real-world environments often involve unexpected conditions and exceptions. However, existing methods usually focus on directly executing instructions, without considering whether the target objects can actually be manipulated, meaning they fail to assess available affordances. To address this limitation, we introduce DynAfford, a benchmark that evaluates embodied agents in dynamic environments where object affordances may change over time and are not specified in the instruction. DynAfford requires agents to perceive object states, infer implicit preconditions, and adapt their actions accordingly. To enable this capability, we introduce ADAPT, a plug-and-play module that augments existing planners with explicit affordance reasoning. Experiments demonstrate that incorporating ADAPT significantly improves robustness and task success across both seen and unseen environments. We also show that a domain-adapted, LoRA-finetuned vision-language model used as the affordance inference backend outperforms a commercial LLM (GPT-4o), highlighting the importance of task-aligned affordance grounding.

17 h=5

DockAnywhere: Data-Efficient Visuomotor Policy Learning for Mobile Manipulation via Novel Demonstration Generation

2026-04-16 cs.RO Ziyu Shan · h=5

Ziyu Shan, Yuheng Zhou, Gaoyuan Wu, Ziheng Ji, Zhenyu Wu

Core Contributions

Identifies the "view generalization problem" in mobile manipulation — where docking point shifts between training and deployment cause visuomotor policies to fail — and solves it by lifting a single demonstration to diverse feasible docking configurations
Decouples docking-dependent base motions from contact-rich manipulation skills that remain invariant across viewpoints, enabling structure-preserving augmentation of demonstrations
Synthesizes visual observations in 3D by representing robot and objects as point clouds and applying point-level spatial editing, ensuring observation-action consistency across viewpoints without requiring additional real demonstrations
Achieves substantial success rate improvements on both ManiSkill and real-world platforms, with strong generalization to completely novel docking points unseen during training
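
One way to picture the observation-action consistency is as a single rigid transform applied to both the point cloud and the end-effector waypoints when the base moves to a new docking pose. The sketch below is a planar (SE(2)) simplification with hypothetical helper names; the paper's sampling of feasible docking points and its 3D point-level editing are not reproduced.

```python
import numpy as np

def se2(x, y, theta):
    """Homogeneous 2-D pose of a base frame expressed in the world frame."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, x], [s, c, y], [0.0, 0.0, 1.0]])

def relabel_for_new_dock(points_old_base, T_world_old_base, T_world_new_base):
    """Re-express points recorded in the old base frame in the new base frame."""
    # The world-frame geometry of the manipulation stays fixed; only the
    # base frame that observes it changes.
    T_new_old = np.linalg.inv(T_world_new_base) @ T_world_old_base
    homog = np.c_[points_old_base, np.ones(len(points_old_base))]
    return (homog @ T_new_old.T)[:, :2]

# Demonstration recorded with the base at the world origin.
T_old = se2(0.0, 0.0, 0.0)
# Hypothetical new docking pose: 0.5 m further along x.
T_new = se2(0.5, 0.0, 0.0)

object_points = np.array([[1.0, 0.2], [1.1, 0.2]])   # observed point cloud
ee_waypoints = np.array([[0.9, 0.2], [1.0, 0.2]])    # end-effector targets

new_points = relabel_for_new_dock(object_points, T_old, T_new)
new_waypoints = relabel_for_new_dock(ee_waypoints, T_old, T_new)
```

Because the same transform is applied to observations and actions, the relative geometry between gripper and object is unchanged, which is why the contact-rich part of the demonstration can be reused.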

Abstract

Mobile manipulation is a fundamental capability that enables robots to interact in expansive environments such as homes and factories. Most existing approaches follow a two-stage paradigm, where the robot first navigates to a docking point and then performs fixed-base manipulation using powerful visuomotor policies. However, real-world mobile manipulation often suffers from the view generalization problem due to shifts of docking points. To address this issue, we propose a novel low-cost demonstration generation framework named DockAnywhere, which improves viewpoint generalization under docking variability by lifting a single demonstration to diverse feasible docking configurations. Specifically, DockAnywhere lifts a trajectory to any feasible docking points by decoupling docking-dependent base motions from contact-rich manipulation skills that remain invariant across viewpoints. Feasible docking proposals are sampled under feasibility constraints, and corresponding trajectories are generated via structure-preserving augmentation. Visual observations are synthesized in 3D space by representing the robot and objects as point clouds and applying point-level spatial editing to ensure the consistency of observation and action across viewpoints. Extensive experiments on ManiSkill and real-world platforms demonstrate that DockAnywhere substantially improves policy success rates and easily generalizes to novel viewpoints from docking points unseen during training, significantly enhancing the generalization capability of mobile manipulation policy in real-world deployment.

21 h=3

R3D: Revisiting 3D Policy Learning

2026-04-16 cs.CV cs.RO Zhengdong Hong · h=3

Zhengdong Hong, Shenrui Wu, Haozhe Cui, Boyi Zhao, Ran Ji

Core Contributions

Systematically diagnoses why 3D policy learning has underperformed expectations: the omission of 3D data augmentation and the adverse effects of Batch Normalization — rather than inherent limitations of 3D representations — are the primary culprits
Proposes a new architecture pairing a scalable transformer-based 3D encoder with a diffusion decoder, engineered for training stability at scale and designed to leverage large-scale 3D pre-training
Significantly outperforms state-of-the-art 3D baselines on challenging manipulation benchmarks, establishing that 3D policy learning can be competitive when training recipes are corrected
Opens the door to scaling 3D imitation learning by removing the instabilities and overfitting problems that previously prevented adopting powerful 3D perception models for robot control
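
As a rough illustration of what 3D data augmentation for policy learning can look like, the sketch below rotates a point cloud and its action targets together and adds point jitter. The specific augmentations (random yaw, Gaussian jitter), their ranges, and the function name `augment_sample` are assumptions; the summary does not state which augmentations R3D uses.

```python
import numpy as np

def augment_sample(points, actions, rng, max_yaw=np.pi / 12, jitter_std=0.005):
    """Rotate observation and action targets by the same random yaw, then jitter points."""
    yaw = rng.uniform(-max_yaw, max_yaw)
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    # Sensor-like noise is added to the observation only.
    aug_points = points @ R.T + rng.normal(0.0, jitter_std, size=points.shape)
    # Actions (here: 3-D target positions) must follow the same rotation,
    # otherwise observation and label become inconsistent.
    aug_actions = actions @ R.T
    return aug_points, aug_actions

rng = np.random.default_rng(0)
points = rng.normal(size=(128, 3))                         # toy point cloud
actions = np.array([[0.3, 0.0, 0.2], [0.3, 0.1, 0.2]])     # toy 3-D targets
aug_points, aug_actions = augment_sample(points, actions, rng)
```

The design point is that augmentation acts on the observation-action pair rather than on the observation alone.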

Abstract

3D policy learning promises superior generalization and cross-embodiment transfer, but progress has been hindered by training instabilities and severe overfitting, precluding the adoption of powerful 3D perception models. In this work, we systematically diagnose these failures, identifying the omission of 3D data augmentation and the adverse effects of Batch Normalization as primary causes. We propose a new architecture coupling a scalable transformer-based 3D encoder with a diffusion decoder, engineered specifically for stability at scale and designed to leverage large-scale pre-training. Our approach significantly outperforms state-of-the-art 3D baselines on challenging manipulation benchmarks, establishing a new and robust foundation for scalable 3D imitation learning. Project Page: https://r3d-policy.github.io/

📍 SLAM & Localization

2 h=39

Efficient closed-form approaches for pose estimation using Sylvester forms

2026-04-16 cs.CV cs.RO E. Malis · h=39

Jana Vráblíková, Ezio Malis, Laurent Busé

Core Contributions

Introduces a new class of resultant-based solvers that exploit Sylvester forms to further reduce the complexity of closed-form pose estimation, building on recent solvers that use resultant matrices
Demonstrates numerical accuracy on par with state-of-the-art solvers while achieving faster computational times, which is critical for real-time applications like visual servoing and SLAM
Applies the framework to both 3D-to-3D and 3D-to-2D correspondence problems (PnP), showing the generality of the Sylvester form approach across different pose estimation variants
Relies on an adequate rotation parametrization to reduce the least-squares problem to a system of polynomial equations, which the Sylvester-form solvers then solve in closed form
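
For readers unfamiliar with resultants, the classical univariate Sylvester matrix gives the basic intuition: its determinant vanishes exactly when two polynomials (with nonzero leading coefficients) share a root. The sketch below shows only that textbook construction; the paper's Sylvester forms for multivariate pose-estimation systems are a related but different tool and are not reproduced here.

```python
import numpy as np

def sylvester_matrix(f, g):
    """Classical Sylvester matrix of two polynomials (coefficients, highest degree first)."""
    m, n = len(f) - 1, len(g) - 1
    S = np.zeros((m + n, m + n))
    for i in range(n):          # n shifted copies of f's coefficients
        S[i, i:i + m + 1] = f
    for i in range(m):          # m shifted copies of g's coefficients
        S[n + i, i:i + n + 1] = g
    return S

def resultant(f, g):
    """Resultant as the determinant of the Sylvester matrix."""
    return np.linalg.det(sylvester_matrix(f, g))

f = [1.0, -3.0, 2.0]                       # x^2 - 3x + 2 = (x - 1)(x - 2)
shared = resultant(f, [1.0, -1.0])         # x - 1 shares the root x = 1
disjoint = resultant(f, [1.0, -3.0])       # x - 3 shares no root with f
```

Here `shared` is zero up to floating-point error, while `disjoint` is nonzero. Elimination-based solvers build on this kind of matrix condition to turn a polynomial system into a linear-algebra problem.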

Abstract

Solving a non-linear least-squares problem for pose estimation (rotation and translation) is often a time-consuming yet fundamental task in several real-time computer vision applications. With an adequate rotation parametrization, the optimization problem can be reduced to the solution of a system of polynomial equations and solved in closed form. Recent advances in efficient closed-form solvers utilizing resultant matrices have shown a promising research direction to decrease the computation time while preserving the estimation accuracy. In this paper, we propose a new class of resultant-based solvers that exploit Sylvester forms to further reduce the complexity of the resolution. We demonstrate that our proposed methods are numerically as accurate as the state-of-the-art solvers, and outperform them in terms of computational time. We show that this approach can be applied for pose estimation in two different types of problems: estimating a pose from 3D to 3D correspondences, and estimating a pose from 3D points to 2D points correspondences.

3 h=28

Dual Pose-Graph Semantic Localization for Vision-Based Autonomous Drone Racing

2026-04-16 cs.RO P. Campoy · h=28

David Perez-Saura, Miguel Fernandez-Cortizas, Alvaro J. Gaona, Pascual Campoy

Core Contributions

Proposes a dual pose-graph architecture where a temporary graph accumulates multiple gate observations between keyframes and optimizes them into a single refined constraint per landmark before promoting it to a persistent main graph — preventing graph growth from degrading real-time performance
Achieves 56% to 74% reduction in Absolute Trajectory Error compared to standalone VIO on the TII-RATM dataset, demonstrating that structured racing environments can be exploited for robust localization
Ablation study confirms the dual-graph design achieves 10–12% higher accuracy than a single-graph baseline at identical computational cost, validating the information-preserving design
Successfully deployed in the A2RL competition for real-time onboard localization during high-speed flight, reducing drift by up to 4.2 m per lap compared to the odometry baseline
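
The temporary-graph-to-main-graph promotion can be sketched as follows. This is a deliberately simplified stand-in: gate observations are reduced to 3-D positions with scalar variances and fused by an inverse-variance weighted mean, whereas the paper optimizes a temporary pose graph. The class and method names are hypothetical.

```python
import numpy as np
from collections import defaultdict

class DualGraphBuffer:
    """Accumulate gate detections, then promote one fused constraint per gate."""

    def __init__(self):
        self.temp = defaultdict(list)  # gate_id -> [(position, variance), ...]
        self.main = []                 # (keyframe_id, gate_id, position, variance)

    def add_detection(self, gate_id, position, variance):
        self.temp[gate_id].append((np.asarray(position, dtype=float), float(variance)))

    def promote(self, keyframe_id):
        """At a keyframe, collapse each gate's detections into a single constraint."""
        for gate_id, obs in self.temp.items():
            weights = np.array([1.0 / v for _, v in obs])
            positions = np.stack([p for p, _ in obs])
            # Inverse-variance weighted mean and its fused variance.
            fused = (weights[:, None] * positions).sum(axis=0) / weights.sum()
            self.main.append((keyframe_id, gate_id, fused, 1.0 / weights.sum()))
        # The temporary store is emptied so the persistent graph grows by
        # one constraint per gate, not one per detection.
        self.temp.clear()

buf = DualGraphBuffer()
buf.add_detection(gate_id=0, position=[0.0, 0.0, 0.0], variance=1.0)
buf.add_detection(gate_id=0, position=[3.0, 0.0, 0.0], variance=0.5)
buf.promote(keyframe_id=7)
keyframe_id, gate_id, fused_position, fused_variance = buf.main[0]
```

Two detections become one constraint whose variance is smaller than either input, which captures the trade-off in the bullets: the information from frequent detections is kept while the persistent graph stays small.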

Abstract

Autonomous drone racing demands robust real-time localization under extreme conditions: high-speed flight, aggressive maneuvers, and payload-constrained platforms that often rely on a single camera for perception. Existing visual SLAM systems, while effective in general scenarios, struggle with motion blur and feature instability inherent to racing dynamics, and do not exploit the structured nature of racing environments. In this work, we present a dual pose-graph architecture that fuses odometry with semantic detections for robust localization. A temporary graph accumulates multiple gate observations between keyframes and optimizes them into a single refined constraint per landmark, which is then promoted to a persistent main graph. This design preserves the information richness of frequent detections while preventing graph growth from degrading real-time performance. The system is designed to be sensor-agnostic, although in this work we validate it using monocular visual-inertial odometry and visual gate detections. Experimental evaluation on the TII-RATM dataset shows a 56% to 74% reduction in ATE compared to standalone VIO, while an ablation study confirms that the dual-graph architecture achieves 10% to 12% higher accuracy than a single-graph baseline at identical computational cost. Deployment in the A2RL competition demonstrated that the system performs real-time onboard localization during flight, reducing the drift of the odometry baseline by up to 4.2 m per lap.

16 h=5

CAVERS: Multimodal SLAM Data from a Natural Karstic Cave with Ground Truth Motion Capture

2026-04-16 cs.RO Marcello Chiaberge · h=5

Giacomo Franchini, David Rodríguez-Martínez, Alfonso Martínez-Petersen, C. J. Pérez-del-Pulgar, Marcello Chiaberge

Core Contributions

Provides a multimodal SLAM dataset from a natural karstic cave with mm-accurate 6-DoF ground truth at 120 Hz for most sequences, captured by a motion capture system installed directly inside the cave, a challenging environment with irregular geometry, reflective wet surfaces, and near-zero ambient light
Combines RGB-D-I, thermal-IR, and LiDAR sensing across 24 sequences (~335 GB) in both handheld and rover configurations under full darkness and artificial illumination conditions
Benchmarks seven state-of-the-art SLAM/odometry algorithms spanning visual, visual-inertial, thermal-inertial, and LiDAR pipelines, establishing quantitative baselines for cave perception
Addresses a critical gap: existing subterranean datasets focus on mines or tunnels, which have regular geometry fundamentally different from the branching, irregular passages of natural caves

Abstract

Autonomous robots operating in natural karstic caves face perception and navigation challenges that are qualitatively distinct from those encountered in mines or tunnels: irregular geometry, reflective wet surfaces, near-zero ambient light, and complex branching passages. Yet publicly available datasets targeting this environment remain scarce and offer limited sensing modalities and environmental diversity. We present CAVERS, a multimodal dataset acquired in two structurally distinct rooms of Cueva de la Victoria, Málaga, Spain, comprising 24 sequences totaling approximately 335 GB of recorded data. The sensor suite combines an Intel RealSense D435i RGB-D-I camera, an Optris PI640i near-IR thermal camera, and a Velodyne VLP-16 LiDAR, operated both handheld and mounted on a wheeled rover under full darkness and artificial illumination. For most of the sequences, mm-accurate 6-DoF ground truth pose and velocity at 120 Hz are provided by an OptiTrack motion capture system installed directly inside the cave. We benchmark seven state-of-the-art SLAM and odometry algorithms spanning visual, visual-inertial, thermal-inertial, and LiDAR-based pipelines, as well as a 3D reconstruction pipeline, demonstrating the dataset's usability.

27 h=1

Keep It CALM: Toward Calibration-Free Kilometer-Level SLAM with Visual Geometry Foundation Models via an Assistant Eye

2026-04-16 cs.RO Tianchen Deng · h=1

Tianjun Zhang, Fengyi Zhang, Tianchen Deng, Lin Zhang, Hesheng Wang

Core Contributions

Argues that single linear transforms (Sim3, SL4) are fundamentally insufficient for aligning Visual Geometry Foundation Model outputs at kilometer scale, because VGFMs introduce complex non-linear geometric distortions that accumulate into trajectory drift
Introduces an "assistant eye" that exploits the prior of constant physical spacing to eliminate scale ambiguity without any temporal or spatial pre-calibration
Proposes an epipolar-guided intrinsic and pose correction model that rectifies rotation and translation errors caused by inaccurate intrinsics through fundamental matrix decomposition
Uses anchor propagation for globally consistent mapping, applying nonlinear transformations to elastically align sub-maps rather than rigid transforms, effectively eliminating geometric misalignment at scale
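
The general principle behind using a known physical spacing to fix scale can be sketched in a few lines. This is not CAL2M's procedure, which the abstract does not spell out: it assumes the up-to-scale reconstruction yields camera centers for both the main and the assistant camera, and that their true separation is known. The function name and the median aggregation are illustrative choices.

```python
import numpy as np

def metric_scale(main_centers, assistant_centers, known_spacing_m):
    """Scale factor that maps an up-to-scale reconstruction to meters."""
    main_centers = np.asarray(main_centers, dtype=float)
    assistant_centers = np.asarray(assistant_centers, dtype=float)
    # Estimated (unitless) distance between the two cameras at each frame.
    est = np.linalg.norm(assistant_centers - main_centers, axis=1)
    # The median is less sensitive to occasional bad frames than the mean.
    return known_spacing_m / np.median(est)

# Toy reconstruction where the two cameras appear 0.05 units apart,
# while the rig's true spacing is assumed to be 0.2 m.
main = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
assistant = [[0.0, 0.05, 0.0], [1.0, 0.05, 0.0], [2.0, 0.05, 0.0]]
scale = metric_scale(main, assistant, known_spacing_m=0.2)
metric_main = scale * np.asarray(main)
```

Multiplying the reconstruction by `scale` puts trajectory and map in metric units, so no separate scale calibration step is needed in this simplified picture.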

Abstract

Visual Geometry Foundation Models (VGFMs) demonstrate remarkable zero-shot capabilities in local reconstruction. However, deploying them for kilometer-level Simultaneous Localization and Mapping (SLAM) remains challenging. In such scenarios, current approaches mainly rely on linear transforms (e.g., Sim3 and SL4) for sub-map alignment, while we argue that a single linear transform is fundamentally insufficient to model the complex, non-linear geometric distortions inherent in VGFM outputs. Forcing such rigid alignment leads to the rapid accumulation of uncorrected residuals, eventually resulting in significant trajectory drift and map divergence. To address these limitations, we present CAL2M (Calibration-free Assistant-eye based Large-scale Localization and Mapping), a plug-and-play framework compatible with arbitrary VGFMs. Distinct from traditional systems, CAL2M introduces an "assistant eye" solely to leverage the prior of constant physical spacing, effectively eliminating scale ambiguity without any temporal or spatial pre-calibration. Furthermore, leveraging the assumption of accurate feature matching, we propose an epipolar-guided intrinsic and pose correction model. Supported by an online intrinsic search module, it can effectively rectify rotation and translation errors caused by inaccurate intrinsics through fundamental matrix decomposition. Finally, to ensure accurate mapping, we introduce a globally consistent mapping strategy based on anchor propagation. By constructing and fusing anchors across the trajectory, we establish a direct local-to-global mapping relationship. This enables the application of nonlinear transformations to elastically align sub-maps, effectively eliminating geometric misalignments and ensuring a globally consistent reconstruction. The source code of CAL2M will be publicly available at https://github.com/IRMVLab/CALM.

28 h=0

CT-VIR: Continuous-Time Visual-Inertial-Ranging Fusion for Indoor Localization with Sparse Anchors

2026-04-16 cs.RO Yu-An Liu · h=0

Yu-An Liu, Li Zhang

Core Contributions

Addresses the practical constraint that high-accuracy UWB localization requires well-deployed anchors — which is often infeasible in narrow or low-power environments — by constructing virtual anchors from VIO motion priors and UWB measurements
Parameterizes the pose trajectory using B-splines in continuous time, enabling natural handling of asynchronous multi-sensor sampling without the discrete-time alignment issues that plague filtering and standard optimization methods
Formulates inertial, visual, and ranging constraints as factors in a sliding-window graph, jointly optimizing spline control points and auxiliary parameters for a continuous-time trajectory estimate
Demonstrates effectiveness on public datasets and real-world experiments, showing that spline-based continuous-time fusion can balance positioning accuracy, trajectory consistency, and computational efficiency
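
To see why a spline handles asynchronous sensors naturally, consider a uniform cubic B-spline over positions: any timestamp can be queried directly, so measurements need not be aligned to common discrete states. The sketch below covers position only with uniform knots; the paper parameterizes the full pose and optimizes the control points in a sliding window, neither of which is shown. The function name and arguments are hypothetical.

```python
import numpy as np

# Uniform cubic B-spline basis matrix (power basis [1, u, u^2, u^3]).
M = np.array([
    [1.0, 4.0, 1.0, 0.0],
    [-3.0, 0.0, 3.0, 0.0],
    [3.0, -6.0, 3.0, 0.0],
    [-1.0, 3.0, -3.0, 1.0],
]) / 6.0

def spline_position(control_points, t, t0, dt):
    """Evaluate the spline at time t; valid for t0 <= t <= t0 + (N - 3) * dt."""
    s = (t - t0) / dt
    # Pick the segment; clamp so that four control points are always available.
    i = min(max(int(np.floor(s)), 0), len(control_points) - 4)
    u = s - i
    basis = np.array([1.0, u, u * u, u ** 3]) @ M
    return basis @ control_points[i:i + 4]

# Six control points on a straight line, knot spacing 0.1 s.
control_points = np.array([[float(k), 0.0, 0.0] for k in range(6)])

# Sensors with unrelated timestamps query the same trajectory.
p_camera = spline_position(control_points, t=0.05, t0=0.0, dt=0.1)
p_uwb = spline_position(control_points, t=0.25, t0=0.0, dt=0.1)
```

In a factor-graph formulation, each inertial, visual, or ranging residual would be evaluated at its own timestamp through a function like this, and the optimizer would adjust the control points rather than per-measurement states.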

Abstract

Visual-inertial odometry (VIO) is widely used for mobile robot localization, but its long-term accuracy degrades without global constraints. Incorporating ranging sensors such as ultra-wideband (UWB) can mitigate drift; however, high-accuracy ranging usually requires well-deployed anchors, which is difficult to ensure in narrow or low-power environments. Moreover, most existing visual-inertial-ranging (VIR) fusion methods rely on discrete time-based filtering or optimization, making it difficult to balance positioning accuracy, trajectory consistency, and fusion efficiency under asynchronous multi-sensor sampling. To address these issues, we propose a spline-based continuous-time state estimation method for VIR fusion localization. In the preprocessing stage, VIO motion priors and UWB ranging measurements are used to construct virtual anchors and reject outliers, thereby alleviating geometric degeneration and improving range reliability. In the estimation stage, the pose trajectory is parameterized in continuous time using a B-spline, while inertial, visual, and ranging constraints are formulated as factors in a sliding-window graph. The spline control points, together with a small set of auxiliary parameters, are then jointly optimized to obtain a continuous-time trajectory estimate. Evaluations on public datasets and real-world experiments demonstrate the effectiveness and practical potential of the proposed approach.

🌲 Radar, LiDAR & Field Sensing

1 h=81

DigiForest: Digital Analytics and Robotics for Sustainable Forestry

2026-04-16 cs.RO C. Stachniss · h=81

Marco Camurri, Enrico Tomelleri, Matías Mattamala, Sebastián Barbas Laina, Martin Jacquet

Core Contributions

Deploys a complete precision forestry pipeline spanning autonomous data collection with heterogeneous robots (aerial, legged, and marsupial), automated tree trait extraction, a decision support system for growth forecasting, and autonomous selective logging harvesters
Validates all four components in real-world forests across Finland, UK, and Switzerland — not just in simulation — demonstrating that the full technology stack works under actual forestry conditions
Addresses a critical EU policy need: forests cover ~40% of European land and are central to climate neutrality and biodiversity goals, but current management practices lack the automation needed for tree-level precision at scale
Combines aerial, legged, and marsupial platforms in one heterogeneous fleet for tree-level data collection, rather than relying on a single robot type

Abstract

Covering one third of Earth's land surface, forests are vital to global biodiversity, climate regulation, and human well-being. In Europe, forests and woodlands reach approximately 40% of land area, and the forestry sector is central to achieving the EU's climate neutrality and biodiversity goals; these emphasize sustainable forest management, increased use of long-lived wood products, and resilient forest ecosystems. To meet these goals and properly address their inherent challenges, current practices require further innovation. This chapter introduces DigiForest, a novel, large-scale precision forestry approach leveraging digital technologies and autonomous robotics. DigiForest is structured around four main components: (1) autonomous, heterogeneous mobile robots (aerial, legged, and marsupial) for tree-level data collection; (2) automated extraction of tree traits to build forest inventories; (3) a Decision Support System (DSS) for forecasting forest growth and supporting decision-making; and (4) low-impact selective logging using purpose-built autonomous harvesters. These technologies have been extensively validated in real-world conditions in several locations, including forests in Finland, the UK, and Switzerland.

7 h=19

A multi-platform LiDAR dataset for standardized forest inventory measurement at long term ecological monitoring sites

2026-04-16 cs.RO K. Ellenrieder · h=19

Michael R. Chang, Anna Candotti, Karl von Ellenrieder, Enrico Tomelleri, Marco Camurri

Core Contributions

Integrates UAV-borne, terrestrial, and backpack mobile laser scanning from an ICOS forest plot into a single curated dataset — explicitly designed for calibration, benchmarking, and linking 3D structure with ecological observations and allometric models
Uses marker-free, SLAM-aware protocols to reduce both field and processing time while maintaining registration quality — a practical innovation for repeated forest inventories at scale
Situates acquisitions at a long-term ICOS monitoring site with decades of ecological and flux measurements, creating a unique bridge between remote sensing and established ecological science
Provides 333 million TLS points with complementary ULS/MLS data in LAZ and E57 formats with UTM coordinates, enabling benchmarks for registration, segmentation, quantitative structure models, and biomass estimation

Show abstract

We present a curated multi-platform LiDAR reference dataset from an instrumented ICOS forest plot, explicitly designed to support calibration, benchmarking, and integration of 3D structural data with ecological observations and standard allometric models. The dataset integrates UAV-borne laser scanning (ULS) to measure canopy coverage, terrestrial laser scanning (TLS) for detailed stem mapping, and backpack mobile laser scanning (MLS) with real-time SLAM for efficient sub-canopy acquisition. We focus on the control plot with the most complete and internally consistent registration, where TLS point clouds (~333 million points) are complemented by ULS and MLS data capturing canopy and understory strata. Marker-free, SLAM-aware protocols were used to reduce field and processing time, while manual and automated methods were combined. Final products are available in LAZ and E57 formats with UTM coordinates, together with registration reports for reproducibility. The dataset provides a benchmark for testing registration methods, evaluating scanning efficiency, and linking point clouds with segmentation, quantitative structure models, and allometric biomass estimation. By situating the acquisitions at a long-term ICOS site, it explicitly links 3D structure with decades of ecological and flux measurements. More broadly, it illustrates how TLS, MLS, and ULS can be combined for repeated inventories and digital twins of forest ecosystems.

8 h=13

Graph Theoretical Outlier Rejection for 4D Radar Registration in Feature-Poor Environments

2026-04-16 cs.RO Daniel Adolfsson · h=13

Georg Dorndorf, Daniel Adolfsson, Masrur Doostdar

Core Contributions

Integrates graph-based pairwise consistency maximization (PCM) into the ICP loop for 4D radar, with a radar-adapted scoring function that incorporates anisotropic, per-detection uncertainty from a measurement model — unlike standard PCM that assumes isotropic noise
Reduces segment relative position error by 29.6% on 1 m segments and up to 55% on 100 m segments compared to the GICP baseline on real open-pit mine data — a setting where feature poverty makes correspondence reliability particularly poor
Uses a greedy heuristic to approximate maximum clique finding in the consistency graph, keeping the method suitable for online use despite the combinatorial nature of the outlier rejection problem
Specifically targets open-pit mines where the lack of distinctive landmarks compounds radar's inherent challenges of scan sparsity and multipath reflections, demonstrating practical viability in industrial settings

Show abstract

Automotive 4D imaging radar is well suited for operation in dusty and low-visibility environments, but scan registration remains challenging due to scan sparsity and spurious detections caused by noise and multipath reflections. This difficulty is compounded in feature-poor open-pit mines, where the lack of distinctive landmarks reduces correspondence reliability. We integrate graph-based pairwise consistency maximization (PCM) as an outlier rejection step within the iterative closest point (ICP) loop. We propose a radar-adapted pairwise distance-invariant scoring function for graph-based PCM that incorporates anisotropic, per-detection uncertainty derived from a radar measurement model. The consistency maximization problem is approximated with a greedy heuristic that finds a large clique in the pairwise consistency graph. The refined correspondence set improves robustness when the initial association set is heavily contaminated. We evaluate a standard Euclidean distance residual and our uncertainty-aware residual on an open-pit mine dataset collected with a 4D imaging radar. Compared to the generalized ICP (GICP) baseline without PCM, our method reduces segment relative position error (RPE) by 29.6% on 1 m segments and by up to 55% on 100 m segments. The presented method is intended for integration into localization pipelines and is suitable for online use due to the greedy heuristic in graph-based PCM.
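The outlier rejection idea in the abstract above can be illustrated with a minimal sketch. It assumes a plain isotropic threshold `eps` on pairwise distance preservation, which is the standard PCM-style test and not the paper's radar-adapted, uncertainty-aware scoring function; the function names and toy points are hypothetical.

```python
import math
from itertools import combinations


def consistency_graph(src, dst, eps):
    """Link correspondences i and j when a rigid motion could explain both,
    i.e. the distance between the two points is (nearly) the same in both scans."""
    n = len(src)
    adj = [set() for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d_src = math.dist(src[i], src[j])
        d_dst = math.dist(dst[i], dst[j])
        if abs(d_src - d_dst) <= eps:  # isotropic test (assumption, not the paper's score)
            adj[i].add(j)
            adj[j].add(i)
    return adj


def greedy_clique(adj):
    """Greedy heuristic for a large clique: visit vertices by decreasing degree
    and keep each vertex that is adjacent to everything kept so far."""
    order = sorted(range(len(adj)), key=lambda v: len(adj[v]), reverse=True)
    clique = []
    for v in order:
        if all(v in adj[u] for u in clique):
            clique.append(v)
    return sorted(clique)


# Toy data: correspondences 0..3 follow a pure translation, index 4 is a gross outlier.
src = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (2, 2, 2)]
dst = [(1, 2, 0), (2, 2, 0), (1, 3, 0), (1, 2, 1), (9, 9, 9)]
inliers = greedy_clique(consistency_graph(src, dst, eps=0.05))
```

In an ICP loop, only the correspondences in `inliers` would be passed to the alignment step. The greedy pass does not guarantee a maximum clique, which is the trade-off the abstract accepts for online use.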

24 h=2

4D Radar Gaussian Modeling and Scan Matching with RCS

2026-04-16 cs.RO Fernando Amodeo · h=2

Fernando Amodeo, Luis Merino, Fernando Caballero

Core Contributions

Incorporates Radar Cross Section (RCS) — a measure of how much radar energy an object reflects — into 3D Gaussian scene models, whereas prior 4D radar work typically discards RCS information during modeling and matching
Extends previous 3D Gaussian modeling and scan matching frameworks to model the physical behavior of RCS, enriching scene representations beyond just geometric position and Doppler velocity
Demonstrates that including RCS improves scan matching performance by providing additional discriminative information about surface materials and geometry, which is especially valuable when spatial features alone are ambiguous

Show abstract

4D millimeter-wave (mmWave) radars are increasingly used in robotics, as they offer robustness against adverse environmental conditions. Besides the usual XYZ position, they provide Doppler velocity measurements as well as Radar Cross Section (RCS) information for every point. While Doppler is widely used to filter out dynamic points, RCS is often overlooked and not usually used in modeling and scan matching processes. Building on previous 3D Gaussian modeling and scan matching work, we propose incorporating the physical behavior of RCS in the model, in order to further enrich the summarized information about the scene, and improve the scan matching process.

🤲 Manipulation & Grasping

4 h=27

Differentiable Object Pose Connectivity Metrics for Regrasp Sequence Optimization

2026-04-16 cs.RO Weiwei Wan · h=27

Liang Qin, Weiwei Wan, Kensuke Harada

Core Contributions

Replaces brittle discrete search over intermediate placements with a continuous, differentiable energy landscape for measuring pose-pair connectivity — enabling gradient-based optimization of intermediate object poses during regrasp planning
Models grasp feasibility under an object pose using an Energy-Based Model (EBM) and exploits energy additivity to construct a smooth connectivity metric, providing informative gradients that discrete approaches fundamentally cannot
Introduces an adaptive iterative deepening strategy that automatically determines the minimum number of intermediate regrasp steps, eliminating the need to pre-specify the number of regrasps
Demonstrates cross-end-effector transfer: a model trained with suction constraints can guide parallel gripper manipulation, suggesting the learned energy landscape captures general pose connectivity rather than gripper-specific feasibility

Show abstract

Regrasp planning is often required when one pick-and-place cannot transfer an object from an initial pose to a goal pose while maintaining grasp feasibility. The main challenge is to reason about shared-grasp connectivity across intermediate poses, where discrete search becomes brittle. We propose an implicit multi-step regrasp planning framework based on differentiable pose sequence connectivity metrics. We model grasp feasibility under an object pose using an Energy-Based Model (EBM) and leverage energy additivity to construct a continuous energy landscape that measures pose-pair connectivity, enabling gradient-based optimization of intermediate object poses. An adaptive iterative deepening strategy is introduced to determine the minimum number of intermediate steps automatically. Experiments show that the proposed cost formulation provides smooth and informative gradients, improving planning robustness over other alternatives. They also demonstrate generalization to unseen grasp poses and cross-end-effector transfer, where a model trained with suction constraints can guide parallel gripper grasp manipulation. The multi-step planning results further highlight the effectiveness of adaptive deepening and minimum-step search.

14 h=8

HRDexDB: A Large-Scale Dataset of Dexterous Human and Robotic Hand Grasps

2026-04-16 cs.RO cs.CV Hanbyul Joo · h=8

Jongbin Lim, Taeyun Ha, Mingi Choi, Jisoo Kim, Byungjun Kim

Core Contributions

Provides closely aligned captures of human dexterity and robotic execution on the same 100 objects under comparable grasping motions — enabling direct cross-domain comparison that existing datasets (which cover either humans or robots, not both) cannot support
Includes 1.4K grasping trials with synchronized high-resolution tactile signals, multi-view video, egocentric video, and high-precision 3D motion capture for both agent and manipulated object
Captures both successes and failures, providing negative examples that are critical for learning robust policies but are typically absent from curated datasets
Serves as a foundational benchmark for multi-modal policy learning and cross-domain dexterous manipulation transfer — the aligned human-robot format specifically enables studying how human grasping strategies translate to different robot hand embodiments

Show abstract

We present HRDexDB, a large-scale, multi-modal dataset of high-fidelity dexterous grasping sequences featuring both human and diverse robotic hands. Unlike existing datasets, HRDexDB provides a comprehensive collection of grasping trajectories across human hands and multiple robot hand embodiments, spanning 100 diverse objects. Leveraging state-of-the-art vision methods and a new dedicated multi-camera system, our HRDexDB offers high-precision spatiotemporal 3D ground-truth motion for both the agent and the manipulated object. To facilitate the study of physical interaction, HRDexDB includes high-resolution tactile signals, synchronized multi-view video, and egocentric video streams. The dataset comprises 1.4K grasping trials, encompassing both successes and failures, each enriched with visual, kinematic, and tactile modalities. By providing closely aligned captures of human dexterity and robotic execution on the same target objects under comparable grasping motions, HRDexDB serves as a foundational benchmark for multi-modal policy learning and cross-domain dexterous manipulation.

18 h=5

POMDP-based Object Search with Growing State Space and Hybrid Action Domain

2026-04-16 cs.RO Shoudong Huang · h=5

Yongbo Chen, Hesheng Wang, Shoudong Huang, Hanna Kurniawati

Core Contributions

Formulates object search as a high-dimensional POMDP with a growing state space and hybrid (continuous+discrete) action spaces in 3D — handling the realistic complexity that state and action spaces expand as the robot discovers new objects and areas
Proposes GNPF-kCT, a novel online POMDP solver combining belief tree reuse for growing state spaces, a neural process network to filter useless primitive actions, and k-center clustering for efficient high-dimensional action space refinement
Outperforms both POMDP-based baselines and state-of-the-art LLM-based object search methods in Gazebo simulations with Fetch and Stretch robots under the same computational and perception constraints
Introduces a "guessed target object" strategy with a grid-world model to handle limited-information scenarios, providing useful search guidance when reward signals are sparse or absent

Show abstract

Efficiently locating target objects in complex indoor environments with diverse furniture, such as shelves, tables, and beds, is a significant challenge for mobile robots. This difficulty arises from factors like localization errors, limited fields of view, and visual occlusion. We address this by framing the object-search task as a high-dimensional Partially Observable Markov Decision Process (POMDP) with a growing state space and hybrid (continuous and discrete) action spaces in 3D environments. Based on a meticulously designed perception module, a novel online POMDP solver named the growing neural process filtered k-center clustering tree (GNPF-kCT) is proposed to tackle this problem. Optimal actions are selected using Monte Carlo Tree Search (MCTS) with belief tree reuse for growing state space, a neural process network to filter useless primitive actions, and k-center clustering hypersphere discretization for efficient refinement of high-dimensional action spaces. A modified upper-confidence bound (UCB), informed by belief differences and action value functions within cells of estimated diameters, guides MCTS expansion. Theoretical analysis validates the convergence and performance potential of our method. To address scenarios with limited information or rewards, we also introduce a guessed target object with a grid-world model as a key strategy to enhance search efficiency. Extensive Gazebo simulations with Fetch and Stretch robots demonstrate faster and more reliable target localization than POMDP-based baselines and state-of-the-art (SOTA) non-POMDP-based solvers, especially large language model (LLM) based methods, in object search under the same computational constraints and perception systems. Real-world tests in office environments confirm the practical applicability of our approach.
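For readers unfamiliar with k-center clustering, the sketch below shows the classic greedy farthest-first heuristic for picking k representative points from a set of candidate actions. It is a generic textbook procedure, not the paper's hypersphere discretization inside GNPF-kCT; the function name and the 2D toy actions are hypothetical.

```python
import math


def k_center_greedy(points, k, first=0):
    """Farthest-first traversal: repeatedly add the point that is farthest
    from its nearest already-chosen center. Returns center indices and the
    resulting covering radius."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centers = [first]
    # nearest[i] holds the distance from point i to its closest chosen center
    nearest = [math.dist(p, points[first]) for p in points]
    while len(centers) < k:
        nxt = max(range(len(points)), key=lambda i: nearest[i])
        centers.append(nxt)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[nxt]))
    return centers, max(nearest)


# Toy continuous actions forming three loose groups
actions = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.0, 0.1), (0.5, 1.0)]
centers, radius = k_center_greedy(actions, k=3)
```

Each chosen center stands in for all actions within `radius` of it, so a tree search can branch over k representatives instead of the full continuous set.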

20 h=4

DEX-Mouse: A Low-cost Portable and Universal Interface with Force Feedback for Data Collection of Dexterous Robotic Hands

2026-04-16 cs.RO Wook Ko · h=4

Joonho Koh, Haechan Jung, Nayoung Kim, Wook Ko, Changjoo Nam

Core Contributions

Builds a complete dexterous hand teleoperation interface for under USD 150 from commercial off-the-shelf components, offering a low-cost alternative to MoCap glove systems that require per-operator calibration and lack portability
Achieves operator-agnostic, calibration-free operation with integrated kinesthetic force feedback, enabling immediate deployment across diverse environments and platforms without structural modification
Supports an "attached configuration" where the robot hand mounts directly on the operator's forearm, producing robot-aligned demonstration data and reducing perceived workload compared to spatially separated setups across all compared interfaces
Operators achieved 86.67% task completion rate across various dexterous manipulation tasks in the attached configuration, with the full hardware and software stack open-sourced for community adoption

Show abstract

Data-driven dexterous hand manipulation requires large-scale, physically consistent demonstration data. Simulation and video-based methods suffer from sim-to-real gaps and retargeting problems, while MoCap glove-based teleoperation systems require per-operator calibration and lack portability, as the robot hand is typically fixed to a stationary arm. Portable alternatives improve mobility but lack cross-platform and cross-operator compatibility. We present DEX-Mouse, a portable, calibration-free hand-held teleoperation interface with integrated kinesthetic force feedback, built from commercial off-the-shelf components under USD 150. The operator-agnostic design requires no calibration or structural modification, enabling immediate deployment across diverse environments and platforms. The interface supports a configuration in which the target robot hand is mounted directly on the forearm of an operator, producing robot-aligned data. In a comparative user study across various dexterous manipulation tasks, operators using the proposed system achieved an 86.67% task completion rate under the attached configuration. Also, we found that the attached configuration reduced the perceived workload of the operators compared to spatially separated teleoperation setups across all compared interfaces. The complete hardware and software stack, including bill of materials, CAD models, and firmware, is open-sourced at https://dex-mouse.github.io/ to facilitate replication and adoption.

🗺️ Navigation & Path Planning

13 h=8

NEAT-NC: NEAT guided Navigation Cells for Robot Path Planning

2026-04-16 cs.RO cs.AI cs.NE K. Slimani · h=8

Hibatallah Meliani, Khadija Slimani, Samira Khoulji

Core Contributions

Draws directly from neuroscience — using place cells, grid cells, head direction cells, border cells, and speed cells as inputs to an evolving neural network — to model hippocampal spatial cognition for robot path planning
Evolves recurrent neural network topologies using NEAT (NeuroEvolution of Augmenting Topologies) that grow in complexity to match the environment, rather than using fixed architectures that may be over- or under-specified
Evaluates across both static and dynamic scenarios, showing that the biological navigation cell representation improves NEAT's ability to adapt to complex and varying environments
Suggests the approach is well suited to real-time path planning in dynamic environments, for both robotics and games

Show abstract

To navigate a space, the brain makes an internal representation of the environment using different cells such as place cells, grid cells, head direction cells, border cells, and speed cells. All these cells, along with sensory inputs, enable an organism to explore the space around it. Inspired by these biological principles, we developed NEAT-NC, NeuroEvolution of Augmenting Topologies (NEAT) guided Navigation Cells. The goal of the paper is to improve the NEAT algorithm's performance in path planning in dynamic environments using spatial cognitive cells. This approach uses navigation cells as inputs and evolves recurrent neural networks, representing the hippocampus part of the brain. The performance of the proposed algorithm is evaluated in different static and dynamic scenarios. This study highlights NEAT's adaptability to complex and different environments, showcasing the utility of biological theories. This suggests that our approach is well-suited for real-time dynamic path planning for robotics and games.

19 h=4

Trajectory Planning for a Multi-UAV Rigid-Payload Cascaded Transportation System Based on Enhanced Tube-RRT*

2026-04-16 cs.RO eess.SY Tianhua Gao · h=4

Jianqiao Yu, Jia Li, Tianhua Gao

Core Contributions

Develops Enhanced Tube-RRT* with active hybrid sampling and adaptive expansion that achieves higher success and effective sampling rates than STube-RRT* and AETube-RRT* in densely cluttered environments
Explicitly incorporates trajectory smoothness cost into the edge cost function to reduce excessive turns — directly mitigating cable-induced oscillations that plague tethered multi-UAV payload systems
Formulates a convex quadratic program in Stage II that jointly considers payload translational/rotational dynamics, cable tension constraints, and collision safety to produce smooth, collision-free payload trajectories
Validates the complete two-stage framework with centralized geometric control, demonstrating practical feasibility for payload attitude maneuvering in dense obstacle fields where existing methods struggle

Show abstract

This paper presents a two-stage trajectory planning framework for a multi-UAV rigid-payload cascaded transportation system, aiming to address planning challenges in densely cluttered environments. In Stage I, an Enhanced Tube-RRT* algorithm is developed by integrating active hybrid sampling and an adaptive expansion strategy, enabling rapid generation of a safe and feasible virtual tube in environments with dense obstacles. Moreover, a trajectory smoothness cost is explicitly incorporated into the edge cost to reduce excessive turns and thereby mitigate cable-induced oscillations. Simulation results demonstrate that the proposed Enhanced Tube-RRT* achieves a higher success rate and effective sampling rate than mixed-sampling Tube-RRT* (STube-RRT*) and adaptive-extension Tube-RRT* (AETube-RRT*), while producing a shorter optimal path with a smaller cumulative turning angle. In Stage II, a convex quadratic program is formulated by considering payload translational and rotational dynamics, cable tension constraints, and collision-safety constraints, yielding a smooth, collision-free desired payload trajectory. Finally, a centralized geometric control scheme is applied to the cascaded system to validate the effectiveness and feasibility of the proposed planning framework, offering a practical solution for payload attitude maneuvering in densely cluttered environments.
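One way to read the smoothness term in the Stage I edge cost is as a penalty on the turning angle at each new node. The sketch below assumes a 2D workspace and an edge cost of the form length plus a weighted turning angle; the weight `w_smooth` and the function names are hypothetical, and the paper's exact cost is not given in the abstract.

```python
import math


def turning_angle(a, b, c):
    """Angle in radians between segment a->b and segment b->c (0 for a straight line)."""
    v1 = (b[0] - a[0], b[1] - a[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0.0 or n2 == 0.0:
        return 0.0
    cos_t = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    return math.acos(max(-1.0, min(1.0, cos_t)))  # clamp against rounding error


def edge_cost(grandparent, parent, child, w_smooth=0.5):
    """Path-length cost plus an assumed smoothness penalty on the turn at `parent`."""
    length = math.dist(parent, child)
    if grandparent is None:  # the root has no incoming edge, so no turn to penalize
        return length
    return length + w_smooth * turning_angle(grandparent, parent, child)


straight = edge_cost((0, 0), (1, 0), (2, 0))
right_turn = edge_cost((0, 0), (1, 0), (1, 1))
```

With this cost, an RRT*-style rewiring step prefers parents that keep the path straight when path lengths are similar, which is the mechanism the abstract credits with reducing cable-induced oscillations.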

22 h=3

Momentum-constrained Hybrid Heuristic Trajectory Optimization Framework with Residual-enhanced DRL for Visually Impaired Scenarios

2026-04-16 cs.RO Yongbin Yu · h=3

Yuting Zeng, Zhiwen Zheng, Jingya Wang, You Zhou, JiaLing Xiao

Core Contributions

Introduces momentum constraints into heuristic trajectory optimization to suppress abrupt velocity and acceleration changes — directly addressing the comfort requirements of assistive navigation for visually impaired users
A residual-enhanced DRL module refines candidate trajectories, combining the coverage of heuristic sampling with the temporal modeling and generalization capabilities of learned policies
Proposes a dual-stage cost mechanism: Frenet-space costs ensure trajectory consistency while Cartesian-space reward-driven adaptive weights integrate user preferences for interpretable, user-centric decision-making
Converges in nearly half the iterations of baselines while achieving lower and more stable costs, with stable velocity/acceleration profiles and reduced risk in complex dynamic scenarios

Show abstract

Safe and efficient assistive planning for visually impaired scenarios remains challenging, since existing methods struggle with multi-objective optimization, generalization, and interpretability. In response, this paper proposes a Momentum-Constrained Hybrid Heuristic Trajectory Optimization Framework (MHHTOF). To balance multiple objectives of comfort and safety, the framework designs a Heuristic Trajectory Sampling Cluster (HTSC) with a Momentum-Constrained Trajectory Optimization (MTO), which suppresses abrupt velocity and acceleration changes. In addition, a novel residual-enhanced deep reinforcement learning (DRL) module refines candidate trajectories, advancing temporal modeling and policy generalization. Finally, a dual-stage cost modeling mechanism (DCMM) is introduced to regulate optimization, where costs in the Frenet space ensure consistency, and reward-driven adaptive weights in the Cartesian space integrate user preferences for interpretability and user-centric decision-making. Experimental results show that the proposed framework converges in nearly half the iterations of baselines and achieves lower and more stable costs. In complex dynamic scenarios, MHHTOF further demonstrates stable velocity and acceleration curves with reduced risk, confirming its advantages in robustness, safety, and efficiency.

26 h=1

Benchmarking Classical Coverage Path Planning Heuristics on Irregular Hexagonal Grids for Maritime Coverage Scenarios

2026-04-16 cs.RO cs.AI math.OC Gonzalo A. Ruz · h=1

Carlos S. Sepúlveda, Gonzalo A. Ruz

Core Contributions

Provides a reproducible benchmark of 17 deterministic heuristics from 7 families on 10,000 Hamiltonian-feasible hexagonal graph instances — filling a gap where classical coverage methods were typically compared on small ad hoc examples or rectangular grids
Reveals that the strongest classical Hamiltonian baseline is a Warnsdorff variant using index-based tie-breaking with a terminal-inclusive residual-degree policy, reaching 79.0% Hamiltonian success — and that this underreported implementation detail materially affects performance
Shows that heuristics with explicit shortest-path reconnection reliably solve relaxed coverage but almost never produce zero-revisit tours, highlighting a fundamental limitation of greedy approaches on sparse geometric graphs with bottlenecks
Targets maritime surveillance, search-and-rescue, and environmental monitoring scenarios where hexagonal grids naturally model operational areas, providing the community with a controlled testbed for heuristic analysis

Show abstract

Coverage path planning on irregular hexagonal grids is relevant to maritime surveillance, search and rescue and environmental monitoring, yet classical methods are often compared on small ad hoc examples or on rectangular grids. This paper presents a reproducible benchmark of deterministic single-vehicle coverage path planning heuristics on irregular hexagonal graphs derived from synthetic but maritime-motivated areas of interest. The benchmark contains 10,000 Hamiltonian-feasible instances spanning compact, elongated, and irregular morphologies, 17 heuristics from seven families, and a common evaluation protocol covering Hamiltonian success, complete-coverage success, revisits, path length, heading changes, and CPU latency. Across the released dataset, heuristics with explicit shortest-path reconnection solve the relaxed coverage task reliably but almost never produce zero-revisit tours. Exact Depth-First Search confirms that every released instance is Hamiltonian-feasible. The strongest classical Hamiltonian baseline is a Warnsdorff variant that uses an index-based tie-break together with a terminal-inclusive residual-degree policy, reaching 79.0% Hamiltonian success. The dominant design choice is not tie-breaking alone, but how the residual degree is defined when the endpoint is reserved until the final move. This shows that underreported implementation details can materially affect performance on sparse geometric graphs with bottlenecks. The benchmark is intended as a controlled testbed for heuristic analysis rather than as a claim of operational optimality at fleet scale.
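To make the Warnsdorff family concrete, here is a minimal sketch of the basic rule on a small hexagonal grid: always move to the unvisited neighbour with the fewest unvisited neighbours of its own. The axial-coordinate grid, the coordinate-order tie-break, and the function names are assumptions for illustration; the paper's terminal-inclusive residual-degree policy is not reproduced here.

```python
def hex_graph(cells):
    """Adjacency over hex cells in axial coordinates (q, r); up to six neighbours each."""
    steps = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, -1), (-1, 1)]
    cells = set(cells)
    return {
        c: sorted((c[0] + dq, c[1] + dr) for dq, dr in steps
                  if (c[0] + dq, c[1] + dr) in cells)
        for c in cells
    }


def warnsdorff_path(adj, start):
    """Greedy tour: step to the unvisited neighbour with the lowest residual degree.
    Ties are broken by coordinate order (an assumed stand-in for an index-based rule).
    Returns a partial path if the walk gets stuck."""
    path, visited = [start], {start}
    while len(path) < len(adj):
        candidates = [n for n in adj[path[-1]] if n not in visited]
        if not candidates:
            return path  # dead end: the heuristic failed on this instance

        def residual_degree(n):
            return sum(1 for m in adj[n] if m not in visited)

        nxt = min(candidates, key=lambda n: (residual_degree(n), n))
        path.append(nxt)
        visited.add(nxt)
    return path


# 3x3 rhombus-shaped patch of hex cells
cells = [(q, r) for q in range(3) for r in range(3)]
adj = hex_graph(cells)
path = warnsdorff_path(adj, (0, 0))
is_hamiltonian = len(path) == len(adj)
```

The heuristic has no backtracking, so on irregular regions with bottlenecks it can stall before covering every cell, which is why small choices such as how residual degree is counted can change the success rate.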

🦿 Humanoid & Legged Locomotion

9 h=13

Switch: Learning Agile Skills Switching for Humanoid Robots

2026-04-16 cs.RO Yinhuai Wang · h=13

Yuen-Fui Lau, Qihan Zhao, Yinhuai Wang, Runyi Yu, Hok Wai Tsui

Core Contributions

Addresses a critical safety gap in humanoid locomotion: existing approaches train individual agile skills well but struggle with flexible transitions between them, creating dangerous instability during skill changes
Builds a Skill Graph from kinematic similarity within multi-skill motion data, establishing which cross-skill transitions are physically feasible before training — rather than discovering transitions through trial and error
An online skill scheduler performs real-time graph search to find optimal feasible transition paths when switching skills or recovering from tracking deviations, ensuring stable execution without offline pre-computation
Demonstrates high success rates for agile skill transitions while maintaining strong motion imitation performance — showing that transition quality and individual skill quality are not in fundamental conflict

Show abstract

Recent advancements in whole-body control through deep reinforcement learning have enabled humanoid robots to achieve remarkable progress in real-world challenging locomotion skills. However, existing approaches often struggle with flexible transitions between distinct skills, creating safety concerns and practical limitations. To address this challenge, we introduce a hierarchical multi-skill system, Switch, enabling seamless skill transitions at any moment. Our approach comprises three key components: (1) a Skill Graph (SG) that establishes potential cross-skill transitions based on kinematic similarity within multi-skill motion data, (2) a whole-body tracking policy trained on this skill graph through deep reinforcement learning, and (3) an online skill scheduler to drive the tracking policy for robust skill execution and smooth transitions. For skill switching or significant tracking deviations, the scheduler performs online graph search to find the optimal feasible path, which ensures efficient, stable, and real-time execution of diverse locomotion skills. Comprehensive experiments demonstrate that Switch empowers humanoids to execute agile skill transitions with high success rates while maintaining strong motion imitation performance.
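The scheduler's online graph search can be pictured as a shortest-path query over feasible transitions. The sketch below uses Dijkstra's algorithm on a tiny hand-written graph; the skill names, edge costs, and function name are all hypothetical, and the abstract does not say which search algorithm or graph granularity Switch actually uses.

```python
import heapq


def shortest_transition(graph, start, goal):
    """Dijkstra over a directed graph given as {node: {neighbour: cost}}.
    Returns (total_cost, path) or None when the goal is unreachable."""
    queue = [(0.0, start, [start])]
    settled = {}
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in settled and settled[node] <= cost:
            continue  # already expanded via a cheaper route
        settled[node] = cost
        for nxt, weight in graph.get(node, {}).items():
            heapq.heappush(queue, (cost + weight, nxt, path + [nxt]))
    return None


# Hypothetical skill graph: lower cost means a kinematically closer transition
skill_graph = {
    "walk": {"jog": 1.0, "crouch": 2.0},
    "jog": {"run": 1.0, "walk": 1.0},
    "run": {"jump": 1.5, "jog": 1.0},
    "crouch": {"jump": 4.0, "walk": 2.0},
    "jump": {},
}
result = shortest_transition(skill_graph, "walk", "jump")
```

Here the search routes through intermediate skills rather than taking the costlier direct-looking option, mirroring the idea that a feasible multi-step path can be safer than an abrupt switch.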

15 h=6

Model-Based Reinforcement Learning Exploits Passive Body Dynamics for High-Performance Biped Robot Locomotion

2026-04-16 cs.RO eess.SY Tomoya Kamimura · h=6

Tomoya Kamimura, Haruka Washiyama, Akihito Sano

Core Contributions

Shows that biped robots with passive elements (springs) trained via model-based deep RL converge to stable limit cycles through dynamic interaction with the ground — the attractor-driven learning produces locomotion that is both robust and energy-efficient
Compares robots with and without passive elements in simulation, finding that passive dynamics fundamentally change the learning landscape: trajectories converge quickly to limit cycles but take longer to achieve high rewards, suggesting a trade-off between stability and reward optimization speed
Argues that passive properties in the robot body matter for future embodied AI: robots with passive elements acquired high-performance locomotion by exploiting stable limit cycles that arise from body-ground interaction
Generates both walking and running gaits from the same framework, showing that passive dynamics enable multiple locomotion modes without separate skill-specific training

Show abstract

Embodiment is a key concept in recent machine learning research. This study focused on the passive nature of the body of a biped robot to generate walking and running locomotion using model-based deep reinforcement learning. We constructed two models in a simulator, one with passive elements (e.g., springs) and the other, which is similar to general humanoids, without passive elements. The training of the model with passive elements was highly affected by the attractor of the system. As a result, although the trajectories quickly converged to limit cycles, it took a long time to obtain large rewards. However, thanks to the attractor-driven learning, the acquired locomotion was robust and energy-efficient. The results revealed that robots with passive elements could efficiently acquire high-performance locomotion by utilizing stable limit cycles generated through dynamic interaction between the body and ground. This study demonstrates the importance of implementing passive properties in the body for future embodied AI.

🛡️ Sim2Real, Control & Safety

5 h=22

Abstract Sim2Real through Approximate Information States

2026-04-16 cs.RO Josiah P. Hanna · h=22

Yunfu Deng, Yuhao Li, Josiah P. Hanna

Core Contributions

Formalizes the "abstract sim2real" problem — transferring policies from simulators that deliberately omit key task details — using the language of state abstraction from RL theory, providing a principled framework rather than ad hoc domain randomization
Shows theoretically that an abstract simulator can be grounded to match the target task if the grounded dynamics take the history of states into account, connecting approximate information states to sim2real transfer guarantees
Introduces a practical method that uses real-world task data to correct the dynamics of the abstract simulator, enabling successful policy transfer even when the simulator's abstraction level is intentionally coarse
Validates the approach in both sim2sim and sim2real settings, demonstrating that the formalism translates to practical improvements when detailed simulators are unavailable or prohibitively expensive to build

Show abstract

In recent years, reinforcement learning (RL) has shown remarkable success in robotics when a fast and accurate simulator is available for a given task. When using RL and simulation, more simulator realism is generally beneficial but becomes harder to obtain as robots are deployed in increasingly complex and wide-scale domains. In such settings, simulators will likely fail to model all relevant details of a given target task, and this observation motivates the study of sim2real with simulators that leave out key task details. In this paper, we formalize and study the abstract sim2real problem: given an abstract simulator that models a target task at a coarse level of abstraction, how can we train a policy with RL in the abstract simulator and successfully transfer it to the real world? Our first contribution is to formalize this problem using the language of state abstraction from the RL literature. This framing shows that an abstract simulator can be grounded to match the target task if the grounded abstract dynamics take the history of states into account. Based on the formalism, we then introduce a method that uses real-world task data to correct the dynamics of the abstract simulator. We then show that this method enables successful policy transfer both in sim2sim and sim2real evaluation.

11 h=9

Vision-Based Safe Human-Robot Collaboration with Uncertainty Guarantees

2026-04-16 cs.RO cs.CV Jakob Thumm · h=9

Jakob Thumm, Marian Frei, Tianle Ni, Matthias Althoff, Marco Pavone

Core Contributions

Combines aleatoric uncertainty estimation with out-of-distribution detection for vision-based human pose estimation, achieving high probabilistic confidence that is essential for certifiable safety — unlike approaches that estimate only one type of uncertainty
Proposes conformal prediction sets for human motion predictions that hold with high, valid confidence, enabling integration with formal safety verification frameworks
Bridges the gap between vision-based perception (which is inherently uncertain) and certifiable safety frameworks (which require guaranteed bounds), making formal safety assurances practical for real-world human-robot collaboration
Evaluates the pipeline on both recorded human motion data and a real-world human-robot collaboration (HRC) setting

Abstract

We propose a framework for vision-based human pose estimation and motion prediction that gives conformal prediction guarantees for certifiably safe human-robot collaboration. Our framework combines aleatoric uncertainty estimation with OOD detection for high probabilistic confidence. To integrate our pipeline in certifiable safety frameworks, we propose conformal prediction sets for human motion predictions with high, valid confidence. We evaluate our pipeline on recorded human motion data and a real-world human-robot collaboration setting.
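As a rough illustration of how a conformal prediction set is calibrated, the sketch below implements generic split conformal prediction on synthetic 3-D positions. It is not the paper's pipeline: the function name `conformal_radius`, the Euclidean nonconformity score, and the synthetic data are assumptions made for this example. Under exchangeability of calibration and test data, a ball of the returned radius around each prediction covers the true value with probability at least 1 - alpha.

```python
import numpy as np

def conformal_radius(cal_pred, cal_true, alpha):
    """Split conformal calibration with Euclidean error as the score.

    Returns the k-th smallest calibration score, where
    k = ceil((n + 1) * (1 - alpha)); returns inf if k exceeds n.
    """
    scores = np.linalg.norm(cal_pred - cal_true, axis=-1)
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return float("inf")    # too few calibration points for this alpha
    return float(np.sort(scores)[k - 1])

# Synthetic stand-in for joint positions and a noisy predictor (assumption).
rng = np.random.default_rng(0)
true_cal = rng.normal(size=(500, 3))
pred_cal = true_cal + 0.05 * rng.normal(size=(500, 3))
radius = conformal_radius(pred_cal, true_cal, alpha=0.1)

# Empirical coverage on fresh data drawn from the same distribution.
true_test = rng.normal(size=(2000, 3))
pred_test = true_test + 0.05 * rng.normal(size=(2000, 3))
errors = np.linalg.norm(pred_test - true_test, axis=-1)
coverage = float(np.mean(errors <= radius))
print(f"radius = {radius:.3f}, empirical coverage = {coverage:.3f}")
```

The guarantee is marginal over calibration and test draws, so the empirical coverage fluctuates around the target level rather than matching it exactly.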

23 h=3

An Intelligent Robotic and Bio-Digestor Framework for Smart Waste Management

2026-04-16 cs.RO cs.LG M. Srinivas · h=3

Radhika Khatri, Adit Tewari, Nikhil Sharma, M. B. Srinivas

Core Contributions

Integrates a YOLOv8-based robotic waste segregation system (MyCobot 280 + Jetson Nano) with an optimized bio-digestor in a single end-to-end framework — connecting the sorting and processing stages that are typically treated separately
Achieves 98% sorting accuracy across four waste categories using ROS-based path planning with real-time YOLOv8 detection, reducing the need for manual intervention in the sorting stage
Uses Particle Swarm Optimization combined with a regression model to dynamically adjust bio-digestor parameters (temperature, pH, pressure, RPM), maximizing digestion efficiency under varying environmental conditions
Provides a scalable solution suitable for both residential and industrial applications, addressing the growing challenge of municipal waste management driven by rapid urbanization

Abstract

Rapid urbanization and continuous population growth have made municipal solid waste management increasingly challenging. These challenges highlight the need for smarter and automated waste management solutions. This paper presents the design and evaluation of an integrated waste management framework that combines two connected systems: a robotic waste segregation module and an optimized bio-digestor. The robotic waste segregation system uses a MyCobot 280 Jetson Nano robotic arm along with YOLOv8 object detection and Robot Operating System (ROS)-based path planning to identify and sort waste in real time. It classifies waste into four different categories with high precision, reducing the need for manual intervention. After segregation, the biodegradable waste is transferred to a bio-digestor system equipped with multiple sensors. These sensors continuously monitor key parameters, including temperature, pH, pressure, and motor revolutions per minute. The Particle Swarm Optimization (PSO) algorithm, combined with a regression model, is used to dynamically adjust system parameters. This intelligent optimization approach ensures stable operation and maximizes digestion efficiency under varying environmental conditions. System testing under dynamic conditions demonstrates a sorting accuracy of 98% along with highly efficient biological conversion. The proposed framework offers a scalable, intelligent, and practical solution for modern waste management, making it suitable for both residential and industrial applications.
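To make the PSO-plus-regression idea concrete, here is a minimal sketch of particle swarm optimization maximizing a surrogate efficiency model over two of the monitored parameters. Everything specific in it is an assumption for illustration: the quadratic `surrogate_efficiency` function, its coefficients and optimum, the parameter bounds, and the PSO hyperparameters are made up and are not taken from the paper.

```python
import numpy as np

def surrogate_efficiency(x):
    """Hypothetical fitted regression model of digestion efficiency.
    Peaks at 37 degrees C and pH 7.0; all coefficients are made up."""
    temp, ph = x[..., 0], x[..., 1]
    return 1.0 - 0.002 * (temp - 37.0) ** 2 - 0.05 * (ph - 7.0) ** 2

def pso_maximize(f, lower, upper, n_particles=30, n_iters=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Basic global-best PSO with inertia w and acceleration terms c1, c2."""
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    pos = rng.uniform(lower, upper, size=(n_particles, lower.size))
    vel = np.zeros_like(pos)
    best_pos, best_val = pos.copy(), f(pos)       # per-particle bests
    g = best_pos[np.argmax(best_val)].copy()      # swarm-wide best
    for _ in range(n_iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (best_pos - pos) + c2 * r2 * (g - pos)
        pos = np.clip(pos + vel, lower, upper)    # keep setpoints in range
        val = f(pos)
        improved = val > best_val
        best_pos[improved], best_val[improved] = pos[improved], val[improved]
        g = best_pos[np.argmax(best_val)].copy()
    return g, float(f(g))

# Search over (temperature in degrees C, pH) within illustrative bounds.
best_setpoint, best_eff = pso_maximize(
    surrogate_efficiency, lower=[20.0, 5.0], upper=[60.0, 9.0])
print(best_setpoint, best_eff)
```

In a deployed system the surrogate would be refitted from sensor data as conditions change, and the optimizer rerun to update the setpoints; that loop is omitted here.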

25 h=2

Energy-based Regularization for Learning Residual Dynamics in Neural MPC for Omnidirectional Aerial Robots

2026-04-16 eess.SY cs.RO Henrik Krauss · h=2

Johannes Kübel, Henrik Krauss, Jinjie Li, Moju Zhao

Core Contributions

Proposes a physics-inspired energy-based regularization loss that encourages the neural residual dynamics model to produce control corrections that stabilize the system's energy — injecting physical priors that standard neural MPC training lacks
Improves positional MAE by 23% over analytical MPC across three real-world omnidirectional aerial robot experiments, demonstrating that learned residual dynamics provide meaningful corrections to the nominal model
Achieves up to 15% lower MAE and significantly increased flight stability compared to standard neural MPC without energy regularization, showing that the regularization prevents the neural model from learning destabilizing corrections
The stability gain arises implicitly from the energy regularization applied during training, without adding explicit stability constraints to the MPC optimization

Abstract

Data-driven Model Predictive Control (MPC) has lately been the core research subject in the field of control theory. The combination of an optimal control framework with deep learning paradigms opens up the possibility to accurately track control tasks without the need for complex analytical models. However, the system dynamics are often nuanced and the neural model lacks the potential to understand physical properties such as inertia and conservation of energy. In this work, we propose a novel energy-based regularization loss function which is applied to the training of a neural model that learns the residual dynamics of an omnidirectional aerial robot. Our energy-based regularization encourages the neural network to cause control corrections that stabilize the energy of the system. The residual dynamics are integrated into the MPC framework and improve the positional mean absolute error (MAE) over three real-world experiments by 23% compared to an analytical MPC. We also compare our method to a standard neural MPC implementation without regularization and achieve up to 15% lower MAE and, primarily, significantly increased flight stability, which arises implicitly from the energy regularization. Our code is available at: https://github.com/johanneskbl/jsk_aerial_robot/tree/develop/neural_MPC.
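The abstract does not give the form of the regularization term, so the following is only one plausible reading, written as a sketch under stated assumptions. It adds to the usual regression loss a penalty on the kinetic-energy rate that the predicted residual acceleration would inject, computed as mass times the dot product of velocity and residual acceleration. The function name, the point-mass energy model, and the weight `lam` are hypothetical and should not be read as the authors' loss.

```python
import numpy as np

def energy_regularized_loss(pred_acc, target_acc, velocity, mass=1.0, lam=0.1):
    """Regression loss plus a hypothetical energy-injection penalty.

    pred_acc, target_acc, velocity: arrays of shape (batch, 3).
    Returns (total loss, data term, energy penalty).
    """
    # Data term: squared error between predicted and observed residuals.
    mse = float(np.mean(np.sum((pred_acc - target_acc) ** 2, axis=-1)))
    # Rate of kinetic-energy change caused by the residual: m * v . a
    power = mass * np.sum(velocity * pred_acc, axis=-1)
    # Penalize only corrections that add energy (assumption of this sketch).
    penalty = float(np.mean(np.maximum(power, 0.0)))
    return mse + lam * penalty, mse, penalty

velocity = np.array([[1.0, 0.0, 0.0]])
accelerating = np.array([[1.0, 0.0, 0.0]])   # pushes along the velocity
braking = np.array([[-1.0, 0.0, 0.0]])       # opposes the velocity

loss_acc, _, pen_acc = energy_regularized_loss(accelerating, accelerating, velocity)
loss_brk, _, pen_brk = energy_regularized_loss(braking, braking, velocity)
print(loss_acc, pen_acc, loss_brk, pen_brk)
```

With a perfect fit of the data term, the accelerating correction is still penalized while the braking one is not, which is the kind of bias toward non-destabilizing corrections the abstract describes qualitatively.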