🤖 Robotics arXiv Digest

Thursday, April 16, 2026

📄 28 papers 📂 7 research areas ✨ Generated by Claude

🔭 Research Landscape

Today's batch reveals a robotics community converging on policy learning architectures that reason beyond single-step action prediction. The World-Value-Action (WAV) model tackles the exponential decay in the probability of feasible trajectories in action space by moving planning into a learned latent space with trajectory value functions, while HiST-AT introduces hierarchical spatiotemporal tokenization for in-context imitation learning, and R3D diagnoses why 3D policy learning has historically underperformed, pinpointing missing data augmentation and Batch Normalization as culprits rather than fundamental architectural limitations. Together with ADAPT's affordance-aware planning and DockAnywhere's viewpoint-invariant demonstration generation, these papers collectively argue that the next leap in robot manipulation requires structured reasoning about what will happen and whether it should happen, not just faster imitation of demonstrations.

A second prominent thread is SLAM and localization in extreme or degraded environments. The CAVERS dataset provides a multimodal SLAM benchmark recorded inside a natural karstic cave with motion-capture ground truth, while CAL2M tackles kilometer-scale SLAM using Visual Geometry Foundation Models without any temporal or spatial pre-calibration. Meanwhile, two 4D radar papers (Graph Theoretical Outlier Rejection for open-pit mines and 4D Radar Gaussian Modeling with RCS) demonstrate that radar is maturing as a primary sensing modality for GPS-denied, visually degraded settings. The Dual Pose-Graph system for drone racing achieves a 56–74% ATE reduction over standalone VIO by fusing semantic gate detections with odometry, showing that domain structure can compensate for sensor limitations at extreme speeds.

A cross-cutting observation is the growing investment in infrastructure and datasets as first-class research contributions. DigiForest deploys heterogeneous robots (aerial, legged, marsupial) for precision forestry across multiple European sites; HRDexDB provides 1.4K grasping trials with synchronized tactile, visual, and kinematic data across human and robotic hands; and the multi-platform LiDAR forestry dataset links point clouds with decades of ecological flux measurements. The DEX-Mouse open-source teleoperation interface (under $150) further lowers the barrier to collecting dexterous manipulation data. This infrastructure turn suggests the community recognizes that scaling robot capabilities requires not just better algorithms but better data pipelines and benchmarks.

🧠 VLA & Policy Learning

Latent-space planning, hierarchical action tokenization, 3D policy architectures, affordance reasoning, and viewpoint-invariant imitation.

  • #6 HiST-AT — Hierarchical spatiotemporal tokenizer
  • #10 WAV — World-Value-Action implicit planning
  • #12 ADAPT — Affordance-aware commonsense planning
  • #17 DockAnywhere — View-generalized mobile manipulation
  • #21 R3D — Revisiting 3D policy learning

📍 SLAM & Localization

Visual-inertial-ranging fusion, calibration-free large-scale SLAM, cave datasets, and semantic pose graphs for drone racing.

  • #2 Sylvester Pose — Efficient closed-form solvers
  • #3 Dual Pose-Graph — Semantic drone racing loc.
  • #16 CAVERS — Cave multimodal SLAM dataset
  • #27 CAL2M — Calibration-free km-scale SLAM
  • #28 CT-VIR — Continuous-time VIR fusion

🌲 Radar, LiDAR & Field Sensing

Precision forestry with heterogeneous robots, multi-platform LiDAR datasets, and 4D radar scan matching advances.

  • #1 DigiForest — Digital forestry with autonomous robots
  • #7 Multi-platform LiDAR — Forest inventory dataset
  • #8 Graph PCM — 4D radar outlier rejection
  • #24 4D Radar Gaussian — RCS-aware scan matching

🤲 Manipulation & Grasping

Differentiable regrasp planning, large-scale dexterous grasping datasets, low-cost teleoperation, and POMDP-based object search.

  • #4 Differentiable Regrasp — EBM pose connectivity
  • #14 HRDexDB — Human & robotic dexterous grasps
  • #18 GNPF-kCT — POMDP object search in 3D
  • #20 DEX-Mouse — $150 teleoperation interface

🗺️ Navigation & Path Planning

Bio-inspired path planning, coverage planning benchmarks, multi-UAV trajectory optimization, and assistive trajectory frameworks.

  • #13 NEAT-NC — Neuro-evolution navigation cells
  • #19 Enhanced Tube-RRT* — Multi-UAV cascaded transport
  • #22 MHHTOF — Assistive trajectory optimization
  • #26 Hex Coverage — Maritime CPP benchmark

🦿 Humanoid & Legged Locomotion

Multi-skill switching for humanoids and passive body dynamics for energy-efficient biped walking and running.

  • #9 Switch — Agile humanoid skill transitions
  • #15 Passive Biped — Body dynamics exploit for RL

🛡️ Sim2Real, Control & Safety

Abstract sim2real transfer, energy-regularized neural MPC, conformal-prediction HRC safety, and robotic waste management.

  • #5 Abstract Sim2Real — Coarse simulator transfer
  • #11 Safe HRC — Conformal prediction guarantees
  • #23 Smart Waste — Robotic bio-digestor framework
  • #25 Energy MPC — Regularized neural MPC for UAVs
🧠 VLA & Policy Learning
#6 HiST-AT · h=21
2026-04-16 cs.RO Quoc-Huy Tran · h=21
Fawad Javed Fateh, Ali Shah Ali, Murad Popattia, Usman Nizamani, Andrey Konin
Core Contributions
  • Introduces a two-level vector quantization hierarchy where a lower level assigns actions to fine-grained subclusters and an upper level groups these into coarser clusters — unlike flat VQ approaches, this preserves both local action precision and global structure
  • Extends the spatial tokenizer with temporal cues by jointly recovering input actions and their timestamps, enabling the model to capture motion dynamics rather than treating actions as unordered sets
  • Achieves a new state of the art on multiple simulation and real-robot manipulation benchmarks for in-context imitation learning, where the agent must generalize from a handful of demonstrations at test time
  • Demonstrates that the hierarchical design consistently outperforms its non-hierarchical counterpart, suggesting that multi-resolution action representations are key to efficient few-shot policy transfer
Abstract
We present a novel hierarchical spatiotemporal action tokenizer for in-context imitation learning. We first propose a hierarchical approach, which consists of two successive levels of vector quantization. In particular, the lower level assigns input actions to fine-grained subclusters, while the higher level further maps fine-grained subclusters to clusters. Our hierarchical approach outperforms the non-hierarchical counterpart, while mainly exploiting spatial information by reconstructing input actions. Furthermore, we extend our approach by utilizing both spatial and temporal cues, forming a hierarchical spatiotemporal action tokenizer, namely HiST-AT. Specifically, our hierarchical spatiotemporal approach conducts multi-level clustering, while simultaneously recovering input actions and their associated timestamps. Finally, extensive evaluations on multiple simulation and real robotic manipulation benchmarks show that our approach establishes a new state-of-the-art performance in in-context imitation learning.
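The two-level quantization described above can be sketched as a nearest-centroid lookup followed by a table lookup from subcluster to cluster. This is only an illustration under stated assumptions, not the authors' implementation: the codebook values, the `SUB_TO_CLUSTER` table, and the function names are hypothetical, and the real tokenizer learns its codebooks and also recovers timestamps.

```python
import math

# Hypothetical fine-grained codebook: 4 subcluster centroids in a 2-D action space.
SUB_CODEBOOK = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 0.9)]
# Hypothetical upper level: each subcluster index maps to one coarse cluster index.
SUB_TO_CLUSTER = [0, 0, 1, 1]


def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda i: math.dist(codebook[i], vec))


def tokenize(action):
    """Map an action to a (cluster, subcluster) token pair."""
    sub = nearest(SUB_CODEBOOK, action)
    return SUB_TO_CLUSTER[sub], sub


def detokenize(token):
    """Reconstruct an action from the fine-grained subcluster centroid."""
    _, sub = token
    return SUB_CODEBOOK[sub]


token = tokenize((1.05, 0.97))
reconstructed = detokenize(token)
```

The coarse index carries global structure while the fine index keeps reconstruction error small, which is the intuition behind the multi-resolution claim.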
#10 WAV · h=10
2026-04-16 cs.RO cs.LG Hongyin Zhang · h=10
Runze Li, Hongyin Zhang, Junxi Jin, Qixin Zeng, Zifeng Zhuang
Core Contributions
  • Provides a theoretical analysis showing that planning directly in action space suffers from exponential probability decay of feasible trajectories with increasing horizon — motivating the shift to latent-space inference
  • Unifies world model, trajectory value function, and action generation into a single framework where the model progressively concentrates probability mass on high-value, dynamically feasible trajectories
  • Unlike explicit trajectory optimizers such as CEM or MPPI, WAV performs implicit planning through structured latent representations, formulating action generation as inference in the latent space rather than as explicit trajectory optimization
  • Demonstrates significant improvements in task success rate, generalization, and robustness over state-of-the-art VLA methods, with particularly strong gains in long-horizon and compositional scenarios
Abstract
Vision-Language-Action (VLA) models have emerged as a promising paradigm for building embodied agents that ground perception and language into action. However, most existing approaches rely on direct action prediction, lacking the ability to reason over long-horizon trajectories and evaluate their consequences, which limits performance in complex decision-making tasks. In this work, we introduce the World-Value-Action (WAV) model, a unified framework that enables implicit planning in VLA systems. Rather than performing explicit trajectory optimization, the WAV model learns a structured latent representation of future trajectories conditioned on visual observations and language instructions. A learned world model predicts future states, while a trajectory value function evaluates their long-horizon utility. Action generation is then formulated as inference in this latent space, where the model progressively concentrates probability mass on high-value and dynamically feasible trajectories. We provide a theoretical perspective showing that planning directly in action space suffers from an exponential decay in the probability of feasible trajectories as the horizon increases. In contrast, latent-space inference reshapes the search distribution toward feasible regions, enabling efficient long-horizon decision making. Extensive simulations and real-world experiments demonstrate that the WAV model consistently outperforms state-of-the-art methods, achieving significant improvements in task success rate, generalization ability, and robustness, especially in long-horizon and compositional scenarios.
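A back-of-the-envelope version of the exponential-decay argument, under a simplifying assumption that is not taken from the paper: if each step of a randomly sampled action sequence is feasible independently with probability `p_step`, a horizon-`H` trajectory is feasible with probability `p_step ** H`. The function names below are illustrative.

```python
def feasible_probability(p_step: float, horizon: int) -> float:
    """Probability that every step of a sampled trajectory is feasible,
    assuming independent per-step feasibility (a simplifying assumption)."""
    return p_step ** horizon


def expected_samples(p_step: float, horizon: int) -> float:
    """Expected number of action-space samples needed to hit one feasible trajectory."""
    return 1.0 / feasible_probability(p_step, horizon)


# Even a 90% per-step feasibility rate collapses quickly as the horizon grows.
for horizon in (1, 10, 50):
    print(horizon, feasible_probability(0.9, horizon), expected_samples(0.9, horizon))
```

Under this toy model the sampling budget grows exponentially with horizon, which is the motivation the paper gives for reshaping the search distribution in latent space.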
#12 ADAPT · h=9
2026-04-16 cs.AI cs.CL cs.CV cs.RO Jia-Fong Yeh · h=9
Pei-An Chen, Yong-Ching Liang, Jia-Fong Yeh, Hung-Ting Su, Yi-Ting Chen
Core Contributions
  • Introduces DynAfford, a benchmark where object affordances change dynamically over time and are never specified in the instruction — forcing agents to perceive states, infer preconditions, and adapt on the fly rather than blindly following commands
  • ADAPT is a plug-and-play module that augments any existing task planner with explicit affordance reasoning, checking whether target objects can actually be manipulated before committing to actions
  • A domain-adapted, LoRA-finetuned VLM used as the affordance inference backend outperforms GPT-4o, demonstrating that task-aligned fine-tuning beats scale for grounded physical reasoning
  • Shows significant robustness improvements across both seen and unseen environments, highlighting that affordance awareness is a missing piece in current embodied AI pipelines
Abstract
Intelligent embodied agents should not simply follow instructions, as real-world environments often involve unexpected conditions and exceptions. However, existing methods usually focus on directly executing instructions, without considering whether the target objects can actually be manipulated, meaning they fail to assess available affordances. To address this limitation, we introduce DynAfford, a benchmark that evaluates embodied agents in dynamic environments where object affordances may change over time and are not specified in the instruction. DynAfford requires agents to perceive object states, infer implicit preconditions, and adapt their actions accordingly. To enable this capability, we introduce ADAPT, a plug-and-play module that augments existing planners with explicit affordance reasoning. Experiments demonstrate that incorporating ADAPT significantly improves robustness and task success across both seen and unseen environments. We also show that a domain-adapted, LoRA-finetuned vision-language model used as the affordance inference backend outperforms a commercial LLM (GPT-4o), highlighting the importance of task-aligned affordance grounding.
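The plug-and-play idea can be pictured as a wrapper that checks each planned action against the current object state before it is executed. The sketch below is a hypothetical illustration: `with_affordance_check`, `is_afforded`, and `repair` are invented names, and a hand-written rule stands in for the VLM-based affordance inference the paper describes.

```python
from typing import Callable, Dict, List, Tuple

Action = Tuple[str, str]  # (verb, object), e.g. ("fill", "mug")
States = Dict[str, str]   # object name -> perceived state


def with_affordance_check(
    plan: List[Action],
    states: States,
    is_afforded: Callable[[Action, States], bool],
    repair: Callable[[Action, States], List[Action]],
) -> List[Action]:
    """Insert repair actions before any step whose target is not currently usable."""
    checked: List[Action] = []
    for action in plan:
        if not is_afforded(action, states):
            checked.extend(repair(action, states))
        checked.append(action)
    return checked


# Hypothetical rule standing in for a learned affordance model.
def is_afforded(action: Action, states: States) -> bool:
    verb, obj = action
    return not (verb == "fill" and states.get(obj) == "dirty")


def repair(action: Action, states: States) -> List[Action]:
    _, obj = action
    states[obj] = "clean"  # assume the repair action succeeds
    return [("wash", obj)]


plan = [("pick", "mug"), ("fill", "mug")]
result = with_affordance_check(plan, {"mug": "dirty"}, is_afforded, repair)
```

Because the wrapper only sees the plan and the perceived state, it can sit in front of any planner that emits a sequence of actions.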
#17 DockAnywhere · h=5
2026-04-16 cs.RO Ziyu Shan · h=5
Ziyu Shan, Yuheng Zhou, Gaoyuan Wu, Ziheng Ji, Zhenyu Wu
Core Contributions
  • Identifies the "view generalization problem" in mobile manipulation — where docking point shifts between training and deployment cause visuomotor policies to fail — and solves it by lifting a single demonstration to diverse feasible docking configurations
  • Decouples docking-dependent base motions from contact-rich manipulation skills that remain invariant across viewpoints, enabling structure-preserving augmentation of demonstrations
  • Synthesizes visual observations in 3D by representing robot and objects as point clouds and applying point-level spatial editing, ensuring observation-action consistency across viewpoints without requiring additional real demonstrations
  • Achieves substantial success rate improvements on both ManiSkill and real-world platforms, with strong generalization to completely novel docking points unseen during training
Abstract
Mobile manipulation is a fundamental capability that enables robots to interact in expansive environments such as homes and factories. Most existing approaches follow a two-stage paradigm, where the robot first navigates to a docking point and then performs fixed-base manipulation using powerful visuomotor policies. However, real-world mobile manipulation often suffers from the view generalization problem due to shifts of docking points. To address this issue, we propose a novel low-cost demonstration generation framework named DockAnywhere, which improves viewpoint generalization under docking variability by lifting a single demonstration to diverse feasible docking configurations. Specifically, DockAnywhere lifts a trajectory to any feasible docking point by decoupling docking-dependent base motions from contact-rich manipulation skills that remain invariant across viewpoints. Feasible docking proposals are sampled under feasibility constraints, and corresponding trajectories are generated via structure-preserving augmentation. Visual observations are synthesized in 3D space by representing the robot and objects as point clouds and applying point-level spatial editing to ensure the consistency of observation and action across viewpoints. Extensive experiments on ManiSkill and real-world platforms demonstrate that DockAnywhere substantially improves policy success rates and easily generalizes to novel viewpoints from docking points unseen during training, significantly enhancing the generalization capability of mobile manipulation policies in real-world deployment.
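One way to picture observation-action consistency across docking points: if the scene point cloud and the end-effector waypoints are both stored in the world frame, re-expressing both in the frame of a new base pose keeps them aligned. This is a simplified planar sketch with hypothetical names (`world_to_base`, `lift_demo`); it does not reproduce the paper's feasibility sampling or point-level editing.

```python
import math


def world_to_base(point, base_pose):
    """Express a world-frame (x, y, z) point in the frame of a base at (bx, by, yaw)."""
    bx, by, yaw = base_pose
    dx, dy = point[0] - bx, point[1] - by
    c, s = math.cos(yaw), math.sin(yaw)
    # Apply the inverse planar rotation R(yaw)^T to the offset from the base.
    return (c * dx + s * dy, -s * dx + c * dy, point[2])


def lift_demo(cloud_world, waypoints_world, new_base_pose):
    """Re-express one demonstration (observations and actions) for a new docking pose."""
    cloud = [world_to_base(p, new_base_pose) for p in cloud_world]
    waypoints = [world_to_base(w, new_base_pose) for w in waypoints_world]
    return cloud, waypoints


cloud_world = [(1.0, 1.0, 0.5), (1.2, 1.0, 0.5)]
waypoints_world = [(1.1, 1.0, 0.6)]
cloud, waypoints = lift_demo(cloud_world, waypoints_world, (1.0, 0.0, math.pi / 2))
```

Since the same rigid transform is applied to both lists, the distance between any waypoint and any observed point is unchanged, so the manipulation part of the demonstration stays valid from the new viewpoint.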
#21 R3D · h=3
2026-04-16 cs.CV cs.RO Zhengdong Hong · h=3
Zhengdong Hong, Shenrui Wu, Haozhe Cui, Boyi Zhao, Ran Ji
Core Contributions
  • Systematically diagnoses why 3D policy learning has underperformed expectations: the omission of 3D data augmentation and the adverse effects of Batch Normalization — rather than inherent limitations of 3D representations — are the primary culprits
  • Proposes a new architecture pairing a scalable transformer-based 3D encoder with a diffusion decoder, engineered for training stability at scale and designed to leverage large-scale 3D pre-training
  • Significantly outperforms state-of-the-art 3D baselines on challenging manipulation benchmarks, establishing that 3D policy learning can be competitive when training recipes are corrected
  • Opens the door to scaling 3D imitation learning by removing the instabilities and overfitting problems that previously prevented adopting powerful 3D perception models for robot control
Abstract
3D policy learning promises superior generalization and cross-embodiment transfer, but progress has been hindered by training instabilities and severe overfitting, precluding the adoption of powerful 3D perception models. In this work, we systematically diagnose these failures, identifying the omission of 3D data augmentation and the adverse effects of Batch Normalization as primary causes. We propose a new architecture coupling a scalable transformer-based 3D encoder with a diffusion decoder, engineered specifically for stability at scale and designed to leverage large-scale pre-training. Our approach significantly outperforms state-of-the-art 3D baselines on challenging manipulation benchmarks, establishing a new and robust foundation for scalable 3D imitation learning. Project Page: https://r3d-policy.github.io/
📍 SLAM & Localization
#2 Sylvester Pose · h=39
2026-04-16 cs.CV cs.RO E. Malis · h=39
Jana Vráblíková, Ezio Malis, Laurent Busé
Core Contributions
  • Introduces a new class of resultant-based solvers that exploit Sylvester forms to reduce the algebraic complexity of closed-form pose estimation — unlike prior resultant approaches that use larger elimination matrices
  • Demonstrates numerical accuracy on par with state-of-the-art solvers while achieving faster computational times, which is critical for real-time applications like visual servoing and SLAM
  • Applies the framework to both 3D-to-3D and 3D-to-2D correspondence problems (PnP), showing the generality of the Sylvester form approach across different pose estimation variants
  • Leverages a careful rotation parametrization to reduce the optimization to a polynomial system, then exploits Sylvester forms to reduce the complexity of solving it
Abstract
Solving the non-linear least-squares problem for pose estimation (rotation and translation) is often a time-consuming yet fundamental problem in several real-time computer vision applications. With an adequate rotation parametrization, the optimization problem can be reduced to the solution of a system of polynomial equations and solved in closed form. Recent advances in efficient closed-form solvers utilizing resultant matrices have shown a promising research direction to decrease the computation time while preserving the estimation accuracy. In this paper, we propose a new class of resultant-based solvers that exploit Sylvester forms to further reduce the complexity of the resolution. We demonstrate that our proposed methods are numerically as accurate as the state-of-the-art solvers, and outperform them in terms of computational time. We show that this approach can be applied to pose estimation in two different types of problems: estimating a pose from 3D-to-3D correspondences, and estimating a pose from 3D-to-2D point correspondences.
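For background on resultant-based elimination, the sketch below builds the classical Sylvester matrix of two univariate polynomials and takes its determinant, which vanishes exactly when the polynomials share a root. This is textbook material offered as context only; it is not the paper's Sylvester-form construction for multivariate pose systems, and the function names are illustrative.

```python
from fractions import Fraction


def sylvester_matrix(f, g):
    """Sylvester matrix of f and g, given as coefficient lists (highest degree first)."""
    m, n = len(f) - 1, len(g) - 1
    size = m + n
    rows = []
    for i in range(n):  # n shifted copies of f
        rows.append([0] * i + list(f) + [0] * (size - m - 1 - i))
    for i in range(m):  # m shifted copies of g
        rows.append([0] * i + list(g) + [0] * (size - n - 1 - i))
    return rows


def determinant(matrix):
    """Exact determinant via Gaussian elimination over rationals."""
    a = [[Fraction(v) for v in row] for row in matrix]
    n = len(a)
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def resultant(f, g):
    """Resultant of two univariate polynomials; zero iff they share a root."""
    return determinant(sylvester_matrix(f, g))


shared = resultant([1, -3, 2], [1, -1])    # (x-1)(x-2) and (x-1)
disjoint = resultant([1, -3, 2], [1, -3])  # (x-1)(x-2) and (x-3)
```

The size of the elimination matrix drives the cost of such solvers, which is why a construction that shrinks it can translate into faster pose estimation.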
#3 Dual Pose-Graph · h=28
2026-04-16 cs.RO P. Campoy · h=28
David Perez-Saura, Miguel Fernandez-Cortizas, Alvaro J. Gaona, Pascual Campoy
Core Contributions
  • Proposes a dual pose-graph architecture where a temporary graph accumulates multiple gate observations between keyframes and optimizes them into a single refined constraint per landmark before promoting it to a persistent main graph — preventing graph growth from degrading real-time performance
  • Achieves 56% to 74% reduction in Absolute Trajectory Error compared to standalone VIO on the TII-RATM dataset, demonstrating that structured racing environments can be exploited for robust localization
  • Ablation study confirms the dual-graph design achieves 10–12% higher accuracy than a single-graph baseline at identical computational cost, validating the information-preserving design
  • Successfully deployed in the A2RL competition for real-time onboard localization during high-speed flight, reducing drift by up to 4.2 m per lap compared to the odometry baseline
Abstract
Autonomous drone racing demands robust real-time localization under extreme conditions: high-speed flight, aggressive maneuvers, and payload-constrained platforms that often rely on a single camera for perception. Existing visual SLAM systems, while effective in general scenarios, struggle with motion blur and feature instability inherent to racing dynamics, and do not exploit the structured nature of racing environments. In this work, we present a dual pose-graph architecture that fuses odometry with semantic detections for robust localization. A temporary graph accumulates multiple gate observations between keyframes and optimizes them into a single refined constraint per landmark, which is then promoted to a persistent main graph. This design preserves the information richness of frequent detections while preventing graph growth from degrading real-time performance. The system is designed to be sensor-agnostic, although in this work we validate it using monocular visual-inertial odometry and visual gate detections. Experimental evaluation on the TII-RATM dataset shows a 56% to 74% reduction in ATE compared to standalone VIO, while an ablation study confirms that the dual-graph architecture achieves 10% to 12% higher accuracy than a single-graph baseline at identical computational cost. Deployment in the A2RL competition demonstrated that the system performs real-time onboard localization during flight, reducing the drift of the odometry baseline by up to 4.2 m per lap.
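The accumulate-then-promote pattern can be sketched with a small buffer class. As a loudly stated assumption, the sketch replaces the paper's temporary pose-graph optimization with simple inverse-variance averaging of scalar measurements; the class and method names are hypothetical.

```python
from collections import defaultdict


class DualGraphBuffer:
    """Toy stand-in for a temporary graph feeding a persistent main graph."""

    def __init__(self):
        # gate_id -> list of (measurement, variance) seen since the last keyframe
        self.temporary = defaultdict(list)
        # promoted constraints: (keyframe_id, gate_id, fused_value, fused_variance)
        self.main = []

    def add_observation(self, gate_id, measurement, variance):
        """Accumulate one gate detection in the temporary structure."""
        self.temporary[gate_id].append((measurement, variance))

    def promote(self, keyframe_id):
        """Fuse each gate's observations into one constraint and clear the buffer."""
        for gate_id, observations in self.temporary.items():
            weights = [1.0 / var for _, var in observations]
            total = sum(weights)
            fused = sum(w * m for w, (m, _) in zip(weights, observations)) / total
            self.main.append((keyframe_id, gate_id, fused, 1.0 / total))
        self.temporary.clear()


buffer = DualGraphBuffer()
buffer.add_observation("gate_1", 10.0, 1.0)
buffer.add_observation("gate_1", 12.0, 1.0)
buffer.promote(keyframe_id=0)
```

However many detections arrive between keyframes, the main structure grows by one constraint per landmark, which is the property the paper relies on to keep optimization real-time.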
#16 CAVERS · h=5
2026-04-16 cs.RO Marcello Chiaberge · h=5
Giacomo Franchini, David Rodríguez-Martínez, Alfonso Martínez-Petersen, C. J. Pérez-del-Pulgar, Marcello Chiaberge
Core Contributions
  • Provides a multimodal SLAM dataset from a natural karstic cave, with mm-accurate 6-DoF ground truth at 120 Hz for most sequences from a motion capture system installed directly inside the cave; the environment is uniquely challenging, with irregular geometry, reflective wet surfaces, and near-zero ambient light
  • Combines RGB-D-I, thermal-IR, and LiDAR sensing across 24 sequences (~335 GB) in both handheld and rover configurations under full darkness and artificial illumination conditions
  • Benchmarks seven state-of-the-art SLAM/odometry algorithms spanning visual, visual-inertial, thermal-inertial, and LiDAR pipelines, establishing quantitative baselines for cave perception
  • Addresses a critical gap: existing subterranean datasets focus on mines or tunnels, which have regular geometry fundamentally different from the branching, irregular passages of natural caves
Abstract
Autonomous robots operating in natural karstic caves face perception and navigation challenges that are qualitatively distinct from those encountered in mines or tunnels: irregular geometry, reflective wet surfaces, near-zero ambient light, and complex branching passages. Yet publicly available datasets targeting this environment remain scarce and offer limited sensing modalities and environmental diversity. We present CAVERS, a multimodal dataset acquired in two structurally distinct rooms of Cueva de la Victoria, Málaga, Spain, comprising 24 sequences totaling approximately 335 GB of recorded data. The sensor suite combines an Intel RealSense D435i RGB-D-I camera, an Optris PI640i near-IR thermal camera, and a Velodyne VLP-16 LiDAR, operated both handheld and mounted on a wheeled rover under full darkness and artificial illumination. For most of the sequences, mm-accurate 6-DoF ground truth pose and velocity at 120 Hz are provided by an OptiTrack motion capture system installed directly inside the cave. We benchmark seven state-of-the-art SLAM and odometry algorithms spanning visual, visual-inertial, thermal-inertial, and LiDAR-based pipelines, as well as a 3D reconstruction pipeline, demonstrating the dataset's usability.
#27 CAL2M · h=1
2026-04-16 cs.RO Tianchen Deng · h=1
Tianjun Zhang, Fengyi Zhang, Tianchen Deng, Lin Zhang, Hesheng Wang
Core Contributions
  • Argues that single linear transforms (Sim3, SL4) are fundamentally insufficient for aligning Visual Geometry Foundation Model outputs at kilometer scale, because VGFMs introduce complex non-linear geometric distortions that accumulate into trajectory drift
  • Introduces an "assistant eye" that exploits the prior of constant physical spacing to eliminate scale ambiguity without any temporal or spatial pre-calibration
  • Proposes an epipolar-guided intrinsic and pose correction model that rectifies rotation and translation errors caused by inaccurate intrinsics through fundamental matrix decomposition
  • Uses anchor propagation for globally consistent mapping, applying nonlinear transformations to elastically align sub-maps rather than rigid transforms, effectively eliminating geometric misalignment at scale
Abstract
Visual Geometry Foundation Models (VGFMs) demonstrate remarkable zero-shot capabilities in local reconstruction. However, deploying them for kilometer-level Simultaneous Localization and Mapping (SLAM) remains challenging. In such scenarios, current approaches mainly rely on linear transforms (e.g., Sim3 and SL4) for sub-map alignment, while we argue that a single linear transform is fundamentally insufficient to model the complex, non-linear geometric distortions inherent in VGFM outputs. Forcing such rigid alignment leads to the rapid accumulation of uncorrected residuals, eventually resulting in significant trajectory drift and map divergence. To address these limitations, we present CAL2M (Calibration-free Assistant-eye based Large-scale Localization and Mapping), a plug-and-play framework compatible with arbitrary VGFMs. Distinct from traditional systems, CAL2M introduces an "assistant eye" solely to leverage the prior of constant physical spacing, effectively eliminating scale ambiguity without any temporal or spatial pre-calibration. Furthermore, leveraging the assumption of accurate feature matching, we propose an epipolar-guided intrinsic and pose correction model. Supported by an online intrinsic search module, it can effectively rectify rotation and translation errors caused by inaccurate intrinsics through fundamental matrix decomposition. Finally, to ensure accurate mapping, we introduce a globally consistent mapping strategy based on anchor propagation. By constructing and fusing anchors across the trajectory, we establish a direct local-to-global mapping relationship. This enables the application of nonlinear transformations to elastically align sub-maps, effectively eliminating geometric misalignments and ensuring a globally consistent reconstruction. The source code of CAL2M will be publicly available at https://github.com/IRMVLab/CALM.
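One plausible reading of the constant-spacing prior, offered as an assumption rather than the paper's actual formulation: if the main and assistant cameras are rigidly mounted a known distance apart, the ratio between that distance and the reconstructed distance between their estimated centers gives a metric scale factor. The function name and inputs below are hypothetical.

```python
import math
from statistics import median


def metric_scale(main_centers, assistant_centers, spacing_m):
    """Estimate the factor that converts reconstruction units to meters.

    main_centers / assistant_centers: per-frame estimated camera centers in the
    reconstruction's arbitrary units; spacing_m: known physical spacing in meters.
    """
    ratios = [
        spacing_m / math.dist(main, assistant)
        for main, assistant in zip(main_centers, assistant_centers)
    ]
    # The median is used so that a few badly reconstructed frames have little effect.
    return median(ratios)


main_centers = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
assistant_centers = [(0.5, 0.0, 0.0), (1.5, 0.0, 0.0), (2.5, 0.0, 0.0)]
scale = metric_scale(main_centers, assistant_centers, spacing_m=0.2)
```

Only the distance between the two cameras is needed here, not their relative orientation or time offset, which is consistent with the calibration-free framing.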
#28 CT-VIR · h=0
2026-04-16 cs.RO Yu-An Liu · h=0
Yu-An Liu, Li Zhang
Core Contributions
  • Addresses the practical constraint that high-accuracy UWB localization requires well-deployed anchors — which is often infeasible in narrow or low-power environments — by constructing virtual anchors from VIO motion priors and UWB measurements
  • Parameterizes the pose trajectory using B-splines in continuous time, enabling natural handling of asynchronous multi-sensor sampling without the discrete-time alignment issues that plague filtering and standard optimization methods
  • Formulates inertial, visual, and ranging constraints as factors in a sliding-window graph, jointly optimizing spline control points and auxiliary parameters for a continuous-time trajectory estimate
  • Demonstrates effectiveness on public datasets and real-world experiments, showing that spline-based continuous-time fusion can balance positioning accuracy, trajectory consistency, and computational efficiency
Abstract
Visual-inertial odometry (VIO) is widely used for mobile robot localization, but its long-term accuracy degrades without global constraints. Incorporating ranging sensors such as ultra-wideband (UWB) can mitigate drift; however, high-accuracy ranging usually requires well-deployed anchors, which is difficult to ensure in narrow or low-power environments. Moreover, most existing visual-inertial-ranging (VIR) fusion methods rely on discrete time-based filtering or optimization, making it difficult to balance positioning accuracy, trajectory consistency, and fusion efficiency under asynchronous multi-sensor sampling. To address these issues, we propose a spline-based continuous-time state estimation method for VIR fusion localization. In the preprocessing stage, VIO motion priors and UWB ranging measurements are used to construct virtual anchors and reject outliers, thereby alleviating geometric degeneration and improving range reliability. In the estimation stage, the pose trajectory is parameterized in continuous time using a B-spline, while inertial, visual, and ranging constraints are formulated as factors in a sliding-window graph. The spline control points, together with a small set of auxiliary parameters, are then jointly optimized to obtain a continuous-time trajectory estimate. Evaluations on public datasets and real-world experiments demonstrate the effectiveness and practical potential of the proposed approach.
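Why a spline helps with asynchronous sensors: once the trajectory is a continuous function of time, each measurement can be compared against the pose at its own timestamp. The sketch below evaluates a uniform cubic B-spline on positions only; the paper parameterizes full poses, and the function name, control points, and timestamps here are illustrative assumptions.

```python
import math


def cubic_bspline_position(control_points, t0, dt, t):
    """Evaluate a uniform cubic B-spline trajectory at time t.

    control_points: list of equal-length tuples; t0: start time of the first
    segment; dt: knot spacing. Segment i blends control points i..i+3.
    """
    s = (t - t0) / dt
    i = min(max(int(math.floor(s)), 0), len(control_points) - 4)
    u = s - i
    # Standard uniform cubic B-spline basis functions (they sum to 1).
    basis = (
        (1 - u) ** 3 / 6,
        (3 * u**3 - 6 * u**2 + 4) / 6,
        (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6,
        u**3 / 6,
    )
    dims = len(control_points[0])
    return tuple(
        sum(b * control_points[i + j][d] for j, b in enumerate(basis))
        for d in range(dims)
    )


controls = [(0.0, 0.0), (1.0, 0.0), (2.0, 1.0), (3.0, 1.0), (4.0, 0.0)]
# Camera and UWB samples arrive at unrelated timestamps but query the same spline.
camera_position = cubic_bspline_position(controls, t0=0.0, dt=0.1, t=0.033)
uwb_position = cubic_bspline_position(controls, t0=0.0, dt=0.1, t=0.1275)
```

In an estimator, the control points would be the optimization variables, and each inertial, visual, or ranging factor would evaluate the spline at its own measurement time.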
🌲 Radar, LiDAR & Field Sensing
#1 DigiForest · h=81
2026-04-16 cs.RO C. Stachniss · h=81
Marco Camurri, Enrico Tomelleri, Matías Mattamala, Sebastián Barbas Laina, Martin Jacquet
Core Contributions
  • Deploys a complete precision forestry pipeline spanning autonomous data collection with heterogeneous robots (aerial, legged, and marsupial), automated tree trait extraction, a decision support system for growth forecasting, and autonomous selective logging harvesters
  • Validates all four components in real-world forests across Finland, UK, and Switzerland — not just in simulation — demonstrating that the full technology stack works under actual forestry conditions
  • Addresses a critical EU policy need: forests cover ~40% of European land and are central to climate neutrality and biodiversity goals, but current management practices lack the automation needed for tree-level precision at scale
  • The marsupial robot design, in which one robot carries and deploys another, targets tree-level data collection that is difficult for an aerial or a legged platform alone
Abstract
Covering one third of Earth's land surface, forests are vital to global biodiversity, climate regulation, and human well-being. In Europe, forests and woodlands reach approximately 40% of land area, and the forestry sector is central to achieving the EU's climate neutrality and biodiversity goals; these emphasize sustainable forest management, increased use of long-lived wood products, and resilient forest ecosystems. To meet these goals and properly address their inherent challenges, current practices require further innovation. This chapter introduces DigiForest, a novel, large-scale precision forestry approach leveraging digital technologies and autonomous robotics. DigiForest is structured around four main components: (1) autonomous, heterogeneous mobile robots (aerial, legged, and marsupial) for tree-level data collection; (2) automated extraction of tree traits to build forest inventories; (3) a Decision Support System (DSS) for forecasting forest growth and supporting decision-making; and (4) low-impact selective logging using purpose-built autonomous harvesters. These technologies have been extensively validated in real-world conditions in several locations, including forests in Finland, the UK, and Switzerland.
#7 Multi-platform LiDAR · h=19
2026-04-16 cs.RO K. Ellenrieder · h=19
Michael R. Chang, Anna Candotti, Karl von Ellenrieder, Enrico Tomelleri, Marco Camurri
Core Contributions
  • Integrates UAV-borne, terrestrial, and backpack mobile laser scanning from an ICOS forest plot into a single curated dataset — explicitly designed for calibration, benchmarking, and linking 3D structure with ecological observations and allometric models
  • Uses marker-free, SLAM-aware protocols to reduce both field and processing time while maintaining registration quality — a practical innovation for repeated forest inventories at scale
  • Situates acquisitions at a long-term ICOS monitoring site with decades of ecological and flux measurements, creating a unique bridge between remote sensing and established ecological science
  • Provides 333 million TLS points with complementary ULS/MLS data in LAZ and E57 formats with UTM coordinates, enabling benchmarks for registration, segmentation, quantitative structure models, and biomass estimation
Abstract
We present a curated multi-platform LiDAR reference dataset from an instrumented ICOS forest plot, explicitly designed to support calibration, benchmarking, and integration of 3D structural data with ecological observations and standard allometric models. The dataset integrates UAV-borne laser scanning (ULS) to measure canopy coverage, terrestrial laser scanning (TLS) for detailed stem mapping, and backpack mobile laser scanning (MLS) with real-time SLAM for efficient sub-canopy acquisition. We focus on the control plot with the most complete and internally consistent registration, where TLS point clouds (~333 million points) are complemented by ULS and MLS data capturing canopy and understory strata. Marker-free, SLAM-aware protocols were used to reduce field and processing time, while manual and automated methods were combined. Final products are available in LAZ and E57 formats with UTM coordinates, together with registration reports for reproducibility. The dataset provides a benchmark for testing registration methods, evaluating scanning efficiency, and linking point clouds with segmentation, quantitative structure models, and allometric biomass estimation. By situating the acquisitions at a long-term ICOS site, it explicitly links 3D structure with decades of ecological and flux measurements. More broadly, it illustrates how TLS, MLS, and ULS can be combined for repeated inventories and digital twins of forest ecosystems.
8 h=13
2026-04-16 cs.RO Daniel Adolfsson · h=13
Georg Dorndorf, Daniel Adolfsson, Masrur Doostdar
Core Contributions
  • Integrates graph-based pairwise consistency maximization (PCM) into the ICP loop for 4D radar, with a radar-adapted scoring function that incorporates anisotropic, per-detection uncertainty from a measurement model — unlike standard PCM that assumes isotropic noise
  • Reduces segment relative position error by 29.6% on 1 m segments and up to 55% on 100 m segments compared to the GICP baseline on real open-pit mine data — a setting where feature poverty makes correspondence reliability particularly poor
  • Uses a greedy heuristic to approximate maximum clique finding in the consistency graph, keeping the method suitable for online use despite the combinatorial nature of the outlier rejection problem
  • Specifically targets open-pit mines where the lack of distinctive landmarks compounds radar's inherent challenges of scan sparsity and multipath reflections, demonstrating practical viability in industrial settings
Show abstract
Automotive 4D imaging radar is well suited for operation in dusty and low-visibility environments, but scan registration remains challenging due to scan sparsity and spurious detections caused by noise and multipath reflections. This difficulty is compounded in feature-poor open-pit mines, where the lack of distinctive landmarks reduces correspondence reliability. We integrate graph-based pairwise consistency maximization (PCM) as an outlier rejection step within the iterative closest point (ICP) loop. We propose a radar-adapted pairwise distance-invariant scoring function for graph-based PCM that incorporates anisotropic, per-detection uncertainty derived from a radar measurement model. The consistency maximization problem is approximated with a greedy heuristic that finds a large clique in the pairwise consistency graph. The refined correspondence set improves robustness when the initial association set is heavily contaminated. We evaluate a standard Euclidean distance residual and our uncertainty-aware residual on an open-pit mine dataset collected with a 4D imaging radar. Compared to the generalized ICP (GICP) baseline without PCM, our method reduces segment relative position error (RPE) by 29.6% on 1 m segments and by up to 55% on 100 m segments. The presented method is intended for integration into localization pipelines and is suitable for online use due to the greedy heuristic in graph-based PCM.
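The outlier rejection step described above can be illustrated with a minimal sketch. It assumes the plain Euclidean distance residual with a fixed tolerance rather than the paper's uncertainty-aware score, and the function names, the tolerance, and the toy points are all hypothetical. Two correspondences are treated as consistent when the distance between their source points matches the distance between their target points, and a clique is grown greedily in the resulting graph.

```python
import itertools
import numpy as np

def consistency_graph(src, dst, tol):
    """Boolean adjacency matrix over correspondences (src[i] <-> dst[i]).

    Correspondences i and j are linked when the source-side distance and the
    target-side distance agree within tol (rigid motions preserve distances).
    """
    n = len(src)
    adj = np.zeros((n, n), dtype=bool)
    for i, j in itertools.combinations(range(n), 2):
        d_src = np.linalg.norm(src[i] - src[j])
        d_dst = np.linalg.norm(dst[i] - dst[j])
        adj[i, j] = adj[j, i] = abs(d_src - d_dst) < tol
    return adj

def greedy_clique(adj):
    """Greedy approximation of a large clique (not guaranteed to be maximum)."""
    candidates = set(range(adj.shape[0]))
    clique = []
    while candidates:
        # Pick the candidate linked to the most other remaining candidates;
        # ties go to the lowest index because sorted() orders the scan.
        best = max(sorted(candidates),
                   key=lambda v: sum(adj[v, u] for u in candidates))
        clique.append(best)
        # Keep only candidates consistent with everything chosen so far.
        candidates = {u for u in candidates if adj[best, u]}
    return sorted(clique)

# Toy data: points 0..3 undergo a 90 degree rotation plus a shift, point 4 is an outlier.
src = np.array([[0, 0], [1, 0], [0, 2], [3, 1], [2, 2]], dtype=float)
dst = np.array([[5, 5], [5, 6], [3, 5], [4, 8], [20, -7]], dtype=float)
inliers = greedy_clique(consistency_graph(src, dst, tol=1e-6))
```

In an ICP loop, only the correspondences in `inliers` would be passed on to the alignment step for that iteration.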
24 h=2
2026-04-16 cs.RO Fernando Amodeo · h=2
Fernando Amodeo, Luis Merino, Fernando Caballero
Core Contributions
  • Incorporates Radar Cross Section (RCS) — a measure of how much radar energy an object reflects — into 3D Gaussian scene models, whereas prior 4D radar work typically discards RCS information during modeling and matching
  • Extends previous 3D Gaussian modeling and scan matching frameworks to model the physical behavior of RCS, enriching scene representations beyond just geometric position and Doppler velocity
  • Demonstrates that including RCS improves scan matching performance by providing additional discriminative information about surface materials and geometry, which is especially valuable when spatial features alone are ambiguous
Show abstract
4D millimeter-wave (mmWave) radars are increasingly used in robotics, as they offer robustness against adverse environmental conditions. Besides the usual XYZ position, they provide Doppler velocity measurements as well as Radar Cross Section (RCS) information for every point. While Doppler is widely used to filter out dynamic points, RCS is often overlooked and not usually used in modeling and scan matching processes. Building on previous 3D Gaussian modeling and scan matching work, we propose incorporating the physical behavior of RCS in the model, in order to further enrich the summarized information about the scene, and improve the scan matching process.
🤲 Manipulation & Grasping
4 h=27
2026-04-16 cs.RO Weiwei Wan · h=27
Liang Qin, Weiwei Wan, Kensuke Harada
Core Contributions
  • Replaces brittle discrete search over intermediate placements with a continuous, differentiable energy landscape for measuring pose-pair connectivity — enabling gradient-based optimization of intermediate object poses during regrasp planning
  • Models grasp feasibility under an object pose using an Energy-Based Model (EBM) and exploits energy additivity to construct a smooth connectivity metric, providing informative gradients that discrete approaches fundamentally cannot
  • Introduces an adaptive iterative deepening strategy that automatically determines the minimum number of intermediate regrasp steps, eliminating the need to pre-specify the number of regrasps
  • Demonstrates cross-end-effector transfer: a model trained with suction constraints can guide parallel gripper manipulation, suggesting the learned energy landscape captures general pose connectivity rather than gripper-specific feasibility
Show abstract
Regrasp planning is often required when one pick-and-place cannot transfer an object from an initial pose to a goal pose while maintaining grasp feasibility. The main challenge is to reason about shared-grasp connectivity across intermediate poses, where discrete search becomes brittle. We propose an implicit multi-step regrasp planning framework based on differentiable pose sequence connectivity metrics. We model grasp feasibility under an object pose using an Energy-Based Model (EBM) and leverage energy additivity to construct a continuous energy landscape that measures pose-pair connectivity, enabling gradient-based optimization of intermediate object poses. An adaptive iterative deepening strategy is introduced to determine the minimum number of intermediate steps automatically. Experiments show that the proposed cost formulation provides smooth and informative gradients, improving planning robustness over other alternatives. They also demonstrate generalization to unseen grasp poses and cross-end-effector transfer, where a model trained with suction constraints can guide parallel gripper grasp manipulation. The multi-step planning results further highlight the effectiveness of adaptive deepening and minimum-step search.
14 h=8
2026-04-16 cs.RO cs.CV Hanbyul Joo · h=8
Jongbin Lim, Taeyun Ha, Mingi Choi, Jisoo Kim, Byungjun Kim
Core Contributions
  • Provides closely aligned captures of human dexterity and robotic execution on the same 100 objects under comparable grasping motions — enabling direct cross-domain comparison that existing datasets (which cover either humans or robots, not both) cannot support
  • Includes 1.4K grasping trials with synchronized high-resolution tactile signals, multi-view video, egocentric video, and high-precision 3D motion capture for both agent and manipulated object
  • Captures both successes and failures, providing negative examples that are critical for learning robust policies but are typically absent from curated datasets
  • Serves as a foundational benchmark for multi-modal policy learning and cross-domain dexterous manipulation transfer — the aligned human-robot format specifically enables studying how human grasping strategies translate to different robot hand embodiments
Show abstract
We present HRDexDB, a large-scale, multi-modal dataset of high-fidelity dexterous grasping sequences featuring both human and diverse robotic hands. Unlike existing datasets, HRDexDB provides a comprehensive collection of grasping trajectories across human hands and multiple robot hand embodiments, spanning 100 diverse objects. Leveraging state-of-the-art vision methods and a new dedicated multi-camera system, our HRDexDB offers high-precision spatiotemporal 3D ground-truth motion for both the agent and the manipulated object. To facilitate the study of physical interaction, HRDexDB includes high-resolution tactile signals, synchronized multi-view video, and egocentric video streams. The dataset comprises 1.4K grasping trials, encompassing both successes and failures, each enriched with visual, kinematic, and tactile modalities. By providing closely aligned captures of human dexterity and robotic execution on the same target objects under comparable grasping motions, HRDexDB serves as a foundational benchmark for multi-modal policy learning and cross-domain dexterous manipulation.
18 h=5
2026-04-16 cs.RO Shoudong Huang · h=5
Yongbo Chen, Hesheng Wang, Shoudong Huang, Hanna Kurniawati
Core Contributions
  • Formulates object search as a high-dimensional POMDP with a growing state space and hybrid (continuous+discrete) action spaces in 3D — handling the realistic complexity that state and action spaces expand as the robot discovers new objects and areas
  • Proposes GNPF-kCT, a novel online POMDP solver combining belief tree reuse for growing state spaces, a neural process network to filter useless primitive actions, and k-center clustering for efficient high-dimensional action space refinement
  • Outperforms both POMDP-based baselines and state-of-the-art LLM-based object search methods in Gazebo simulations with Fetch and Stretch robots under the same computational and perception constraints
  • Introduces a "guessed target object" strategy with a grid-world model to handle limited-information scenarios, providing useful search guidance when reward signals are sparse or absent
Show abstract
Efficiently locating target objects in complex indoor environments with diverse furniture, such as shelves, tables, and beds, is a significant challenge for mobile robots. This difficulty arises from factors like localization errors, limited fields of view, and visual occlusion. We address this by framing the object-search task as a high-dimensional Partially Observable Markov Decision Process (POMDP) with a growing state space and hybrid (continuous and discrete) action spaces in 3D environments. Based on a meticulously designed perception module, a novel online POMDP solver named the growing neural process filtered k-center clustering tree (GNPF-kCT) is proposed to tackle this problem. Optimal actions are selected using Monte Carlo Tree Search (MCTS) with belief tree reuse for growing state space, a neural process network to filter useless primitive actions, and k-center clustering hypersphere discretization for efficient refinement of high-dimensional action spaces. A modified upper-confidence bound (UCB), informed by belief differences and action value functions within cells of estimated diameters, guides MCTS expansion. Theoretical analysis validates the convergence and performance potential of our method. To address scenarios with limited information or rewards, we also introduce a guessed target object with a grid-world model as a key strategy to enhance search efficiency. Extensive Gazebo simulations with Fetch and Stretch robots demonstrate faster and more reliable target localization than POMDP-based baselines and state-of-the-art (SOTA) non-POMDP-based solvers, especially large language model (LLM) based methods, in object search under the same computational constraints and perception systems. Real-world tests in office environments confirm the practical applicability of our approach.
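To give a feel for the k-center idea mentioned in the abstract, here is a minimal sketch of the classic farthest-point heuristic for choosing k representative actions from a set of candidates. This is a generic textbook routine, not the GNPF-kCT solver itself; the function name, the candidate actions, and the choice of the first center are assumptions made for illustration.

```python
import numpy as np

def k_center_greedy(points, k, first=0):
    """Farthest-point heuristic for k-center clustering.

    Returns the indices of the chosen centers and the covering radius,
    i.e. the largest distance from any point to its nearest center.
    """
    centers = [first]
    # Distance from every point to its nearest chosen center so far.
    dist = np.linalg.norm(points - points[first], axis=1)
    while len(centers) < k:
        nxt = int(np.argmax(dist))          # point worst covered so far
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return centers, float(dist.max())

# Hypothetical 2D candidate actions forming three loose groups.
actions = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]])
centers, radius = k_center_greedy(actions, k=3)
```

Each chosen center stands in for every action within `radius` of it, which is one way a continuous action space can be refined into a small discrete set for tree search.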
20 h=4
2026-04-16 cs.RO Wook Ko · h=4
Joonho Koh, Haechan Jung, Nayoung Kim, Wook Ko, Changjoo Nam
Core Contributions
  • Builds a complete dexterous hand teleoperation interface for under $150 from commercial off-the-shelf components — dramatically undercutting MoCap glove systems that cost thousands and require per-operator calibration
  • Achieves operator-agnostic, calibration-free operation with integrated kinesthetic force feedback, enabling immediate deployment across diverse environments and platforms without structural modification
  • Supports an "attached configuration" where the robot hand mounts directly on the operator's forearm, producing robot-aligned demonstration data and reducing perceived workload compared to spatially separated setups across all compared interfaces
  • Operators achieved 86.67% task completion rate across various dexterous manipulation tasks in the attached configuration, with the full hardware and software stack open-sourced for community adoption
Show abstract
Data-driven dexterous hand manipulation requires large-scale, physically consistent demonstration data. Simulation and video-based methods suffer from sim-to-real gaps and retargeting problems, while MoCap glove-based teleoperation systems require per-operator calibration and lack portability, as the robot hand is typically fixed to a stationary arm. Portable alternatives improve mobility but lack cross-platform and cross-operator compatibility. We present DEX-Mouse, a portable, calibration-free hand-held teleoperation interface with integrated kinesthetic force feedback, built from commercial off-the-shelf components under USD 150. The operator-agnostic design requires no calibration or structural modification, enabling immediate deployment across diverse environments and platforms. The interface supports a configuration in which the target robot hand is mounted directly on the forearm of an operator, producing robot-aligned data. In a comparative user study across various dexterous manipulation tasks, operators using the proposed system achieved an 86.67% task completion rate under the attached configuration. Also, we found that the attached configuration reduced the perceived workload of the operators compared to spatially separated teleoperation setups across all compared interfaces. The complete hardware and software stack, including bill of materials, CAD models, and firmware, is open-sourced at https://dex-mouse.github.io/ to facilitate replication and adoption.
🗺️ Navigation & Path Planning
13 h=8
2026-04-16 cs.RO cs.AI cs.NE K. Slimani · h=8
Hibatallah Meliani, Khadija Slimani, Samira Khoulji
Core Contributions
  • Draws directly from neuroscience — using place cells, grid cells, head direction cells, border cells, and speed cells as inputs to an evolving neural network — to model hippocampal spatial cognition for robot path planning
  • Evolves recurrent neural network topologies using NEAT (Neuro-Evolution of Augmenting Topology) that grow in complexity to match the environment, rather than using fixed architectures that may be over- or under-specified
  • Evaluates across both static and dynamic scenarios, showing that the biological navigation cell representation improves NEAT's ability to adapt to complex and varying environments
  • Suggests the approach is well-suited for real-time dynamic path planning in robotics and games, where the environment changes faster than a fixed planner can re-plan
Show abstract
To navigate a space, the brain builds an internal representation of the environment using different cells such as place cells, grid cells, head direction cells, border cells, and speed cells. All these cells, along with sensory inputs, enable an organism to explore the space around it. Inspired by these biological principles, we developed NEATNC, a Neuro-Evolution of Augmenting Topology (NEAT) algorithm guided by Navigation Cells. The goal of the paper is to improve the performance of the NEAT algorithm for path planning in dynamic environments using spatial cognitive cells. This approach uses navigation cells as inputs and evolves recurrent neural networks that represent the hippocampus. The performance of the proposed algorithm is evaluated in different static and dynamic scenarios. This study highlights NEAT's adaptability to complex and varied environments, showcasing the utility of biological theories. This suggests that our approach is well-suited for real-time dynamic path planning for robotics and games.
19 h=4
2026-04-16 cs.RO eess.SY Tianhua Gao · h=4
Jianqiao Yu, Jia Li, Tianhua Gao
Core Contributions
  • Develops Enhanced Tube-RRT* with active hybrid sampling and adaptive expansion that achieves higher success and effective sampling rates than STube-RRT* and AETube-RRT* in densely cluttered environments
  • Explicitly incorporates trajectory smoothness cost into the edge cost function to reduce excessive turns — directly mitigating cable-induced oscillations that plague tethered multi-UAV payload systems
  • Formulates a convex quadratic program in Stage II that jointly considers payload translational/rotational dynamics, cable tension constraints, and collision safety to produce smooth, collision-free payload trajectories
  • Validates the complete two-stage framework with centralized geometric control, demonstrating practical feasibility for payload attitude maneuvering in dense obstacle fields where existing methods struggle
Show abstract
This paper presents a two-stage trajectory planning framework for a multi-UAV rigid-payload cascaded transportation system, aiming to address planning challenges in densely cluttered environments. In Stage I, an Enhanced Tube-RRT* algorithm is developed by integrating active hybrid sampling and an adaptive expansion strategy, enabling rapid generation of a safe and feasible virtual tube in environments with dense obstacles. Moreover, a trajectory smoothness cost is explicitly incorporated into the edge cost to reduce excessive turns and thereby mitigate cable-induced oscillations. Simulation results demonstrate that the proposed Enhanced Tube-RRT* achieves a higher success rate and effective sampling rate than mixed-sampling Tube-RRT* (STube-RRT*) and adaptive-extension Tube-RRT* (AETube-RRT*), while producing a shorter optimal path with a smaller cumulative turning angle. In Stage II, a convex quadratic program is formulated by considering payload translational and rotational dynamics, cable tension constraints, and collision-safety constraints, yielding a smooth, collision-free desired payload trajectory. Finally, a centralized geometric control scheme is applied to the cascaded system to validate the effectiveness and feasibility of the proposed planning framework, offering a practical solution for payload attitude maneuvering in densely cluttered environments.
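The idea of folding a smoothness term into the edge cost can be sketched as follows. This is a minimal 2D illustration under stated assumptions: the cost is edge length plus a weighted turning angle at the parent node, and the weight `w_turn` and both function names are hypothetical rather than taken from the paper.

```python
import math

def turning_angle(p_prev, p_cur, p_next):
    """Absolute heading change in radians when passing through p_cur."""
    h_in = math.atan2(p_cur[1] - p_prev[1], p_cur[0] - p_prev[0])
    h_out = math.atan2(p_next[1] - p_cur[1], p_next[0] - p_cur[0])
    d = abs(h_out - h_in)
    return min(d, 2 * math.pi - d)  # wrap into [0, pi]

def edge_cost(p_prev, p_cur, p_next, w_turn=1.0):
    """Cost of extending the tree from p_cur to p_next.

    p_prev is the parent of p_cur (None at the root, where no turn exists).
    """
    length = math.dist(p_cur, p_next)
    if p_prev is None:
        return length
    return length + w_turn * turning_angle(p_prev, p_cur, p_next)

straight = edge_cost((0, 0), (1, 0), (2, 0))   # no heading change
right_angle = edge_cost((0, 0), (1, 0), (1, 1))  # 90 degree turn
```

With a cost of this shape, rewiring in an RRT*-style planner prefers parents that keep the heading steady, which is the mechanism the abstract credits with fewer excessive turns.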
22 h=3
2026-04-16 cs.RO Yongbin Yu · h=3
Yuting Zeng, Zhiwen Zheng, Jingya Wang, You Zhou, JiaLing Xiao
Core Contributions
  • Introduces momentum constraints into heuristic trajectory optimization to suppress abrupt velocity and acceleration changes — directly addressing the comfort requirements of assistive navigation for visually impaired users
  • A residual-enhanced DRL module refines candidate trajectories, combining the coverage of heuristic sampling with the temporal modeling and generalization capabilities of learned policies
  • Proposes a dual-stage cost mechanism: Frenet-space costs ensure trajectory consistency while Cartesian-space reward-driven adaptive weights integrate user preferences for interpretable, user-centric decision-making
  • Converges in nearly half the iterations of baselines while achieving lower and more stable costs, with stable velocity/acceleration profiles and reduced risk in complex dynamic scenarios
Show abstract
Safe and efficient assistive planning for visually impaired scenarios remains challenging, since existing methods struggle with multi-objective optimization, generalization, and interpretability. In response, this paper proposes a Momentum-Constrained Hybrid Heuristic Trajectory Optimization Framework (MHHTOF). To balance multiple objectives of comfort and safety, the framework designs a Heuristic Trajectory Sampling Cluster (HTSC) with a Momentum-Constrained Trajectory Optimization (MTO), which suppresses abrupt velocity and acceleration changes. In addition, a novel residual-enhanced deep reinforcement learning (DRL) module refines candidate trajectories, advancing temporal modeling and policy generalization. Finally, a dual-stage cost modeling mechanism (DCMM) is introduced to regulate optimization, where costs in the Frenet space ensure consistency, and reward-driven adaptive weights in the Cartesian space integrate user preferences for interpretability and user-centric decision-making. Experimental results show that the proposed framework converges in nearly half the iterations of baselines and achieves lower and more stable costs. In complex dynamic scenarios, MHHTOF further demonstrates stable velocity and acceleration curves with reduced risk, confirming its advantages in robustness, safety, and efficiency.
26 h=1
2026-04-16 cs.RO cs.AI math.OC Gonzalo A. Ruz · h=1
Carlos S. Sepúlveda, Gonzalo A. Ruz
Core Contributions
  • Provides a reproducible benchmark of 17 deterministic heuristics from 7 families on 10,000 Hamiltonian-feasible hexagonal graph instances — filling a gap where classical coverage methods were typically compared on small ad hoc examples or rectangular grids
  • Reveals that the strongest classical Hamiltonian baseline is a Warnsdorff variant using index-based tie-breaking with a terminal-inclusive residual-degree policy, reaching 79.0% Hamiltonian success — and that this underreported implementation detail materially affects performance
  • Shows that heuristics with explicit shortest-path reconnection reliably solve relaxed coverage but almost never produce zero-revisit tours, highlighting a fundamental limitation of greedy approaches on sparse geometric graphs with bottlenecks
  • Targets maritime surveillance, search-and-rescue, and environmental monitoring scenarios where hexagonal grids naturally model operational areas, providing the community with a controlled testbed for heuristic analysis
Show abstract
Coverage path planning on irregular hexagonal grids is relevant to maritime surveillance, search and rescue and environmental monitoring, yet classical methods are often compared on small ad hoc examples or on rectangular grids. This paper presents a reproducible benchmark of deterministic single-vehicle coverage path planning heuristics on irregular hexagonal graphs derived from synthetic but maritime-motivated areas of interest. The benchmark contains 10,000 Hamiltonian-feasible instances spanning compact, elongated, and irregular morphologies, 17 heuristics from seven families, and a common evaluation protocol covering Hamiltonian success, complete-coverage success, revisits, path length, heading changes, and CPU latency. Across the released dataset, heuristics with explicit shortest-path reconnection solve the relaxed coverage task reliably but almost never produce zero-revisit tours. Exact Depth-First Search confirms that every released instance is Hamiltonian-feasible. The strongest classical Hamiltonian baseline is a Warnsdorff variant that uses an index-based tie-break together with a terminal-inclusive residual-degree policy, reaching 79.0% Hamiltonian success. The dominant design choice is not tie-breaking alone, but how the residual degree is defined when the endpoint is reserved until the final move. This shows that underreported implementation details can materially affect performance on sparse geometric graphs with bottlenecks. The benchmark is intended as a controlled testbed for heuristic analysis rather than as a claim of operational optimality at fleet scale.
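For readers unfamiliar with the Warnsdorff family, the sketch below shows the basic rule with an index-based tie-break on a tiny seven-cell hexagonal patch. It is a plain version of the heuristic: the terminal-inclusive residual-degree policy studied in the paper is not reproduced here, and the graph and function name are hypothetical.

```python
def warnsdorff_path(adj, start):
    """Greedy walk that always moves to the unvisited neighbor with the
    fewest unvisited neighbors of its own; ties go to the lowest index.

    The walk is a Hamiltonian path exactly when it visits every node.
    """
    visited = {start}
    path = [start]
    cur = start
    while True:
        options = [n for n in adj[cur] if n not in visited]
        if not options:
            break

        def residual_degree(n):
            return sum(1 for m in adj[n] if m not in visited)

        cur = min(options, key=lambda n: (residual_degree(n), n))
        visited.add(cur)
        path.append(cur)
    return path

# Seven hexagonal cells: center 0 surrounded by a ring 1..6.
hex_adj = {0: [1, 2, 3, 4, 5, 6]}
for i in range(1, 7):
    left = 6 if i == 1 else i - 1
    right = 1 if i == 6 else i + 1
    hex_adj[i] = [0, left, right]

tour = warnsdorff_path(hex_adj, start=1)
```

On this patch the walk covers all seven cells without revisits. On irregular graphs with bottlenecks the same rule can dead-end early, which is why details such as tie-breaking and the residual-degree definition matter in the benchmark.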
🦿 Humanoid & Legged Locomotion
9 h=13
2026-04-16 cs.RO Yinhuai Wang · h=13
Yuen-Fui Lau, Qihan Zhao, Yinhuai Wang, Runyi Yu, Hok Wai Tsui
Core Contributions
  • Addresses a critical safety gap in humanoid locomotion: existing approaches train individual agile skills well but struggle with flexible transitions between them, creating dangerous instability during skill changes
  • Builds a Skill Graph from kinematic similarity within multi-skill motion data, establishing which cross-skill transitions are physically feasible before training — rather than discovering transitions through trial and error
  • An online skill scheduler performs real-time graph search to find optimal feasible transition paths when switching skills or recovering from tracking deviations, ensuring stable execution without offline pre-computation
  • Demonstrates high success rates for agile skill transitions while maintaining strong motion imitation performance — showing that transition quality and individual skill quality are not in fundamental conflict
Show abstract
Recent advancements in whole-body control through deep reinforcement learning have enabled humanoid robots to achieve remarkable progress in challenging real-world locomotion skills. However, existing approaches often struggle with flexible transitions between distinct skills, creating safety concerns and practical limitations. To address this challenge, we introduce a hierarchical multi-skill system, Switch, enabling seamless skill transitions at any moment. Our approach comprises three key components: (1) a Skill Graph (SG) that establishes potential cross-skill transitions based on kinematic similarity within multi-skill motion data, (2) a whole-body tracking policy trained on this skill graph through deep reinforcement learning, and (3) an online skill scheduler to drive the tracking policy for robust skill execution and smooth transitions. For skill switching or significant tracking deviations, the scheduler performs online graph search to find the optimal feasible path, which ensures efficient, stable, and real-time execution of diverse locomotion skills. Comprehensive experiments demonstrate that Switch empowers humanoids to execute agile skill transitions with high success rates while maintaining strong motion imitation performance.
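The online graph search in the scheduler can be pictured with a minimal sketch. It assumes a skill graph whose nodes are whole skills and whose edges mark feasible transitions, and it takes "optimal" to mean the fewest transitions. The abstract does not specify the node granularity or the path cost, so the graph, the skill names, and the function below are all hypothetical.

```python
from collections import deque

def shortest_transition(graph, src, dst):
    """Breadth-first search for the path with the fewest transitions.

    Returns the list of skills from src to dst, or None if dst is unreachable.
    """
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:   # walk parents back to the source
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in graph.get(node, ()):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

# Hypothetical skill graph: an edge means the transition is kinematically feasible.
skill_graph = {
    "walk": ["jog", "turn"],
    "jog": ["walk", "run"],
    "run": ["jog", "jump"],
    "turn": ["walk"],
    "jump": ["run"],
    "crawl": [],
}
plan = shortest_transition(skill_graph, "walk", "jump")
```

A request to go from walking straight to jumping is routed through jogging and running, which mirrors the idea of reaching a target skill through intermediate, physically feasible transitions.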
15 h=6
2026-04-16 cs.RO eess.SY Tomoya Kamimura · h=6
Tomoya Kamimura, Haruka Washiyama, Akihito Sano
Core Contributions
  • Shows that biped robots with passive elements (springs) trained via model-based deep RL converge to stable limit cycles through dynamic interaction with the ground — the attractor-driven learning produces locomotion that is both robust and energy-efficient
  • Compares robots with and without passive elements in simulation, finding that passive dynamics fundamentally change the learning landscape: trajectories converge quickly to limit cycles but take longer to achieve high rewards, suggesting a trade-off between stability and reward optimization speed
  • Argues that implementing passive properties in the robot body is important for future embodied AI, since the stable limit cycles arising from body-ground interaction made the acquired locomotion robust and energy-efficient
  • Generates both walking and running locomotion with the same model-based deep RL framework
Show abstract
Embodiment is a significant keyword in recent machine learning fields. This study focused on the passive nature of the body of a biped robot to generate walking and running locomotion using model-based deep reinforcement learning. We constructed two models in a simulator, one with passive elements (e.g., springs) and the other, which is similar to general humanoids, without passive elements. The training of the model with passive elements was highly affected by the attractor of the system. As a result, although the trajectories quickly converged to limit cycles, it took a long time to obtain large rewards. However, thanks to the attractor-driven learning, the acquired locomotion was robust and energy-efficient. The results revealed that robots with passive elements could efficiently acquire high-performance locomotion by utilizing stable limit cycles generated through dynamic interaction between the body and ground. This study demonstrates the importance of implementing passive properties in the body for future embodied AI.
🛡️ Sim2Real, Control & Safety
5 h=22
2026-04-16 cs.RO Josiah P. Hanna · h=22
Yunfu Deng, Yuhao Li, Josiah P. Hanna
Core Contributions
  • Formalizes the "abstract sim2real" problem — transferring policies from simulators that deliberately omit key task details — using the language of state abstraction from RL theory, providing a principled framework rather than ad hoc domain randomization
  • Shows theoretically that an abstract simulator can be grounded to match the target task if the grounded dynamics take the history of states into account, connecting approximate information states to sim2real transfer guarantees
  • Introduces a practical method that uses real-world task data to correct the dynamics of the abstract simulator, enabling successful policy transfer even when the simulator's abstraction level is intentionally coarse
  • Validates the approach in both sim2sim and sim2real settings, demonstrating that the formalism translates to practical improvements when detailed simulators are unavailable or prohibitively expensive to build
Show abstract
In recent years, reinforcement learning (RL) has shown remarkable success in robotics when a fast and accurate simulator is available for a given task. When using RL and simulation, more simulator realism is generally beneficial but becomes harder to obtain as robots are deployed in increasingly complex and wide-scale domains. In such settings, simulators will likely fail to model all relevant details of a given target task, and this observation motivates the study of sim2real with simulators that leave out key task details. In this paper, we formalize and study the abstract sim2real problem: given an abstract simulator that models a target task at a coarse level of abstraction, how can we train a policy with RL in the abstract simulator and successfully transfer it to the real world? Our first contribution is to formalize this problem using the language of state abstraction from the RL literature. This framing shows that an abstract simulator can be grounded to match the target task if the grounded abstract dynamics take the history of states into account. Based on the formalism, we then introduce a method that uses real-world task data to correct the dynamics of the abstract simulator. We then show that this method enables successful policy transfer both in sim2sim and sim2real evaluation.
11 h=9
2026-04-16 cs.RO cs.CV Jakob Thumm · h=9
Jakob Thumm, Marian Frei, Tianle Ni, Matthias Althoff, Marco Pavone
Core Contributions
  • Combines aleatoric uncertainty estimation with out-of-distribution detection for vision-based human pose estimation, achieving high probabilistic confidence that is essential for certifiable safety — unlike approaches that estimate only one type of uncertainty
  • Proposes conformal prediction sets for human motion predictions with high, provably valid confidence intervals, enabling integration with formal safety verification frameworks
  • Bridges the gap between vision-based perception (which is inherently uncertain) and certifiable safety frameworks (which require guaranteed bounds), making formal safety assurances practical for real-world human-robot collaboration
  • Evaluates the pipeline on both recorded human motion data and a real-world human-robot collaboration (HRC) setting
Show abstract
We propose a framework for vision-based human pose estimation and motion prediction that gives conformal prediction guarantees for certifiably safe human-robot collaboration. Our framework combines aleatoric uncertainty estimation with OOD detection for high probabilistic confidence. To integrate our pipeline in certifiable safety frameworks, we propose conformal prediction sets for human motion predictions with high, valid confidence. We evaluate our pipeline on recorded human motion data and a real-world human-robot collaboration setting.
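As background for the conformal prediction sets mentioned above, the sketch below shows standard split conformal calibration for a scalar error. It assumes the nonconformity score is the Euclidean error between predicted and true joint position on a held-out calibration set; the score, the units, and the numbers are hypothetical, and the paper's actual construction may differ.

```python
import math
import numpy as np

def conformal_radius(cal_errors, alpha):
    """Split conformal quantile of calibration errors.

    With n exchangeable calibration errors, a ball of this radius around a new
    prediction contains the true value with probability at least 1 - alpha.
    Returns infinity when n is too small to support the requested level.
    """
    n = len(cal_errors)
    k = math.ceil((n + 1) * (1 - alpha))   # rank of the required order statistic
    if k > n:
        return float("inf")
    return float(np.sort(cal_errors)[k - 1])

# Hypothetical calibration errors in centimeters (19 held-out samples).
cal_errors_cm = np.array([7, 3, 19, 1, 12, 5, 16, 9, 2, 14, 18, 4, 11, 6, 17, 8, 13, 10, 15])
radius_cm = conformal_radius(cal_errors_cm, alpha=0.25)
```

A downstream safety monitor could then inflate each predicted joint position by `radius_cm` to obtain a set that covers the true position at the chosen confidence level, under the exchangeability assumption.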
23 h=3
2026-04-16 cs.RO cs.LG M. Srinivas · h=3
Radhika Khatri, Adit Tewari, Nikhil Sharma, M. B. Srinivas
Core Contributions
  • Integrates a YOLOv8-based robotic waste segregation system (MyCobot 280 + Jetson Nano) with an optimized bio-digestor in a single end-to-end framework — connecting the sorting and processing stages that are typically treated separately
  • Achieves 98% sorting accuracy across four waste categories using ROS-based path planning with real-time YOLOv8 detection, reducing the need for manual intervention in the sorting stage
  • Monitors the bio-digestor with sensors for temperature, pH, pressure, and motor RPM, and uses Particle Swarm Optimization combined with a regression model to dynamically adjust system parameters, aiming to maximize digestion efficiency under varying environmental conditions
  • Provides a scalable solution suitable for both residential and industrial applications, addressing the growing challenge of municipal waste management driven by rapid urbanization
Abstract
Rapid urbanization and continuous population growth have made municipal solid waste management increasingly challenging. These challenges highlight the need for smarter and automated waste management solutions. This paper presents the design and evaluation of an integrated waste management framework that combines two connected systems, a robotic waste segregation module and an optimized bio-digestor. The robotic waste segregation system uses a MyCobot 280 Jetson Nano robotic arm along with YOLOv8 object detection and robot operating system (ROS)-based path planning to identify and sort waste in real time. It classifies waste into four different categories with high precision, reducing the need for manual intervention. After segregation, the biodegradable waste is transferred to a bio-digestor system equipped with multiple sensors. These sensors continuously monitor key parameters, including temperature, pH, pressure, and motor revolutions per minute. The Particle Swarm Optimization (PSO) algorithm, combined with a regression model, is used to dynamically adjust system parameters. This intelligent optimization approach ensures stable operation and maximizes digestion efficiency under varying environmental conditions. System testing under dynamic conditions demonstrates a sorting accuracy of 98% along with highly efficient biological conversion. The proposed framework offers a scalable, intelligent, and practical solution for modern waste management, making it suitable for both residential and industrial applications.
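To make the optimization step in the abstract above concrete, here is a minimal Particle Swarm Optimization sketch that searches for a setpoint maximizing a predicted efficiency. The regression model `predicted_efficiency`, its peak at 37 °C and pH 7, the parameter bounds, and the PSO coefficients are hypothetical placeholders and are not taken from the paper.

```python
import random


def predicted_efficiency(temperature_c, ph):
    """Hypothetical stand-in for a fitted regression model (not from the paper)."""
    return 1.0 - ((temperature_c - 37.0) / 20.0) ** 2 - ((ph - 7.0) / 2.0) ** 2


def pso_maximize(objective, bounds, n_particles=20, n_iters=100, seed=0,
                 inertia=0.7, c_personal=1.5, c_global=1.5):
    """Basic global-best PSO over box-bounded parameters."""
    rng = random.Random(seed)
    dim = len(bounds)
    positions = [[rng.uniform(lo, hi) for lo, hi in bounds]
                 for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]
    personal_best = [p[:] for p in positions]
    personal_score = [objective(*p) for p in positions]
    best_idx = max(range(n_particles), key=lambda i: personal_score[i])
    global_best = personal_best[best_idx][:]
    global_score = personal_score[best_idx]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                velocities[i][d] = (
                    inertia * velocities[i][d]
                    + c_personal * r1 * (personal_best[i][d] - positions[i][d])
                    + c_global * r2 * (global_best[d] - positions[i][d])
                )
                # Keep each parameter inside its allowed operating range.
                positions[i][d] = min(hi, max(lo, positions[i][d] + velocities[i][d]))
            score = objective(*positions[i])
            if score > personal_score[i]:
                personal_best[i], personal_score[i] = positions[i][:], score
                if score > global_score:
                    global_best, global_score = positions[i][:], score
    return global_best, global_score


# Search over assumed ranges: temperature in degrees Celsius, then pH.
best_setpoint, best_efficiency = pso_maximize(
    predicted_efficiency, bounds=[(20.0, 60.0), (5.0, 9.0)]
)
```

In a deployed system the objective would be a model fitted to sensor logs and the search would be repeated as conditions change. The sketch only shows the search loop itself.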
25 h=2
2026-04-16 eess.SY cs.RO Henrik Krauss · h=2
Johannes Kübel, Henrik Krauss, Jinjie Li, Moju Zhao
Core Contributions
  • Proposes a physics-inspired energy-based regularization loss that encourages the neural residual dynamics model to produce control corrections that stabilize the system's energy — injecting physical priors that standard neural MPC training lacks
  • Improves positional MAE by 23% over analytical MPC across three real-world omnidirectional aerial robot experiments, demonstrating that learned residual dynamics provide meaningful corrections to the nominal model
  • Achieves up to 15% lower MAE and significantly increased flight stability compared to a standard neural MPC implementation without regularization
  • Attributes the stability gain to the energy regularization, which acts implicitly through the training loss of the residual model rather than through an explicit constraint in the MPC optimization
Abstract
Data-driven Model Predictive Control (MPC) has lately been the core research subject in the field of control theory. The combination of an optimal control framework with deep learning paradigms opens up the possibility to accurately track control tasks without the need for complex analytical models. However, the system dynamics are often nuanced and the neural model lacks the potential to understand physical properties such as inertia and conservation of energy. In this work, we propose a novel energy-based regularization loss function which is applied to the training of a neural model that learns the residual dynamics of an omnidirectional aerial robot. Our energy-based regularization encourages the neural network to cause control corrections that stabilize the energy of the system. The residual dynamics are integrated into the MPC framework and improve the positional mean absolute error (MAE) over three real-world experiments by 23% compared to an analytical MPC. We also compare our method to a standard neural MPC implementation without regularization and primarily achieve a significantly increased flight stability implicitly due to the energy regularization and up to 15% lower MAE. Our code is available under: https://github.com/johanneskbl/jsk_aerial_robot/tree/develop/neural_MPC.
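The abstract above does not give the exact form of the energy-based regularization, so the following is only one plausible sketch of the idea. It assumes a point-mass kinetic energy, a residual model that outputs an acceleration correction, and a penalty on corrections that inject energy. The function name, the `mass` and `weight` parameters, and the penalty form are all illustrative assumptions and not the authors' loss.

```python
def energy_regularized_loss(pred_residual_acc, target_residual_acc, velocity,
                            mass=1.0, weight=0.1):
    """MSE on the residual acceleration plus a penalty on energy injection.

    Assumed form (not taken from the paper): the residual changes the kinetic
    energy 0.5 * m * |v|^2 at rate m * (v . a_res), and only positive rates,
    i.e. corrections that add energy, are penalised.
    """
    n = len(pred_residual_acc)
    mse = sum((p - t) ** 2
              for p, t in zip(pred_residual_acc, target_residual_acc)) / n
    power = mass * sum(v * a for v, a in zip(velocity, pred_residual_acc))
    energy_penalty = max(0.0, power) ** 2
    return mse + weight * energy_penalty


# A correction aligned with the velocity adds energy and is penalised.
loss_accelerating = energy_regularized_loss(
    [1.0, 0.0, 0.0], [1.0, 0.0, 0.0], velocity=[2.0, 0.0, 0.0])

# The same correction against the velocity removes energy and is not penalised.
loss_damping = energy_regularized_loss(
    [1.0, 0.0, 0.0], [1.0, 0.0, 0.0], velocity=[-2.0, 0.0, 0.0])
```

In actual training this term would be written in an autodiff framework and averaged over a batch. The pure-Python version is meant only to show how a physical prior can enter the loss alongside the usual regression error.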