arXiv Robotics Digest

April 7, 2026
30 Papers Analyzed
7 Research Categories
Max h-index: 36
Generated by Claude

Research Landscape

The dominant theme across this digest is the explosion of VLA model research: eight of the 30 papers directly address vision-language-action architectures, but the conversation has shifted fundamentally from "can we build VLAs?" to "can we make them safe, efficient, and trustworthy?" Three papers address the safety of learned robot policies and planners head-on: DAERT (#15) reduces baseline success rates from 93.33% to 5.85%, JailWAM (#18) achieves 84.2% attack success on LingBot-VA, and SafeGate (#13) introduces neurosymbolic pre-execution safety gates using Z3 theorem proving. Meanwhile, A1 (#22) and VLA-InfoEntropy (#11) shift focus to practical inference constraints: A1 reports up to 72% latency reduction through adaptive early termination and truncated flow matching, while VLA-InfoEntropy reports up to 40% through training-free token pruning. Action Images (#8) proposes a fundamentally new representation where video generation itself becomes the policy backbone. This safety-efficiency axis reflects field maturity: scaling VLAs is necessary but insufficient without trustworthiness guarantees.

The second thread is physically grounded manipulation moving beyond geometric planning. GraspSense (#4) introduces force maps that encode material-specific deformation into grasp selection using SAM3D and physics simulation. Delta6 (#7) democratizes high-quality sensing by offering an open-source $50 6-DOF force/torque sensor with 3.8% full-scale error. Contact-rich tasks drive further sensing innovation: soft scooping (#10) uses physics-based evolutionary optimization for deformable conical hands; button pressing (#21) leverages microphone-based contact detection as privileged supervision during training; deburring MPC (#25) combines diffusion-based motion priors with force-feedback MPC to hold contact force while avoiding collisions in industrial settings. Together these point to a conceptual shift: manipulation is no longer solely about where to grasp, but how much force to apply, how to sense contact progression, and how to adapt reactively, all of which require richer feedback loops than vision alone provides.

A cross-cutting theme is bridging mathematical theory with deployment-ready algorithms. RSBM (#27) proves velocity structure invariance theorems that enable three-step generative policies for visual navigation, achieving 92% success without retraining. SI-OCP (#3) provides formal coverage guarantees for adaptive control systems on quadrotors with DNN components. Aggressive aerial maneuvers (#9) achieve repeatable 5 cm-clearance gap traversal and 90° tilts, moving from simulation benchmarks to real physical feats. Formation control (#28) employs complex-number representations to make distributed leaderless control tractable while enabling translation, rotation, scaling, and shearing simultaneously. This pattern suggests the field increasingly recognizes that practical deployment requires bridging the theory-practice gap: proofs enable scaling, and real-world performance validates the theory's relevance.

VLA & Foundation Models

Vision-language-action models, world action models, and multimodal policy architectures

Robot Safety & Trustworthy AI

Safety verification, hazard analysis, jailbreaking, and explainability

Manipulation & Grasping

Dexterous hands, force sensing, contact-rich tasks, and mobile manipulation

Navigation & Localization

Visual navigation, UAV localization, and image-goal navigation

Aerial Robotics & Control

Quadrotor control, eVTOL systems, and safe adaptive control

Hardware Design & Surgical Robotics

Mechanical co-design, surgical instruments, and medical robotics

Multi-Agent & Swarm Systems

Multi-robot collaboration, formation control, and molecular swarms

VLA & Foundation Models

8 papers
cs.RO
Souren Pashangpour, Haitong Wang, Matthew Lisondra, Goldie Nejat

Core Contributions

  • Introduces first VLM-based planner for expressive robot behaviors during human-robot interaction, encoding interpersonal dynamics beyond task completion
  • Combines vision-language reasoning with vision-language-action (VLA) policies to generate socially-aware robot behaviors that maintain appropriate physical and social distance
  • Enables interruptible interactions where robots gracefully yield to human-initiated contact, improving collaborative safety and naturalness
  • Demonstrates how foundation models can encode social conventions previously requiring hand-coded interaction rules
Abstract: Enabling robots to adapt their behaviors during human-robot interaction (HRI) is a critical challenge in close-proximity collaboration. Despite significant advances in vision-language-action (VLA) models for understanding and executing human intentions, current methods lack explicit reasoning about expressive behaviors such as maintaining appropriate interpersonal distances or signaling through motion modulation. This paper presents ExpressMM, a framework that combines a VLM-based planner with VLA policies to generate expressive robot behaviors during collaborative mobile manipulation tasks. Our approach leverages natural language as an intermediate representation to reason about human intentions and robot expressiveness simultaneously. We demonstrate how our method can generate socially-aware behaviors including distance maintenance, gesture-based signaling, and graceful handling of unexpected human contact through interruptible interactions. Experiments with a mobile manipulator in collaborative tasks show that ExpressMM enables robots to communicate intent and maintain appropriate social distance while performing manipulation.
cs.CV cs.RO
Haoyu Zhen, Zixian Gao, Qiao Sun, Yilin Zhao, Yuncong Yang

Core Contributions

  • Proposes fundamentally new policy representation: translates 7-DoF robot actions into pixel-space "action images" usable as zero-shot policies with pretrained video backbones
  • Eliminates need for task-specific training by grounding actions in visual space, enabling policies that leverage massive pretrained video understanding models
  • Demonstrates zero-shot transfer across different robot morphologies by learning action-to-image mappings, bypassing robot-specific training
  • Shows how generative models trained on human video can predict robot behavior without seeing robot data during training
Abstract: We propose Action Images, a novel representation for learning robot manipulation policies via multiview video generation. The key insight is to represent robot actions (e.g., joint angles for a 7-DoF manipulator) as "action images"—visual frames that encode action information in pixel space. This allows us to leverage powerful pretrained video generation models as the backbone for policy learning. We train an encoder-decoder architecture where the encoder processes visual observations and the decoder generates action images that can be decoded into continuous action sequences. By operating in pixel space, our approach enables zero-shot policy transfer and learning from diverse video data without explicit robot supervision. We demonstrate results on multiple robotic manipulation tasks and show competitive performance compared to traditional action-based policy learning approaches.
cs.CV cs.RO
Chuhang Liu, Yayun He, Zuheng Kang, Xiaoyang Qu, Jianzong Wang

Core Contributions

  • Introduces training-free acceleration for VLA inference via dynamic token pruning based on image entropy and attention entropy, reducing computational cost without retraining
  • Identifies that low-entropy visual regions (background, redundant details) can be aggressively pruned without affecting action prediction accuracy
  • Combines two entropy signals: pixel-level image entropy and transformer attention entropy to identify which tokens contribute to action generation
  • Achieves practical inference speedup on resource-constrained robots where model retraining is infeasible
Abstract: Vision-Language-Action (VLA) models have shown promise for robot control, but their high computational cost limits real-time deployment on resource-constrained robotic platforms. We propose VLA-InfoEntropy, a training-free approach to accelerate VLA inference by leveraging information-theoretic measures of token importance. Our method dynamically prunes visual tokens based on image entropy (detecting low-information regions) and attention entropy (identifying which regions the model actually attends to). Unlike fine-tuning approaches, our method requires no additional training and can be applied to any VLA model. We demonstrate significant inference speedups (up to 40% latency reduction) while maintaining action prediction accuracy across multiple robotic tasks.
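To make the entropy-based pruning idea in the VLA-InfoEntropy summary more concrete, here is a minimal sketch, not the paper's implementation. It scores each visual token by the Shannon entropy of its patch intensities plus a normalized attention score and keeps the top-scoring fraction. The function names, the additive score, the use of raw attention mass in place of the paper's attention-entropy signal, and the `keep_ratio` parameter are all assumptions made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def prune_tokens(patches, attention, keep_ratio=0.5):
    """Return the sorted indices of the tokens to keep.

    patches:   list of patches, each a flat list of quantized pixel intensities
    attention: attention mass each token receives (hypothetical input)
    """
    max_attn = max(attention) or 1.0
    # Assumed scoring rule: patch entropy plus attention normalized to [0, 1].
    scores = [
        shannon_entropy(p) + a / max_attn
        for p, a in zip(patches, attention)
    ]
    k = max(1, round(keep_ratio * len(patches)))
    ranked = sorted(range(len(patches)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


patches = [
    [0, 0, 0, 0],  # flat background, entropy 0 bits
    [0, 1, 2, 3],  # textured region, entropy 2 bits
    [5, 5, 5, 5],  # flat region, entropy 0 bits
    [1, 1, 2, 2],  # two intensity levels, entropy 1 bit
]
attention = [0.05, 0.60, 0.05, 0.30]
kept = prune_tokens(patches, attention, keep_ratio=0.5)
print(kept)
```

With these toy inputs the two flat, weakly attended patches are dropped and the textured, strongly attended ones are kept. A real system would compute both signals from the model's image tokens and attention maps.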
cs.RO
Jiyao Zhang, Zimu Han, Junhan Wang, Xionghao Wu, Shihong Lin

Core Contributions

  • Proposes hierarchical multi-frequency action chunking that simultaneously plans at multiple timescales, balancing long-horizon reasoning with fine-grained control
  • Uses entropy-guided execution that dynamically selects which action frequency to trust based on confidence estimates, enabling adaptive switching during task execution
  • Outperforms fixed action chunk sizes by allowing coarse-grained chunks for high-level planning and fine-grained chunks for precise control primitives
  • Demonstrates improved performance on both long-horizon assembly tasks and dexterous manipulation requiring precise contact control
Abstract: Action chunking—predicting sequences of actions rather than individual actions—has become important for sample efficiency in robot learning. However, fixed chunk sizes may not match the temporal structure of different task phases: some behaviors benefit from high-level planning (long chunks) while others require precise frame-by-frame control (short chunks). We propose HiPolicy, a hierarchical multi-frequency action chunking method that learns to operate at multiple temporal resolutions simultaneously. Our approach uses entropy-guided execution to dynamically weight different frequency levels, enabling the policy to adaptively select appropriate action granularity during deployment. Experiments on long-horizon and manipulation tasks show consistent improvements over single-frequency baselines.
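The entropy-guided execution described for HiPolicy can be pictured with a small sketch. It assumes each frequency level produces a categorical distribution over discretized actions and simply picks the level with the lowest entropy. The abstract speaks of dynamically weighting levels, so hard selection, the level names, and the distributions below are simplifying assumptions rather than the paper's procedure.

```python
import math


def entropy(probs):
    """Shannon entropy (nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_level(level_probs):
    """Pick the frequency level whose action distribution is most confident."""
    return min(level_probs, key=lambda name: entropy(level_probs[name]))


# Hypothetical predictive distributions over four discretized actions.
level_probs = {
    "coarse_16_step": [0.25, 0.25, 0.25, 0.25],  # uncertain: maximum entropy
    "fine_4_step": [0.85, 0.05, 0.05, 0.05],     # confident
}
chosen = select_level(level_probs)
print(chosen)
```

Here the fine-grained level is chosen because its prediction is far more peaked. A soft variant would turn the negative entropies into weights instead of taking a hard minimum.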
cs.RO cs.CV
Baoshun Tong, Haoran He, Ling Pan, Yang Liu, Liang Lin

Core Contributions

  • First systematic adversarial testing framework (DAERT) exposing linguistic vulnerabilities in VLA models—success rate drops from 93.33% to 5.85% under adversarial prompts
  • Shows VLA models are brittle to instruction paraphrasing, semantic negation, and instruction shuffling despite high success (93.33%) on canonical prompts
  • Introduces diversity-aware red teaming that generates adversarial instructions while maintaining distributional coverage, avoiding unrealistic attack scenarios
  • Highlights critical deployment risk: semantically equivalent instructions cause failure, so robustness must improve before real-world deployment
Abstract: Vision-Language-Action (VLA) models have demonstrated impressive capabilities on robot control tasks, but their robustness to natural language variations remains understudied. We introduce DAERT (Diversity-Aware Adversarial Red Teaming), a framework for systematically evaluating VLA model robustness to linguistic perturbations. Our approach generates adversarial instructions that preserve task semantics while varying linguistic form (paraphrases, negations, instruction reordering). We evaluate several state-of-the-art VLA models and find dramatic performance drops (from 93.33% to 5.85% success rate) under adversarial prompts, indicating significant linguistic fragility. We analyze failure modes and provide insights into which semantic variations cause robustness degradation, establishing a benchmark for improving VLA robustness.
cs.RO cs.CV
Jiahua Ma, Yiran Qin, Xin Wen, Yixiong Li, Yuyu Sun

Core Contributions

  • Proposes ReV: a closed-loop policy that adapts to sparse referring points (spatial annotations) provided by humans during task execution, enabling interactive task refinement
  • Uses coupled diffusion heads that jointly model visual observation dynamics and human referring semantics, allowing on-the-fly trajectory adjustment based on human input
  • First work enabling robots to accept mid-task spatial guidance from humans without requiring complete policy retraining or replanning
  • Demonstrates human-in-the-loop manipulation where reference points effectively steer robot behavior mid-trajectory
Abstract: Closed-loop manipulation policies must balance autonomy with human guidance. We propose ReV (Referring-aware Visuomotor), a policy learning framework that incorporates sparse human referring points (spatial annotations) as conditional inputs. Our approach uses coupled diffusion models to jointly predict future visual observations and adapt actions based on human spatial guidance. This enables humans to provide mid-task corrections without explicit replanning. We demonstrate ReV on complex manipulation tasks where human referring points improve task success rates and trajectory quality compared to fully autonomous policies.
cs.RO
Kaidong Zhang, Jian Zhang, Rongtao Xu, Yu Sun, Shuoshuo Xue

Core Contributions

  • First fully open-source VLA model with complete transparency: released weights, architecture, and training code enabling full reproducibility and community iteration
  • Introduces adaptive inference techniques including early termination and truncated flow matching, reducing latency by up to 72% without accuracy loss
  • Achieves practical robotics performance: 29% latency reduction on RoboChallenge benchmark while maintaining action prediction accuracy
  • Enables researchers without institutional resources to build and customize VLAs, democratizing foundation model development in robotics
Abstract: We introduce A1, a fully transparent and open-source Vision-Language-Action (VLA) model designed for robot control. Unlike existing VLA models which are either proprietary or partially open, A1 provides complete transparency: open weights, architecture specifications, training data, and training code. We implement adaptive inference strategies including early termination of diffusion processes and truncated flow matching to reduce computational overhead. On robotics benchmarks, A1 achieves up to 72% latency reduction in inference compared to standard VLA inference while maintaining action quality. Our work aims to democratize VLA development and enable community-driven improvements in open-source robotics.
cs.RO
Theodor Wulff, Federico Tavella, Rahul Singh Maharjan, Manith Adikari, Angelo Cangelosi

Core Contributions

  • Addresses semantic grounding problem in hierarchical VLAs through explicit language-trajectory alignment using contrastive learning
  • Uses offline preference learning to align high-level language descriptions with low-level action trajectories, reducing compounding errors in hierarchical policies
  • Demonstrates that explicit alignment improves both language understanding and action execution quality compared to end-to-end training without grounding
  • Enables clearer interpretation of what each hierarchical level learns about language-action relationships
Abstract: Hierarchical Vision-Language-Action models improve sample efficiency by decomposing tasks into high-level language understanding and low-level action execution. However, semantic grounding between language descriptions and actual action trajectories remains challenging. We propose a contrastive learning approach that explicitly aligns language embeddings with trajectory embeddings using offline preference learning. Our method learns to identify which language descriptions best match specific behaviors, improving the coherence of hierarchical policies. Experiments show improved task success rates and more interpretable policy hierarchies compared to end-to-end learning baselines.

Robot Safety & Trustworthy AI

4 papers
cs.RO eess.SY
Ioannis Stefanakos, Roisin Bradley, Radu Calinescu, Beverley Townsend, Tianyuan Wang

Core Contributions

  • Applies SHARD and STPA (Systems-Theoretic Process Analysis) hazard analysis methods to medical robotic systems, establishing systematic safety assurance for MammoBot
  • Identifies critical failure modes in robot-assisted medical procedures where patient harm could result from mechanical, control, or procedural failures
  • Demonstrates how formal hazard analysis methods scale to real medical robots, providing comprehensive safety documentation required for clinical deployment
  • Bridges gap between robotics research and medical device regulation by applying structured hazard identification techniques
Abstract: Robot-assisted medical procedures require rigorous safety assurance due to direct patient contact and clinical consequences. This paper applies SHARD (Software Hazard Analysis and Resolution in Design) and STPA (Systems-Theoretic Process Analysis) to conduct comprehensive hazard analysis for MammoBot, a robotic system designed to assist mammography procedures. We identify critical failure modes, assess their potential consequences, and propose mitigation strategies. Our systematic approach demonstrates how formal hazard analysis methods, traditionally used in aerospace and automotive, can assure safety for clinical robotic systems. The analysis informs design improvements and provides documentation required for regulatory approval.
cs.RO
Ike Obi, Vishnunandan L. N. Venkatesh, Weizheng Wang, Ruiqi Wang, Dayoon Suh

Core Contributions

  • Introduces SafeGate: a neurosymbolic architecture that validates LLM-generated robot commands before execution using the Z3 SMT solver
  • Defines task safety contracts specifying resource constraints, geometric constraints, and allowable actions—enabling machine-verifiable safety properties
  • Prevents unsafe commands from reaching actuators by catching constraint violations at planning time, reducing failure modes compared to reactive safety monitors
  • Demonstrates first formal safety validation framework for LLM-robot systems, enabling safe deployment despite LLM brittleness
Abstract: Large language models (LLMs) are increasingly used for high-level robot planning, but their tendency to produce invalid or unsafe commands limits deployment. We propose SafeGate, a neurosymbolic system combining LLM planning with formal verification. SafeGate uses task safety contracts (formal specifications of resource and geometric constraints) and Z3 SMT solving to verify that proposed actions satisfy safety properties before execution. Our approach prevents unsafe commands from reaching robotic actuators, providing formal assurance despite LLM brittleness. We demonstrate SafeGate on navigation and manipulation tasks, showing that pre-execution verification significantly improves system safety without sacrificing task performance.
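The pre-execution gating pattern behind SafeGate can be sketched in a few lines. This is a simplified stand-in: plain Python predicates take the place of the Z3 SMT query the paper describes, and the `Command` fields, contract clauses, and numeric limits are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    action: str
    x: float  # target position in meters (hypothetical workspace frame)
    y: float
    payload_kg: float


# Hypothetical task safety contract: allowed actions, a geometric box,
# and a resource limit. Each entry is (clause description, predicate).
CONTRACT = [
    ("action is allowed", lambda c: c.action in {"move", "pick", "place"}),
    ("target inside workspace", lambda c: 0.0 <= c.x <= 1.0 and 0.0 <= c.y <= 1.0),
    ("payload within limit", lambda c: c.payload_kg <= 2.0),
]


def gate(command):
    """Return (ok, violated_clauses); forward the command only if ok is True."""
    violated = [desc for desc, holds in CONTRACT if not holds(command)]
    return (not violated, violated)


safe = gate(Command("pick", 0.4, 0.7, 1.5))
unsafe = gate(Command("throw", 1.4, 0.7, 3.0))
print(safe)
print(unsafe)
```

The second command violates all three clauses and would never reach the actuators. An SMT-based gate would encode the same clauses as solver constraints, which also allows reasoning over symbolic rather than only concrete command parameters.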
cs.RO cs.HC
Yifan Xu, Xiao Zhan, Akilu Yunusa Kaltungo, Ming Shan Ng, Tsukasa Ishizawa

Core Contributions

  • First dialogue-based framework enabling humans to query robot safety decisions in human-robot collaboration through natural language
  • Supports three explanation types: causal explanations (why safety decision was made), contrastive explanations (why alternative was rejected), and counterfactual explanations (what would change decision)
  • Enables human understanding of robot safety policies through interactive dialogue rather than opaque black-box decisions
  • Improves trust in collaborative robots by making safety reasoning transparent and interpretable
Abstract: In human-robot collaboration (HRC), robots make safety decisions that directly impact worker safety and productivity. When these decisions are opaque, human operators cannot understand or contest them. We propose an interactive dialogue framework enabling humans to query robot safety decisions and receive explanations in natural language. Our system supports three explanation types: causal (why was this decision made?), contrastive (why not this alternative?), and counterfactual (what would change the decision?). We demonstrate the framework on a collaborative manipulation task and show that dialogue-based explanations improve human understanding of robot safety policies compared to static documentation.
cs.RO
Hanqing Liu, Songping Wang, Jiahuan Long, Jiacheng Hou, Jialiang Sun

Core Contributions

  • First systematic jailbreak attack framework targeting World Action Models (WAMs) used in robot control, achieving 84.2% attack success rate on LingBot-VA
  • Shows WAMs are vulnerable to adversarial prompts despite being trained on safety-filtered data, revealing fundamental limitations in model-based safety
  • Demonstrates that WAM vulnerabilities can cause incorrect world state predictions, leading to unsafe robot behaviors not caught by downstream safety checks
  • Highlights need for robust WAM development and defense mechanisms beyond prompt-level safety filtering
Abstract: World Action Models (WAMs) predict how the world changes in response to robot actions, serving as planning and learning backbones for autonomous systems. We present the first jailbreak attack framework for WAMs, demonstrating that these models are vulnerable to adversarial prompts that cause incorrect world state predictions. Our attacks achieve 84.2% success rate on LingBot-VA, causing the model to predict unsafe or impossible world states. We analyze failure modes and show how WAM vulnerabilities can cascade into unsafe robot behaviors. Our work establishes a security benchmark for WAM development and motivates defenses against adversarial inputs.

Manipulation & Grasping

6 papers
cs.RO eess.SY
Elizaveta Semenyakina, Ivan Snegirev, Mariya Lezina, Miguel Altamirano Cabrera, Safina Gulyamova

Core Contributions

  • Introduces force maps that encode material properties (stiffness, deformability) into grasp planning, moving beyond geometric-only grasp selection
  • Uses SAM3D (segmentation model) plus Isaac Sim physics simulation to predict material-specific contact forces and structural safety
  • Enables dexterous hands to select grasps that won't crush fragile objects or slip on slippery surfaces by reasoning about contact forces
  • Demonstrates how physics simulation can provide grounding for grasp planning, reducing real-world failures from material assumptions
Abstract: Selecting grasps that respect material properties is critical for dexterous manipulation of everyday objects. We introduce GraspSense, a framework that grounds grasp planning in physics-based force analysis. Our approach uses segmentation models (SAM3D) to identify object materials and Isaac Sim to simulate contact forces under candidate grasps. We compute force maps that represent expected pressure distributions, enabling selection of grasps safe for the specific material. For fragile objects, our system avoids excessive crushing forces; for slippery surfaces, it ensures sufficient normal force. Experiments on diverse objects demonstrate improved grasp success rates and material safety compared to geometry-only planning.
cs.RO
Yue Feng, Weicheng Huang, Chen Qiu, Huixu Dong, I-Ming Chen

Core Contributions

  • Designs low-cost 6-DOF force/torque sensor using 3D printing and antagonistic springs with magnetic encoders, democratizing force sensing on resource-constrained robots
  • Achieves 3.8% full-scale error, competitive with commercial sensors that cost over $10,000, making force feedback practical for research robotics
  • Open-sources design and code, enabling rapid adoption in labs and small companies previously unable to afford commercial sensors
  • Demonstrates that novel mechanical designs using off-the-shelf components can match sensor accuracy of expensive commercial alternatives
Abstract: Force/torque sensing is essential for contact-rich manipulation but commercial 6-DOF sensors are expensive (>$10,000), limiting their use in academic robotics. We introduce Delta6, an open-source 6-DOF force-sensing end-effector that costs approximately 50 USD to manufacture. Our design uses 3D-printed flexible components combined with antagonistic spring mechanisms and low-cost magnetic encoders. Despite its simplicity, Delta6 achieves 3.8% full-scale error, competitive with commercial sensors. We provide complete CAD models, assembly instructions, and calibration software to enable widespread adoption. Delta6 democratizes force feedback for robot manipulation research and enables new applications previously limited by sensor cost.
cs.RO
Yongliang Wang, Cristian C. Beltran-Hernandez, Tomoya Takahashi, Masashi Hamaya

Core Contributions

  • Combines soft robot hand simulation with evolutionary optimization to learn scooping trajectories for granular materials, a contact-rich task traditionally difficult to model
  • Uses physics-based simulation of deformable conical hand geometry interacting with granular materials to predict scooping success
  • Evolutionary optimization discovers non-intuitive trajectories that maximize granule capture by reasoning about deformation and flow dynamics
  • Demonstrates sim-to-real transfer showing learned scooping policies work on physical robot despite material variability
Abstract: Scooping granular materials is a challenging contact-rich task where traditional geometric planning fails due to complex material dynamics. We propose a simulation-driven approach that learns scooping trajectories through evolutionary optimization of parametric motion primitives. Our physics simulator accurately models a soft conical hand interacting with granular materials, allowing us to evaluate candidate trajectories. Evolutionary algorithms search the trajectory space for motions maximizing material capture. We demonstrate that evolved trajectories outperform hand-designed baselines and show successful transfer to real robot hardware. Our approach illustrates how physics-based simulation can guide learning for contact-rich manipulation tasks.
cs.RO
Raman Talwar, Remko Proesmans, Thomas Lips, Andreas Verleysen, Francis Wyffels

Core Contributions

  • Uses instrumented training, in which a microphone is added to the robot fingertip during learning, as privileged supervision to teach gentle contact control without crushing buttons
  • Shows that audio-based contact sensing during training (privileged information not available at test time) improves learned force control policies
  • Demonstrates novel approach to contact learning: bootstrap robot learning with additional sensing, then deploy with vision and proprioception only
  • Achieves more reliable and gentle button pressing compared to purely vision-based learning, improving real-world usability
Abstract: Robots often need to interact with delicate objects without applying excessive force, yet learning appropriate force control from vision alone is challenging. We propose instrumented learning: augmenting the robot with additional sensing (microphone on fingertip) during the training phase to provide privileged supervision about contact events. The robot learns to predict contact from visual and proprioceptive cues while supervised by audio signals. At deployment, the extra sensing is removed, leaving a policy that controls force based on vision and proprioception. We demonstrate the approach on gentle button pressing tasks, showing that audio-supervised training produces more reliable and gentler contact behaviors than vision-only baselines.
cs.RO
Krzysztof Wojciechowski, Ege Gursoy, Arthur Haffemayer, Sebastien Kleff, Vincent Bonnet

Core Contributions

  • Combines diffusion-based motion priors with force-feedback MPC for industrial deburring, enabling reactive force control with learned task guidance
  • Uses learned diffusion models to suggest trajectories while MPC enforces force constraints and avoids obstacles in real-time
  • Addresses industrial manipulation challenge: maintaining target contact force (deburring pressure) while avoiding collisions in manufacturing environments
  • Demonstrates how learning and control can be integrated: learning captures task structure, MPC provides real-time safety guarantees
Abstract: Robotic deburring requires maintaining precise contact forces while navigating obstacles in tight manufacturing spaces. We propose a hybrid approach combining learning and control: diffusion-based motion priors suggest promising trajectories for deburring, while model predictive control (MPC) enforces force feedback constraints and collision avoidance in real-time. The learned prior captures task-specific knowledge about appropriate deburring motions, while MPC provides reactive safety guarantees. Our approach outperforms both learning-only and control-only baselines on simulated and real deburring tasks, demonstrating the benefit of combining learned task priors with principled real-time control.
cs.RO
Chengkai Wu, Ruilin Wang, Yixin Zeng, Jiayuan Wang, Mingjie Zhang

Core Contributions

  • Proposes unified framework for successive mobile manipulation (repeating manipulation tasks while moving between locations) with reliability-aware planning
  • Models trade-off between task efficiency (speed) and reliability (success rate) through probabilistic prediction of action success
  • Achieves 26-82% success rate improvement compared to baselines by adapting trajectory timing and parameters based on estimated reliability
  • Addresses practical mobile manipulation challenge: long task sequences fail if individual actions have modest success rates
Abstract: Mobile manipulation robots often perform long sequences of manipulation tasks while navigating between locations. In this setting, even actions with high individual success rates compound into failures over long sequences. We propose a unified framework that adapts manipulation strategies based on predicted action reliability. Our approach estimates success probability for each action and jointly optimizes trajectory parameters to maximize total task reliability while maintaining efficiency. We demonstrate on sequential manipulation benchmarks that reliability-aware planning achieves 26-82% improvement in overall task success compared to efficiency-only baselines, enabling practical long-horizon mobile manipulation.
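The compounding effect mentioned in this abstract is easy to quantify. The sketch below assumes action outcomes are independent, which is a simplification, and uses a hypothetical 95% per-action success rate that does not come from the paper.

```python
import math


def sequence_success(per_action_success):
    """Probability that every action succeeds, assuming independent outcomes."""
    return math.prod(per_action_success)


# Hypothetical per-action success rate of 95%, repeated n times.
for n in (1, 5, 10, 20):
    print(n, round(sequence_success([0.95] * n), 3))
```

Under these assumptions, a sequence of twenty actions that each succeed 95% of the time completes only about 36% of the time, which is why planning for reliability rather than speed alone matters for long sequences.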

Navigation & Localization

3 papers
cs.RO cs.CV
Yijie Deng, Shuaihang Yuan, Yi Fang

Core Contributions

  • Proposes training-free 6-DoF pose recovery for image-goal navigation using any-view geometry, enabling precise final-meter localization to visual targets
  • Recovers full 6D pose (position + orientation) from single goal image without learning or fine-tuning, enabling rapid deployment to new environments
  • Achieves 93.1% navigation success on Gibson benchmark with 0.27m position error, demonstrating practical precision for embodied navigation
  • Enables navigation to goals where exact match isn't possible by recovering precise pose relative to goal image
Abstract: Image-goal navigation requires robots to reach a location photographed from a specific viewpoint. Standard approaches navigate to the location but struggle with precise orientation alignment. We propose AnyImageNav, which recovers 6-DoF pose (position and orientation) relative to a goal image using geometric constraints, without requiring training or environment-specific fine-tuning. Our approach leverages visual correspondences and epipolar geometry to recover the camera pose that best aligns with the goal viewpoint. On Gibson and other benchmarks, AnyImageNav achieves 93.1% success with 0.27m position error, demonstrating precise last-meter localization. The training-free nature enables rapid deployment to novel environments.
cs.CV cs.RO
Xiang Zhang, Tengfei Wang, Fang Xu, Xin Wang, Zongqian Zhan

Core Contributions

  • Adapts 3D Gaussian Splatting (3DGS) to UAV localization by resolving scale ambiguity, which is critical in aerial scenarios that require metric accuracy
  • Introduces scale-aware pose initialization that determines metric scale from sparse LiDAR measurements, addressing a fundamental limitation of monocular methods
  • Uses Laplacian reliability masking to identify and downweight unreliable Gaussian features (sky regions, moving objects), improving robustness in outdoor scenes
  • Enables metric-accurate UAV localization in large-scale environments where centimeter precision is required for safe autonomous flight
Abstract: UAV localization requires metric-accurate pose estimation in large-scale environments. While 3D Gaussian Splatting (3DGS) offers fast rendering, applying it to localization is non-trivial due to the scale ambiguity inherent in monocular vision. We propose LSGS-Loc, which extends 3DGS-based localization to UAV scenarios by solving two key challenges: (1) scale ambiguity via sparse LiDAR-informed initialization, and (2) robustness to dynamic elements via Laplacian reliability masking. Our method identifies and downweights unreliable Gaussian features (sky, moving objects) during localization. Experiments on large-scale UAV datasets demonstrate improved localization accuracy compared to standard visual localization methods.
cs.RO cs.AI
Wuyang Luan, Junhui Li, Weiguang Zhao, Wenjian Zhang, Tieru Wu

Core Contributions

  • Proposes RSBM: rectified Schrödinger bridge matching for visual navigation that generates action sequences in just 3 steps, enabling real-time robot control
  • Identifies velocity structure invariance, the principle that action sequences keep similar velocity profiles across environments, as the property that enables this step reduction
  • Achieves 92% navigation success by learning policies that maintain consistent velocity structure rather than memorizing specific trajectories
  • Enables deployment on resource-constrained robots where many-step generation is too slow for real-time control
Abstract: Generative models for visual navigation typically require many diffusion steps to produce action sequences, limiting real-time deployment. We propose RSBM (Rectified Schrödinger Bridge Matching), which generates navigation action sequences in just 3 steps. Our key insight is that navigation behaviors exhibit velocity structure invariance: action sequences maintain similar velocity profiles across diverse environments. This principle enables aggressive step reduction without sacrificing performance. We demonstrate 92% navigation success on visual navigation benchmarks with only 3 generation steps. RSBM enables practical deployment of generative policies on resource-constrained robotic platforms.
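One general reason few-step generation can work is that straighter transport paths are integrated accurately with fewer solver steps. The sketch below illustrates only that generic property with a fixed-step Euler solver. It is not RSBM, and the two velocity fields, the start and goal points, and the function names are assumptions made for illustration.

```python
import math

def euler_integrate(velocity, x0, steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        v = velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

start, goal = [0.0, 0.0], [1.0, 2.0]

def straight_field(x, t):
    # Constant velocity along the straight line from start to goal.
    return [g - s for g, s in zip(goal, start)]

def curved_field(x, t):
    # Quarter-turn rotation about the origin: a strongly curved path
    # whose exact flow carries (1, 0) to (0, 1) at t=1.
    w = math.pi / 2
    return [-w * x[1], w * x[0]]

straight_err = math.dist(euler_integrate(straight_field, start, 3), goal)
curved_err = math.dist(euler_integrate(curved_field, [1.0, 0.0], 3), [0.0, 1.0])
print(straight_err, curved_err)
```

With three steps the straight path is recovered up to floating-point error, while the curved path retains a visible error, which is the intuition behind straightening the generative path before cutting the step count.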

Aerial Robotics & Control

3 papers
eess.SY cs.RO
Daniel M. Cherenson, Dimitra Panagou

Core Contributions

  • Proposes SI-OCP: staggered integral online conformal prediction providing formal uncertainty quantification for neural network-based adaptive control
  • Proves coverage guarantees: the uncertainty estimates are mathematically guaranteed to contain the true prediction error with specified probability
  • Enables safe use of deep neural networks in control loops by wrapping DNNs with formal confidence bounds, validated on quadrotor with DNN adaptive controller
  • Demonstrates practical deployment: robust tube MPC can enforce safety constraints using uncertainty estimates from SI-OCP
Abstract: Neural networks in control loops can provide superior performance but lack formal guarantees, limiting deployment in safety-critical systems. We propose SI-OCP (Staggered Integral Online Conformal Prediction), which provides formally guaranteed uncertainty quantification for neural network predictions in online control settings. SI-OCP computes confidence sets that are guaranteed to contain prediction errors with a specified probability (e.g., 95%), enabling the controller to account for neural network uncertainty. We validate SI-OCP on quadrotor control with a DNN-based adaptive controller and show that robust tube MPC can enforce safety constraints despite neural network uncertainty.
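The idea of an online confidence set can be sketched with a generic quantile-tracking conformal update: the radius grows after a miss and shrinks slightly after a cover. This is not the staggered integral variant the paper proposes; the learning rate, the synthetic Gaussian errors, and the function name are assumptions for illustration.

```python
import random

def online_conformal_radius(scores, alpha=0.05, lr=0.05, q0=0.0):
    """Track a (1 - alpha) quantile of nonconformity scores online.

    Generic quantile-tracking update (not SI-OCP): q moves up by
    lr * (1 - alpha) after a miss and down by lr * alpha after a cover.
    """
    q = q0
    radii, misses = [], 0
    for s in scores:
        radii.append(q)              # radius used before seeing score s
        miss = 1.0 if s > q else 0.0
        misses += int(miss)
        q += lr * (miss - alpha)
    return radii, misses / len(scores)

random.seed(0)
# Synthetic stand-in for DNN prediction-error magnitudes.
errors = [abs(random.gauss(0.0, 1.0)) for _ in range(5000)]
radii, miss_rate = online_conformal_radius(errors)
print(f"empirical miss rate: {miss_rate:.3f}")
```

For bounded scores, the long-run miss rate of this update stays close to alpha, and a downstream controller such as tube MPC could treat the current radius as its error bound.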
cs.RO
Tianyue Wu, Guangtong Xu, Zihan Wang, Junxiao Lin, Tianyang Chen

Core Contributions

  • Develops RL-based sensorimotor policies enabling quadrotors to navigate gaps with 5 cm clearance at 90-degree tilt, near the boundary of aerodynamic stability
  • Policies are reactive to moving gaps, adjusting trajectories in real time rather than using pre-computed paths, enabling dynamic obstacle navigation
  • Demonstrates reproducible deployment of aggressive maneuvers on physical quadrotors, moving beyond simulation-only results to repeatable real-world performance
  • Shows how end-to-end learning can discover control policies that humans would struggle to design manually due to extreme nonlinearity
Abstract: Navigating constrained spaces with aerial robots requires aggressive control policies at the edge of stability. We develop sensorimotor policies trained with reinforcement learning to achieve extreme maneuvers: quadrotors navigating gaps with 5 cm clearance while tilted 90 degrees. Our policies react in real time to moving gaps, adjusting mid-flight rather than executing pre-planned paths. We validate repeatability on physical quadrotors, demonstrating that learned policies achieve consistent performance on boundary-case maneuvers. Our results show that end-to-end reinforcement learning can discover control strategies that approach aerodynamic limits, advancing the capabilities of aerial robots in constrained environments.
eess.SY cs.LG cs.RO
Alex Zongo, Peng Wei

Core Contributions

  • Analyzes energy overhead caused by conflict resolution maneuvers in eVTOL flight corridors, critical for urban air mobility mission planning
  • Uses mean-value principle (MVP) to quantify how trajectory modifications for deconfliction affect battery consumption
  • Shows median energy overhead of less than 1.5% for typical air traffic, supporting accurate energy reserve planning and flight range prediction
  • Develops ML models for energy reserve estimation, enabling eVTOL path planning that accounts for realistic conflict resolution constraints
Abstract: Urban air mobility (UAM) systems require reliable energy planning under realistic constraints including conflict avoidance with other aircraft. We analyze the energy overhead caused by deconfliction maneuvers in eVTOL operations using mean-value principle optimization. Our analysis quantifies how trajectory modifications to avoid conflicts affect battery consumption, showing median overhead less than 1.5% for typical traffic densities. We develop machine learning models that predict required energy reserves given expected traffic density, enabling eVTOL mission planning that accounts for conflict resolution. Our results inform flight planning systems and energy management for emerging urban air mobility networks.

Hardware Design & Surgical Robotics

3 papers
cs.RO
Doina Pisla, Ionut Zima, Calin Vaida, Andrei Cailean, Marius Miclaus

Core Contributions

  • Designs 4-DOF flexible laparoscopic surgical instrument with 10mm diameter, enabling minimally invasive access to constrained surgical spaces
  • Develops scissor-linkage kinematic model enabling precise control of instrument tip despite complex 4-DOF coupling
  • Validates design using ATHENA parallel robot as test platform, demonstrating repeatability and accuracy needed for surgical applications
  • Advances surgical robotics hardware: the 10mm diameter fits standard surgical cannulas, easing integration with existing minimally invasive surgical systems
Abstract: Laparoscopic surgical instruments require compact, flexible designs to navigate tight anatomical spaces while maintaining precise control. We develop a 4-DOF surgical robotic instrument with a 10mm outer diameter compatible with standard surgical cannulas. The instrument employs scissor-linkage mechanisms to achieve dexterous control within extreme size constraints. We derive kinematic models enabling precise tip control despite the complex linkage coupling. Experimental validation using the ATHENA robot confirms repeatability within surgical accuracy requirements. Our design represents an advance in surgical robotics hardware, enabling integration with existing minimally invasive surgical systems.
cs.RO
Aastha Mishra, Aman Singh, Shishir Kolathaya

Core Contributions

  • Proposes co-design framework jointly optimizing mechanics (linkage geometry, materials), actuation (motor/gearbox selection), and control for jumping robots
  • Achieves a 42% jump distance improvement by optimizing all three dimensions simultaneously rather than through sequential design iterations
  • Demonstrates that hardware constraints (motor torque, gearbox ratio, spring stiffness) are as important as control algorithms for achieving extreme behaviors
  • Provides methodology for future robotics design: co-optimization outperforms traditional design pipelines where mechanics and control are separate
Abstract: High-performance robot behaviors like jumping require simultaneous optimization of mechanical design, actuator selection, and control strategies. We present a co-design framework for a five-bar monoped that jointly optimizes: (1) linkage geometry and materials, (2) motor and gearbox selection, and (3) control parameters. Our approach uses multi-objective optimization to balance competing objectives: maximum jump distance, minimal energy consumption, and robustness to impact. Results demonstrate 42% improvement in jump distance through co-design compared to traditional sequential design iterations. Our methodology establishes principles for holistic robot design where hardware and software constraints are optimized together.
cs.RO cs.CV
Russell H. Taylor, Gregory D. Hager, et al.

Core Contributions

  • Comprehensive historical report on NSF Engineering Research Center (ERC) for medical robotics and image-guided intervention spanning multiple decades
  • Documents foundational work in surgical robotics, image-guided surgery, and computer-assisted intervention that influenced modern medical robotics
  • Highlights evolution of technologies: from early computer vision for surgery to modern robotic systems with haptic feedback and autonomous capabilities
  • Provides perspective on sustained interdisciplinary research and technology transfer from academic research to clinical practice
Abstract: This report documents the research and accomplishments of the Johns Hopkins University Engineering Research Center for Computer-Integrated Surgical Systems and Technology (CISST ERC). Spanning multiple decades, the center has pioneered foundational work in surgical robotics, computer vision for image-guided surgery, and autonomous surgical systems. The report describes key technologies including da Vinci surgical robot integration, image-guided interventions, haptic feedback systems, and emerging autonomous surgical capabilities. We review technology transfer to clinical practice, discussing how academic innovations have translated to FDA-cleared surgical systems. The report reflects on lessons learned and future directions for computer-assisted and robotic surgery.

Multi-Agent & Swarm Systems

3 papers
cs.RO
Tom Bachard, Gong Yiming, Ibuki Kawamata, Akira Kakugo, Nathanael Aubert-Kato

Core Contributions

  • First semantic analysis of molecular swarm behaviors using DNA-functionalized microtubules, extending swarm robotics principles to molecular scale
  • Uses semantic embeddings to learn and characterize collective behaviors without hand-crafted features, enabling discovery of behavior classes
  • Demonstrates how machine learning can extract meaningful behavior patterns from molecular swarms, bridging biology and robotics
  • Opens new research direction: understanding and controlling behavior at molecular scale using swarm robotics concepts
Abstract: Molecular swarms—collections of DNA-functionalized microtubules exhibiting collective behavior—present opportunities to understand coordination at biological scales. We apply semantic embedding techniques to analyze behaviors in DNA-functionalized molecular swarms, discovering interpretable behavior classes without requiring hand-crafted features. Our approach uses video analysis of microtubule dynamics to generate semantic representations capturing collective motion patterns. We identify distinct behavioral regimes (clustering, collective motion, phase transitions) and analyze their dependence on DNA-functionalization parameters. Our work demonstrates how machine learning can extract biological insight from molecular swarms and explores principles of coordination at sub-cellular scales.
cs.RO cs.CV
Li Kang, Yutao Fan, Rui Li, Heng Zhou, Yiran Qin

Core Contributions

  • Proposes compositional environment (CoEnv) framework enabling multi-arm manipulation by combining real-to-sim reconstruction with VLM-based planning
  • Real-to-sim reconstruction captures environment geometry and object pose from visual input, enabling physics-based collaboration planning
  • VLM planner decomposes collaborative manipulation into subtasks, assigning them to robots and ensuring physical feasibility
  • Demonstrates first framework for automated multi-robot task planning that grounds plans in reconstructed physics-based environments
Abstract: Multi-robot manipulation requires reasoning about physical interactions and task decomposition. We introduce CoEnv, a compositional environment framework that bridges visual perception and physics-based planning. Our approach combines: (1) real-to-sim reconstruction that captures object geometry and poses from RGB-D input, and (2) VLM-based task planning that decomposes manipulation into collaborative subtasks. The VLM reasons about task structure and proposes robot-arm assignments, while physics simulation validates feasibility. We demonstrate CoEnv on multi-arm manipulation benchmarks, showing how compositional reasoning enables complex collaborative tasks like coordinated object rearrangement and assembly.
cs.RO eess.SY
Jesus Bautista, Enric Morella, Lili Wang, Hector Garcia de Marina

Core Contributions

  • Develops leaderless formation control using complex-number representations enabling simultaneous translation, rotation, scaling, and shearing of formations
  • No central coordinator required: each robot uses only local neighbor information, enabling scalability to large swarms
  • Complex-number formalism captures formation geometry compactly, keeping the distributed control analysis tractable and supporting closed-form stability proofs
  • Enables flexible formation morphing where formation shape evolves dynamically while maintaining local communication constraints
Abstract: Distributed formation control for robot swarms typically requires either a designated leader or global coordination, limiting scalability. We propose a leaderless formation control approach using complex-number representations of robot positions. Our method enables formations to undergo arbitrary affine transformations (translation, rotation, scaling, shearing) while maintaining stability with only local neighbor communication. The complex-number formalism elegantly captures formation geometry and enables closed-form stability proofs. We demonstrate scalability to large swarms and adaptive formation morphing in simulation and on physical robot teams.
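To make the complex-number representation concrete: treating each planar position as z = x + iy, multiplication by a complex gain rotates and scales a formation, and addition translates it. A shear is not expressible by multiplication alone, but any planar affine map can be written as z -> a*z + c*conj(z) + b. The sketch below shows only this representation, not the paper's distributed control law; the square formation, the parameter values, and the function name are assumptions.

```python
import cmath

def affine_complex(z, a=1 + 0j, c=0j, b=0j):
    """Apply the planar affine map z -> a*z + c*conj(z) + b."""
    return a * z + c * z.conjugate() + b

# Hypothetical four-robot square formation.
square = [0 + 0j, 1 + 0j, 1 + 1j, 0 + 1j]

# Rotate by 90 degrees, scale by 2, translate by (3, 1): no conjugate term needed.
gain = 2 * cmath.exp(1j * cmath.pi / 2)
moved = [affine_complex(z, a=gain, b=3 + 1j) for z in square]

# Horizontal shear (x, y) -> (x + k*y, y) requires the conjugate term.
k = 0.5
sheared = [affine_complex(z, a=1 - 0.5j * k, c=0.5j * k) for z in square]

print(moved)
print(sheared)
```

In this sketch the corner at (1, 1) moves to (1.5, 1) under the shear, since x + k*y = 1.5 while y is unchanged.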