Today's batch of 30 papers centers on three interlocking questions that define where robotics is headed. The first is how to make large models actually work on physical robots. Four papers address VLA models from orthogonal directions: SOLE-R1 replaces reward engineering with a video-language reasoning model as the sole RL reward signal; FocusVLA identifies that current architectures waste visual token computation on task-irrelevant regions; StreamingVLA decouples observation, generation, and execution stages to eliminate serial stalling; and ManipArena demonstrates that top simulation performers fail in real-world evaluation, a finding that reframes what "state of the art" means. Together, these papers argue that the VLA paradigm is transitioning from proof-of-concept to engineering discipline, where reliability under physical constraints matters more than benchmark scores.
The second theme is closing the sensing gap in manipulation. Tac2Real enables GPU-parallelized visuotactile simulation fast enough for online RL, while TAG provides a low-cost 21-DoF glove with high-resolution tactile feedback for teleoperation data collection: two papers that attack the same problem from opposite directions (sim-first vs. human-demonstration-first). Tele-Catch bridges them with a shared-autonomy framework that blends glove teleoperation into a diffusion policy for dynamic catching tasks. Meanwhile, the active stereo camera ablation on a Unitree G1 humanoid challenges the assumption that sensor richness improves learning, finding the opposite in data-limited regimes. Collectively, these papers suggest that the bottleneck for dexterous manipulation is no longer algorithms but sensing infrastructure and data pipelines.
The third theme is evaluation and standardization, a meta-question running through otherwise disparate papers. The START position statement on thrombectomy robotics standardizes testbed tiers and metrics for surgical AI. The WoZ interface study reveals that the choice of wizard interface shapes what human-robot interaction data looks like. The egocentric vs. allocentric navigation study shows that safety evaluations from bird's-eye perspectives systematically miss pedestrian discomfort. And ManipArena's real-world evaluation exposes the simulation-to-reality gap quantitatively. The field appears to be grappling with a collective measurement problem: progress claims are hard to compare because evaluation setups are not standardized, and today's batch reflects a growing effort to fix that.
Vision-language-action models pushing toward real-world robot intelligence
Contact-aware manipulation via haptic sensing, teleoperation, and sim-to-real tactile transfer
Embodied agents navigating unstructured environments using semantic and topological maps
Motion planning and control for UAVs, ground vehicles, and maritime systems
Studies on human-robot trust, sociability, collaboration interfaces, and workflow orchestration
Unconventional robot designs from soft electromagnetic crawlers to self-rotating UAVs
Surgical robotics standards, industrial disassembly, and LLM-based multi-agent coordination