- VLP: Video Language Planning, arXiv 2023. [Paper] [Website] [Code] [Google DeepMind]
- AVDC: Learning to Act from Actionless Videos through Dense Correspondences, arXiv 2023. [Paper] [Website] [Code]
- ATM: Any-point Trajectory Modeling for Policy Learning, RSS 2024. [Paper] [Website] [Code] [UC Berkeley]
- Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation, ECCV 2024. [Paper] [Website] [Code] [CMU]
- Dreamitate: Real-World Visuomotor Policy Learning via Video Generation, arXiv 2024. [Paper] [Website] [Code] [Columbia University]
- ARDuP: Active Region Video Diffusion for Universal Policies, arXiv 2024. [Paper]
- This&That: Language-Gesture Controlled Video Generation for Robot Planning, arXiv 2024. [Paper] [Website] [Code]
- Im2Flow2Act: Flow as the Cross-Domain Manipulation Interface, CoRL 2024. [Paper] [Website] [Code] [REAL-Stanford]
- CLOVER: Closed-Loop Visuomotor Control with Generative Expectation for Robotic Manipulation, NeurIPS 2024. [Paper] [Code] [OpenDriveLab]
- Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation, arXiv 2024. [Paper] [Website] [Google DeepMind]
- DynaMo: In-Domain Dynamics Pretraining for Visuo-Motor Control, arXiv 2024. [Paper] [Website] [Code]
- GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation, arXiv 2024. [Paper] [Website] [Robotics Research Team, ByteDance Research]
- VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model, arXiv 2024. [Paper] [Website] [Code]
- Towards Synergistic, Generalized, and Efficient Dual-System for Robotic Manipulation, arXiv 2024. [Paper] [Website]
- LAPA: Latent Action Pretraining from Videos, CoRL 2024. [Paper] [Website] [Code]
- Differentiable Robot Rendering, CoRL 2024. [Paper] [Website] [Code] [cvlab-columbia]
- OKAMI: Teaching Humanoid Robots Manipulation Skills through Single Video Imitation, CoRL 2024. [Paper] [Website]
- Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Datasets, arXiv 2024. [Paper] [Website] [Code]
- VideoAgent: Self-Improving Video Generation, arXiv 2024. [Paper] [Code]
- IGOR: Image-GOal Representations are the Atomic Control Units for Foundation Models in Embodied AI, arXiv 2024. [Paper] [Website]
- VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation, NeurIPS 2024. [Paper]
- Grounding Video Models to Actions through Goal Conditioned Exploration, arXiv 2024. [Paper] [Website] [Code]
- SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities, CVPR 2024. [Paper] [Website] [Unofficial Code]
- PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs, arXiv 2024.02. [Paper] [Website] [Demo]
- OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints, arXiv 2025.01. [Paper] [Website] [Code]
- RT-1: Robotics Transformer for Real-World Control at Scale, arXiv 2022. [Paper] [Website] [Code] [Robotics at Google]
- PaLM-E: An Embodied Multimodal Language Model, arXiv 2023. [Paper] [Website] [Robotics at Google]
- VQ-BeT: Behavior Generation with Latent Actions, ICML 2024 Spotlight. [Paper] [Website] [Code]
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control, arXiv 2023. [Paper] [Website] [Unofficial Code] [Google DeepMind]
- ALOHA: Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware, RSS 2023. [Paper] [Code] [Website]
- ACT: Action Chunking with Transformers, RSS 2023. [Paper] [Code] [Website]
- LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning, NeurIPS 2023. [Paper] [Website] [Code]
- UniSim: Learning Interactive Real-World Simulators, ICLR 2024 (Outstanding Paper Award). [Paper] [Website] [Google DeepMind]
- ALOHA 2: An Enhanced Low-Cost Hardware for Bimanual Teleoperation, arXiv 2024. [Paper] [Code] [Website] [Google DeepMind]
- Octo: An Open-Source Generalist Robot Policy, arXiv 2024. [Paper] [Website] [Code] [UC Berkeley]
- HPT: Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers, NeurIPS 2024. [Paper] [Website] [Code] [Kaiming He, MIT]
- RDT-1B: A Diffusion Foundation Model for Bimanual Manipulation, arXiv 2024. [Paper] [Code] [Website] [Jun Zhu, THU]
- GR-1: Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation, ICLR 2024. [Paper] [Website] [Code] [ByteDance Research]
- SimplerEnv: Simulated Manipulation Policy Evaluation Environments for Real Robot Setups, arXiv 2024. [Paper] [Website] [Code]
- 🔥 π0: A Vision-Language-Action Flow Model for General Robot Control, arXiv 2024. [Paper] [Website] [Physical Intelligence] [Unofficial Code]
- Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition, CoRL 2023. [Paper] [Website] [Code]
- "Data Scaling Laws in Imitation Learning for Robotic Manipulation", arXiv 2024. [Paper] [Website] [Code]
8 A100
- 3D-VLA: A 3D Vision-Language-Action Generative World Model, ICML 2024. [Paper] [Code] [Website] [UMass Foundation Model]
- A Joint Modeling of Vision-Language-Action for Target-oriented Grasping in Clutter, ICRA 2023. [Paper] [Code]
- CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation, arXiv 2024. [Paper] [Website] [Code]
- BYOVLA: Bring Your Own Vision-Language-Action Model, arXiv 2024. [Paper] [Website] [Code]
- VLMPC: Vision-Language Model Predictive Control for Robotic Manipulation, RSS 2024. [Paper] [Code] [Ran Song, Shandong University]
- Diffusion Policy: Visuomotor Policy Learning via Action Diffusion, RSS 2023. [Paper] [Website] [Code] [REAL-Stanford]
- 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations, RSS 2024. [Paper] [Website] [Code]
- iDP3: Generalizable Humanoid Manipulation with Improved 3D Diffusion Policies, arXiv 2024. [Paper] [Website] [Code]
- PointFlowMatch: Learning Robotic Manipulation Policies from Point Clouds with Conditional Flow Matching, CoRL 2024. [Paper] [Website] [Code]
- Lift3D Foundation Policy: Lifting 2D Large-Scale Pretrained Models for Robust 3D Robotic Manipulation, arXiv 2024. [Paper] [Website] [Code]
- OpenVLA: An Open-Source Vision-Language-Action Model, arXiv 2024. [Paper] [Code] [Website] [Stanford University] (64 A100 GPUs)
- VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks, arXiv 2024.12. [Paper] [Website] [Code]
- AgiBot World Colosseum. [Code] [Website]
- RoboVLMs: Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models, arXiv 2024.12. [Paper] [Website] [Code]
- [Robo-VLM]: an open-source implementation for using VLMs in instruction-based robot control.