- VLP: Video Language Planning, arXiv 2023. [Paper] [Website] [Code] [Google DeepMind]
- AVDC: Learning to Act from Actionless Videos through Dense Correspondences, arXiv 2023. [Paper] [Website] [Code]
- ATM: Any-point Trajectory Modeling for Policy Learning, RSS 2024. [Paper] [Website] [Code] [UC Berkeley]
- Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation, ECCV 2024. [Paper] [Website] [Code] [CMU]
- Dreamitate: Real-World Visuomotor Policy Learning via Video Generation, arXiv 2024. [Paper] [Website] [Code] [Columbia University]
- ARDuP: Active Region Video Diffusion for Universal Policies, arXiv 2024. [Paper]
- This&That: Language-Gesture Controlled Video Generation for Robot Planning, arXiv 2024. [Paper] [Website] [Code]
- Im2Flow2Act: Flow as the Cross-Domain Manipulation Interface, CoRL 2024. [Paper] [Website] [Code] [REAL-Stanford]
- CLOVER: Closed-Loop Visuomotor Control with Generative Expectation for Robotic Manipulation, NeurIPS 2024. [Paper] [Code] [OpenDriveLab]
- Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation, arXiv 2024. [Paper] [Website] [Google DeepMind]
- DynaMo: In-Domain Dynamics Pretraining for Visuo-Motor Control, arXiv 2024. [Paper] [Website] [Code]
- GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation, arXiv 2024. [Paper] [Website] [Robotics Research Team, ByteDance Research]
- VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model, arXiv 2024. [Paper] [Website] [Code]
- Towards Synergistic, Generalized, and Efficient Dual-System for Robotic Manipulation, arXiv 2024. [Paper] [Website]
- LAPA: Latent Action Pretraining from Videos, CoRL 2024. [Paper] [Website] [Code]
- Differentiable Robot Rendering, CoRL 2024. [Paper] [Website] [Code] [cvlab-columbia]
- OKAMI: Teaching Humanoid Robots Manipulation Skills through Single Video Imitation, CoRL 2024. [Paper] [Website]
- Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Datasets, arXiv 2024. [Paper] [Website] [Code]
- VideoAgent: Self-Improving Video Generation, arXiv 2024. [Paper] [Code]
- IGOR: Image-GOal Representations are the Atomic Control Units for Foundation Models in Embodied AI, arXiv 2024. [Paper] [Website]
- VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation, NeurIPS 2024. [Paper]
- Grounding Video Models to Actions through Goal Conditioned Exploration, arXiv 2024. [Paper] [Website] [Code]
- SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities, CVPR 2024. [Paper] [Website] [Unofficial Code]
- PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs, arXiv 2024.02. [Paper] [Website] [Demo]
- OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints, arXiv 2025.01. [Paper] [Website] [Code]
- RT-1: Robotics Transformer for Real-World Control at Scale, arXiv 2022. [Paper] [Website] [Code] [Robotics at Google]
- PaLM-E: An Embodied Multimodal Language Model, arXiv 2023. [Paper] [Website] [Robotics at Google]
- VQ-BeT: Behavior Generation with Latent Actions, ICML 2024 Spotlight. [Paper] [Website] [Code]
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control, arXiv 2023. [Paper] [Website] [Unofficial Code] [Google DeepMind]
- ALOHA: Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware, RSS 2023. [Paper] [Code] [Website]
- ACT: Action Chunking with Transformers, RSS 2023. [Paper] [Code] [Website]
- LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning, NeurIPS 2023. [Paper] [Website] [Code]
- UniSim: Learning Interactive Real-World Simulators, ICLR 2024 (Outstanding Paper Award). [Paper] [Website] [Google DeepMind]
- ALOHA 2: An Enhanced Low-Cost Hardware for Bimanual Teleoperation, arXiv 2024. [Paper] [Code] [Website] [Google DeepMind]
- Octo: An Open-Source Generalist Robot Policy, arXiv 2024. [Paper] [Website] [Code] [UC Berkeley]
- HPT: Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers, NeurIPS 2024. [Paper] [Website] [Code] [Kaiming He, MIT]
- RDT-1B: A Diffusion Foundation Model for Bimanual Manipulation, arXiv 2024. [Paper] [Code] [Website] [Jun Zhu, THU]
- GR-1: Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation, ICLR 2024. [Paper] [Website] [Code] [ByteDance Research]
- SimplerEnv: Simulated Manipulation Policy Evaluation Environments for Real Robot Setups, arXiv 2024. [Paper] [Website] [Code]
- 🔥 π0: A Vision-Language-Action Flow Model for General Robot Control, arXiv 2024. [Paper] [Website] [Physical Intelligence] [Unofficial Code]
- Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition, CoRL 2023. [Paper] [Website] [Code]
- "Data Scaling Laws in Imitation Learning for Robotic Manipulation", arXiv 2024. [Paper] [Website] [Code]
8 A100
- 3D-VLA: A 3D Vision-Language-Action Generative World Model, ICML 2024. [Paper] [Code] [Website] [UMass Foundation Model]
- A Joint Modeling of Vision-Language-Action for Target-oriented Grasping in Clutter, ICRA 2023. [Paper] [Code]
- CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation, arXiv 2024. [Paper] [Website] [Code]
- BYOVLA: Bring Your Own Vision-Language-Action Model, arXiv 2024. [Paper] [Website] [Code]
- VLMPC: Vision-Language Model Predictive Control for Robotic Manipulation, RSS 2024. [Paper] [Code] [Ran Song, Shandong University]
- Diffusion Policy: Visuomotor Policy Learning via Action Diffusion, RSS 2023. [Paper] [Website] [Code] [REAL-Stanford]
- 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations, RSS 2024. [Paper] [Website] [Code]
- iDP3: Generalizable Humanoid Manipulation with Improved 3D Diffusion Policies, arXiv 2024. [Paper] [Website] [Code]
- PointFlowMatch: Learning Robotic Manipulation Policies from Point Clouds with Conditional Flow Matching, CoRL 2024. [Paper] [Website] [Code]
- Lift3D Foundation Policy: Lifting 2D Large-Scale Pretrained Models for Robust 3D Robotic Manipulation, arXiv 2024. [Paper] [Website] [Code]
- OpenVLA: An Open-Source Vision-Language-Action Model, arXiv 2024. [Paper] [Code] [Website] [Stanford University] (64 A100 GPUs)
- VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks, arXiv 2024.12. [Paper] [Website] [Code]
- AgiBot World Colosseum. [Code] [Website]
- RoboVLMs: Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models, arXiv 2024.12. [Paper] [Website] [Code]
- [Robo-VLM]: an open-source implementation for using VLMs in instruction-based robot control.