Large Manipulation Model

Video Generation

  • VLP: Video Language Planning, arXiv 2023. [Paper] [Website] [Code] [Google DeepMind] (the generate-then-act pattern shared by many entries here is sketched after this list)

  • AVDC: Learning to Act from Actionless Videos through Dense Correspondences, arXiv 2023. [Paper] [Website] [Code]

  • ATM: Any-point Trajectory Modeling for Policy Learning, RSS 2024. [Paper] [Website] [Code] [UC Berkeley]

  • Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation, ECCV 2024. [Paper] [Website] [Code] [CMU]

  • Dreamitate: Real-World Visuomotor Policy Learning via Video Generation, arXiv 2024. [Paper] [Website] [Code] [Columbia University]

  • ARDuP: Active Region Video Diffusion for Universal Policies, arXiv 2024. [Paper]

  • This&That: Language-Gesture Controlled Video Generation for Robot Planning, arXiv 2024. [Paper] [Website] [Code]

  • Im2Flow2Act: Flow as the Cross-Domain Manipulation Interface, CoRL 2024. [Paper] [Website] [Code] [REAL-Stanford]

  • CLOVER: Closed-Loop Visuomotor Control with Generative Expectation for Robotic Manipulation, NeurIPS 2024. [Paper] [Code] [OpenDriveLab]

  • Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation, arXiv 2024. [Paper] [Website] [Google DeepMind]

  • DynaMo: In-Domain Dynamics Pretraining for Visuo-Motor Control, arXiv 2024. [Paper] [Website] [Code]

  • GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation, arXiv 2024. [Paper] [Website] [Robotics Research Team, ByteDance Research]

  • VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model, arXiv 2024. [Paper] [Website] [Code]

  • Towards Synergistic, Generalized, and Efficient Dual-System for Robotic Manipulation, arXiv 2024. [Paper] [Website]

  • LAPA: Latent Action Pretraining from Videos, CoRL 2024. [Paper] [Website] [Code]

  • Differentiable Robot Rendering, CoRL 2024. [Paper] [Website] [Code] [cvlab-columbia]

  • OKAMI: Teaching Humanoid Robots Manipulation Skills through Single Video Imitation, CoRL 2024. [Paper] [Website]

  • Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Datasets, arXiv 2024. [Paper] [Website] [Code]

  • VideoAgent: Self-Improving Video Generation, arXiv 2024. [Paper] [Code]

  • IGOR: Image-GOal Representations are the Atomic Control Units for Foundation Models in Embodied AI, arXiv 2024. [Paper] [Website]

  • VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation, NeurIPS 2024. [Paper]

  • Grounding Video Models to Actions through Goal Conditioned Exploration, arXiv 2024. [Paper] [Website] [Code]
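
Many entries above (e.g. VLP, AVDC, CLOVER, Gen2Act) share a generate-then-act recipe: a video model imagines frames that progress toward the goal, and a separate module (inverse dynamics, dense correspondences, or point tracks) recovers executable actions from the imagined frames. Below is a minimal sketch of this pattern; `video_model` and `inv_dyn` are hypothetical stand-ins for trained models, not any specific codebase.

```python
import numpy as np

def video_plan_to_actions(video_model, inv_dyn, obs, goal_text, horizon=16):
    """Generate-then-act: predict a video plan, then decode actions.

    video_model and inv_dyn are hypothetical stand-ins for a trained
    text-conditioned video predictor and an inverse-dynamics model.
    """
    # 1) Imagine a visual plan: a short clip that starts at the current
    #    observation and ends near the language-specified goal.
    frames = video_model.sample(first_frame=obs, prompt=goal_text,
                                num_frames=horizon)           # (T, H, W, 3)

    # 2) Decode an action between each pair of consecutive predicted
    #    frames with an inverse-dynamics model: a_t ~ f(o_t, o_{t+1}).
    actions = [inv_dyn(frames[t], frames[t + 1]) for t in range(horizon - 1)]
    return np.stack(actions)                                  # (T-1, action_dim)
```

Closed-loop variants such as CLOVER execute only the first few decoded actions, then re-generate the plan from the new observation.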

LLM/VLM Guidance

  • SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities, CVPR 2024. [Paper] [Website] [Unofficial Code]

  • PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs, arXiv 2024.02. [Paper] [Website] [Demo] (see the sketch after this list)

  • OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints, arXiv 2025.01. [Paper] [Website] [Code]
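
PIVOT's iterative visual prompting is roughly a cross-entropy-method loop with the VLM as the scorer: draw candidate actions onto the image as numbered markers, let the VLM pick the best ones, and refit the proposal distribution around its picks. A minimal sketch, where `vlm.ask` and `draw_numbered_markers` are hypothetical helpers:

```python
import numpy as np

def pivot_select(vlm, image, task, iters=3, num_candidates=8):
    """PIVOT-style iterative visual prompting (simplified sketch)."""
    # Proposal distribution over candidate 2D waypoints, in pixel space.
    mean, std = np.array([0.0, 0.0]), np.array([60.0, 60.0])
    for _ in range(iters):
        # Sample candidates and draw them as numbered markers on the image.
        candidates = mean + std * np.random.randn(num_candidates, 2)
        annotated = draw_numbered_markers(image, candidates)  # hypothetical helper

        # Ask the VLM which numbered candidates best accomplish the task;
        # assume it answers with a list of candidate indices.
        picks = vlm.ask(annotated,
                        f"Which numbered points best achieve: {task}? "
                        "Answer with up to 3 numbers.")

        # Refit the proposal distribution around the chosen candidates.
        chosen = candidates[picks]
        mean, std = chosen.mean(axis=0), chosen.std(axis=0) + 1e-3
    return mean  # converged waypoint in pixel coordinates
```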

Structured Instructions

Vision-Language-Action

Diffusion Policy

  • Diffusion Policy: Visuomotor Policy Learning via Action Diffusion, RSS 2023. [Paper] [Website] [Code] [REAL-Stanford] (the denoising loop is sketched after this list)
  • 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations, RSS 2024. [Paper] [Website] [Code]
  • iDP3: Generalizable Humanoid Manipulation with Improved 3D Diffusion Policies, arXiv 2024. [Paper] [Website] [Code]
  • PointFlowMatch: Learning Robotic Manipulation Policies from Point Clouds with Conditional Flow Matching, CoRL 2024. [Paper] [Website] [Code]
  • Lift3D Foundation Policy: Lifting 2D Large-Scale Pretrained Models for Robust 3D Robotic Manipulation, arXiv 2024. [Paper] [Website] [Code]
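
At inference time these policies turn action prediction into iterative denoising: start from Gaussian noise over a short action sequence and repeatedly apply a learned noise predictor conditioned on the current observation. A schematic sketch, assuming a scheduler object that mirrors the diffusers DDPMScheduler interface:

```python
import torch

@torch.no_grad()
def sample_actions(denoiser, obs_features, scheduler, horizon=16, action_dim=7):
    """DDPM-style action sampling in the spirit of Diffusion Policy."""
    # Start from pure Gaussian noise over a whole action sequence.
    actions = torch.randn(1, horizon, action_dim)

    # Iteratively denoise, conditioning every step on the observation.
    for t in scheduler.timesteps:
        eps = denoiser(actions, t, obs_features)      # predict the added noise
        actions = scheduler.step(eps, t, actions).prev_sample
    return actions
```

In practice the policy is run receding-horizon: only the first few actions of the denoised sequence are executed before re-sampling from the new observation.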

Benchmark and Dataset

  • OpenVLA: An Open-Source Vision-Language-Action Model, arXiv 2024. [Paper] [Code] [Website] [Stanford University] (trained on 64 A100 GPUs; a minimal loading sketch follows this list)
  • VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks, arXiv 2024.12. [Paper] [Website] [Code]
  • AgiBot World Colosseum. [Code] [Website]
  • RoboVLMs: Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models, arXiv 2024.12. [Paper] [Website] [Code]
  • [Robo-VLM]: an open-source implementation tailored to using VLMs in instruction-based robot control.
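
As a usage-level pointer, OpenVLA publishes HuggingFace checkpoints; per its README they can be queried roughly as below. Names such as `predict_action` and `unnorm_key` come from the OpenVLA codebase and may change across releases, so treat this as a sketch rather than a pinned API.

```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

# Load the processor and model; the modeling code lives in the checkpoint repo.
processor = AutoProcessor.from_pretrained("openvla/openvla-7b",
                                          trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", torch_dtype=torch.bfloat16,
    trust_remote_code=True).to("cuda")

# One camera frame plus a language instruction in OpenVLA's prompt format.
image = Image.open("frame.png")  # placeholder path
prompt = "In: What action should the robot take to pick up the cup?\nOut:"

inputs = processor(prompt, image).to("cuda", dtype=torch.bfloat16)
# `unnorm_key` selects the dataset statistics used to un-normalize actions.
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
# `action` is a 7-DoF end-effector delta plus gripper command.
```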