
Awesome Diffusion Video-to-Video (V2V): a collection of papers on diffusion model-based video editing, a.k.a. video-to-video (V2V) translation, along with video editing benchmark code.


Diffusion Model-Based Video Editing: A Survey


Wenhao Sun, Rong-Cheng Tu, Jingyi Liao, Dacheng Tao
Nanyang Technological University

Teaser video: teaser.mp4

๐Ÿป Citation

If you find this repository helpful, please consider citing our paper:

@article{sun2024v2vsurvey,
    author = {Wenhao Sun and Rong-Cheng Tu and Jingyi Liao and Dacheng Tao},
    title = {Diffusion Model-Based Video Editing: A Survey},
    journal = {CoRR},
    volume = {abs/2407.07111},
    year = {2024}
}

📌 Introduction


Figure: Overview of diffusion-based video editing model components.

Tip: Papers are listed in reverse chronological order, formatted as (Conference/Journal Year) Title, Authors.
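Many of the methods catalogued below, particularly those under Attention Feature Injection and Diffusion Latents Manipulation, share the same backbone: deterministically invert the source video's latents with DDIM, then re-generate under the editing condition while injecting features or constraining the latents. As a rough orientation only, here is a minimal, self-contained sketch of that invert-then-regenerate loop; the `eps_model` stub, the toy noise schedule, and the latent shapes are illustrative assumptions, not code from this repository or from any listed paper.

```python
import numpy as np

T = 50                                     # number of diffusion steps (toy value)
alpha_bar = np.linspace(0.9999, 0.01, T)   # toy cumulative noise schedule

def eps_model(x, t, prompt):
    """Stand-in noise predictor. A real editor would call a text-conditioned
    video U-Net here and inject source-video attention features or motion
    correspondences; `prompt` is unused in this toy version."""
    rng = np.random.default_rng(t)          # deterministic per-step output
    return 0.1 * x + 0.01 * rng.standard_normal(x.shape)

def ddim_step(x, eps, ab_from, ab_to):
    """One deterministic DDIM transition between cumulative-alpha levels."""
    x0 = (x - np.sqrt(1.0 - ab_from) * eps) / np.sqrt(ab_from)
    return np.sqrt(ab_to) * x0 + np.sqrt(1.0 - ab_to) * eps

def invert(latents, prompt):
    """DDIM inversion: map clean source latents forward to noise."""
    x = latents
    for t in range(T - 1):
        x = ddim_step(x, eps_model(x, t, prompt), alpha_bar[t], alpha_bar[t + 1])
    return x

def generate(noisy, prompt):
    """DDIM sampling: denoise back under the editing prompt."""
    x = noisy
    for t in reversed(range(1, T)):
        x = ddim_step(x, eps_model(x, t, prompt), alpha_bar[t], alpha_bar[t - 1])
    return x

source_latents = np.random.randn(8, 4, 32, 32)       # (frames, channels, h, w)
noise = invert(source_latents, "a cat running on grass")
edited = generate(noise, "a tiger running on grass")
print(edited.shape)                                   # (8, 4, 32, 32)
```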

Network and Training Paradigm

Temporal Adaption

  • (Preprint '24) VIA: Unified Spatiotemporal Video Adaptation Framework for Global and Local Video Editing, Gu et al.
  • (ICML '24) Slicedit: Zero-Shot Video Editing With Text-to-Image Diffusion Models Using Spatio-Temporal Slices, Cohen et al.
  • (ECCV '24) Video Editing via Factorized Diffusion Distillation, Singer et al.
  • (CVPR '24) MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers, Ma et al.
  • (Preprint '23) Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis, Wu et al.
  • (CVPR '24) VidToMe: Video Token Merging for Zero-Shot Video Editing, Li et al.
  • (CVPR '24) SimDA: Simple Diffusion Adapter for Efficient Video Generation, Xing et al.
  • (NeurIPS '23) Towards Consistent Video Editing with Text-to-Image Diffusion Models, Zhang et al.
  • (ICCV '23) Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation, Wu et al.

(back to top)

Structure Conditioning

  • (Preprint '24) EVA: Zero-shot Accurate Attributes and Multi-Object Video Editing, Yang et al.
  • (IJCAI '24) Diffutoon: High-Resolution Editable Toon Shading via Diffusion Models, Duan et al.
  • (CVPR '24) FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis, Liang et al.
  • (Preprint '23) Motion-Conditioned Image Animation for Video Editing, Yan et al.
  • (CVPR '24) LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation, Wu et al.
  • (ICLR '24) Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models, Jeong and Ye
  • (Preprint '23) CCEdit: Creative and Controllable Video Editing via Diffusion Models, Feng et al.
  • (Preprint '23) MagicEdit: High-Fidelity and Temporally Coherent Video Editing, Liew et al.
  • (Preprint '23) VideoControlNet: A Motion-Guided Video-to-Video Translation Framework by Using Diffusion Model with ControlNet, Hu and Xu
  • (NeurIPS '23) VideoComposer: Compositional Video Synthesis with Motion Controllability, Wang et al.
  • (Preprint '23) Structure and Content-Guided Video Synthesis with Diffusion Models, Esser et al.

(back to top)

Training Modification

  • (Preprint '24) Generative Video Propagation, Liu et al.
  • (Preprint '24) Movie Gen: A Cast of Media Foundation Models, Polyak et al.
  • (Preprint '24) Still-Moving: Customized Video Generation without Customized Video Data, Chefer et al.
  • (Preprint '24) EffiVED: Efficient Video Editing via Text-instruction Diffusion Models, Zhang et al.
  • (ECCV '24) Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models, Ren et al.
  • (Preprint '24) VASE: Object-Centric Appearance and Shape Manipulation of Real Videos, Peruzzo et al.
  • (Preprint '23) Customizing Motion in Text-to-Video Diffusion Models, Materzynska et al.
  • (ECCV '24) SAVE: Protagonist Diversification with Structure Agnostic Video Editing, Song et al.
  • (CVPR '24) VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models, Jeong et al.
  • (CVPR '24) DreamVideo: Composing Your Dream Videos with Customized Subject and Motion, Wei et al.
  • (ICLR '24) Consistent Video-to-Video Transfer Using Synthetic Dataset, Cheng et al.
  • (Preprint '23) VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models, Xing et al.
  • (ECCV '24) MotionDirector: Motion Customization of Text-to-Video Diffusion Models, Zhao et al.
  • (ICME '24) InstructVid2Vid: Controllable Video Editing with Natural Language Instructions, Qin et al.
  • (Preprint '23) Dreamix: Video Diffusion Models are General Video Editors, Molad et al.

(back to top)

Attention Feature Injection

Inversion-Based Feature Injection

  • (TMLR '24) AnyV2V: A Tuning-Free Framework For Any Video-to-Video Editing Tasks, Ku et al.
  • (ECCV '24) Object-Centric Diffusion: Efficient Video Editing, Kahatapitiya et al.
  • (Preprint '24) UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing, Bai et al.
  • (Preprint '23) Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts, Zhao et al.
  • (Preprint '23) Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models, Wang et al.
  • (ICCV '23) FateZero: Fusing Attentions for Zero-shot Text-based Video Editing, Qi et al.
  • (ACML '23) Edit-A-Video: Single Video Editing with Object-Aware Consistency, Shin et al.
  • (Preprint '23) Video-P2P: Video Editing with Cross-attention Control, Liu et al.

(back to top)

Motion-Based Feature Injection

  • (CVPR '24) FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation, Yang et al.
  • (ICLR '24) FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing, Cong et al.
  • (ICLR '24) TokenFlow: Consistent Diffusion Features for Consistent Video Editing, Geyer et al.

(back to top)

Diffusion Latents Manipulation

Latent Initialization

  • (CVPR '24) A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing, Li et al.
  • (Preprint '23) Video ControlNet: Towards Temporally Consistent Synthetic-to-Real Video Translation Using Conditional Image Diffusion Models, Chu et al.
  • (Preprint '23) Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models, Chen et al.
  • (ICCV '23) Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators, Khachatryan et al.

(back to top)

Latent Transition

  • (ICML '24) FRAG: Frequency Adapting Group for Diffusion Video Editing, Yoon et al.
  • (CVPR '24) GenVideo: One-shot target-image and shape-aware video editing using T2I diffusion models, Harsha et al.
  • (Preprint '24) MotionClone: Training-Free Motion Cloning for Controllable Video Generation, Ling et al.
  • (CVPR '24) RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models, Kara et al.
  • (CVPR '24) Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer, Yatim et al.
  • (Preprint '23) DiffSynth: Latent In-Iteration Deflickering for Realistic Video Synthesis, Duan et al.
  • (SIGGRAPH '23) Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation, Yang et al.
  • (ICLR '23) ControlVideo: Training-free Controllable Text-to-Video Generation, Zhang et al.
  • (ICCV '23) Pix2Video: Video Editing using Image Diffusion, Ceylan et al.

(back to top)

Canonical Representation

  • (Preprint '23) Neural Video Fields Editing, Yang et al.
  • (Preprint '23) DiffusionAtlas: High-Fidelity Consistent Diffusion Video Editing, Chang et al.
  • (ICCV '23) StableVideo: Text-driven Consistency-aware Diffusion Video Editing, Chai et al.
  • (CVPR '24) CoDeF: Content Deformation Fields for Temporally Consistent Video Processing, Ouyang et al.
  • (TMLR '24) VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing, Couairon et al.
  • (CVPR '23) Shape-aware Text-driven Layered Video Editing, Lee et al.

(back to top)

Novel Conditioning

Point-Based Editing

  • (SIGGRAPH '24) MotionCtrl: A Unified and Flexible Motion Controller for Video Generation, Wang et al.
  • (Preprint '23) Drag-A-Video: Non-rigid Video Editing with Point-based Interaction, Teng et al.
  • (ECCV '24) DragVideo: Interactive Drag-style Video Editing, Deng et al.
  • (CVPR '24) VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence, Gu et al.

(back to top)

Pose-Guided Human Action Editing

  • (Preprint '24) StableAnimator: High-Quality Identity-Preserving Human Image Animation, Tu et al.
  • (SIGGRAPH Asia '24) Fashion-VDM: Video Diffusion Model for Virtual Try-On, Karras et al.
  • (Preprint '24) Animate-X: Universal Character Image Animation with Enhanced Motion Representation, Tan et al.
  • (SCIS '24) UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation, Wang et al.
  • (Preprint '24) Zero-shot High-fidelity and Pose-controllable Character Animation, Zhu et al.
  • (Preprint '23) Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation, Hu et al.
  • (Preprint '23) MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model, Xu et al.
  • (ICML '24) MagicPose: Realistic Human Poses and Facial Expressions Retargeting with Identity-aware Diffusion, Chang et al.
  • (CVPR '24) DisCo: Disentangled Control for Realistic Human Dance Generation, Wang et al.
  • (ICCV '23) DreamPose: Fashion Image-to-Video Synthesis via Stable Diffusion, Karras et al.
  • (AAAI '23) Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos, Ma et al.

(back to top)

📜 Change Log

  • [28 Nov 2024] Update the list format to enhance clarity.
