
Awesome Diffusion Video-to-Video (V2V): a collection of papers on diffusion model-based video editing, a.k.a. video-to-video (V2V) translation, along with video editing benchmark code.


Diffusion Model-Based Video Editing: A Survey


Wenhao Sun, Rong-Cheng Tu, Jingyi Liao, Dacheng Tao
Nanyang Technological University

Teaser video: teaser.mp4

๐Ÿป Citation

If you find this repository helpful, please consider citing our paper:

@article{sun2024v2vsurvey,
    author = {Wenhao Sun and Rong-Cheng Tu and Jingyi Liao and Dacheng Tao},
    title = {Diffusion Model-Based Video Editing: A Survey},
    journal = {CoRR},
    volume = {abs/2407.07111},
    year = {2024}
}

📌 Introduction


Figure: Overview of diffusion-based video editing model components.

Tip: Papers are listed in reverse chronological order, formatted as (Conference/Journal Year) Title, Authors.
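Many of the methods catalogued below, particularly those under Attention Feature Injection and Diffusion Latents Manipulation, share the same backbone: deterministically invert the source video's latents with DDIM, then re-generate under the editing condition while injecting features or constraining the latents. As a rough orientation only, here is a minimal, self-contained sketch of that invert-then-regenerate loop; the `eps_model` stub, the toy noise schedule, and the latent shapes are illustrative assumptions, not code from this repository or from any listed paper.

```python
import numpy as np

T = 50                                     # number of diffusion steps (toy value)
alpha_bar = np.linspace(0.9999, 0.01, T)   # toy cumulative noise schedule

def eps_model(x, t, prompt):
    """Stand-in noise predictor. A real editor would call a text-conditioned
    video U-Net here and inject source-video attention features or motion
    correspondences; `prompt` is unused in this toy version."""
    rng = np.random.default_rng(t)          # deterministic per-step output
    return 0.1 * x + 0.01 * rng.standard_normal(x.shape)

def ddim_step(x, eps, ab_from, ab_to):
    """One deterministic DDIM transition between cumulative-alpha levels."""
    x0 = (x - np.sqrt(1.0 - ab_from) * eps) / np.sqrt(ab_from)
    return np.sqrt(ab_to) * x0 + np.sqrt(1.0 - ab_to) * eps

def invert(latents, prompt):
    """DDIM inversion: map clean source latents forward to noise."""
    x = latents
    for t in range(T - 1):
        x = ddim_step(x, eps_model(x, t, prompt), alpha_bar[t], alpha_bar[t + 1])
    return x

def generate(noisy, prompt):
    """DDIM sampling: denoise back under the editing prompt."""
    x = noisy
    for t in reversed(range(1, T)):
        x = ddim_step(x, eps_model(x, t, prompt), alpha_bar[t], alpha_bar[t - 1])
    return x

source_latents = np.random.randn(8, 4, 32, 32)       # (frames, channels, h, w)
noise = invert(source_latents, "a cat running on grass")
edited = generate(noise, "a tiger running on grass")
print(edited.shape)                                   # (8, 4, 32, 32)
```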

Network and Training Paradigm

Temporal Adaption

  • (Preprint '24) VIA: Unified Spatiotemporal Video Adaptation Framework for Global and Local Video Editing, Gu et al.
  • (ICML '24) Slicedit: Zero-Shot Video Editing With Text-to-Image Diffusion Models Using Spatio-Temporal Slices, Cohen et al.
  • (ECCV '24) Video Editing via Factorized Diffusion Distillation, Singer et al.
  • (CVPR '24) MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers, Ma et al.
  • (Preprint '23) Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis, Wu et al.
  • (CVPR '24) VidToMe: Video Token Merging for Zero-Shot Video Editing, Li et al.
  • (CVPR '24) SimDA: Simple Diffusion Adapter for Efficient Video Generation, Xing et al.
  • (NeurIPS '23) Towards Consistent Video Editing with Text-to-Image Diffusion Models, Zhang et al.
  • (ICCV '23) Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation, Wu et al.

(back to top)

Structure Conditioning

  • (Preprint '24) EVA: Zero-shot Accurate Attributes and Multi-Object Video Editing, Yang et al.
  • (IJCAI '24) Diffutoon: High-Resolution Editable Toon Shading via Diffusion Models, Duan et al.
  • (CVPR '24) FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis, Liang et al.
  • (Preprint '23) Motion-Conditioned Image Animation for Video Editing, Yan et al.
  • (CVPR '24) LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation, Wu et al.
  • (ICLR '24) Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models, Jeong and Ye
  • (Preprint '23) CCEdit: Creative and Controllable Video Editing via Diffusion Models, Feng et al.
  • (Preprint '23) MagicEdit: High-Fidelity and Temporally Coherent Video Editing, Liew et al.
  • (Preprint '23) VideoControlNet: A Motion-Guided Video-to-Video Translation Framework by Using Diffusion Model with ControlNet, Hu and Xu
  • (NeurIPS '23) VideoComposer: Compositional Video Synthesis with Motion Controllability, Wang et al.
  • (Preprint '23) Structure and Content-Guided Video Synthesis with Diffusion Models, Esser et al.

(back to top)

Training Modification

  • (Preprint '24) Generative Video Propagation, Liu et al.
  • (Preprint '24) Movie Gen: A Cast of Media Foundation Models, Polyak et al.
  • (Preprint '24) Still-Moving: Customized Video Generation without Customized Video Data, Chefer et al.
  • (Preprint '24) EffiVED: Efficient Video Editing via Text-instruction Diffusion Models, Zhang et al.
  • (ECCV '24) Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models, Ren et al.
  • (Preprint '24) VASE: Object-Centric Appearance and Shape Manipulation of Real Videos, Peruzzo et al.
  • (Preprint '23) Customizing Motion in Text-to-Video Diffusion Models, Materzynska et al.
  • (ECCV '24) SAVE: Protagonist Diversification with Structure Agnostic Video Editing, Song et al.
  • (CVPR '24) VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models, Jeong et al.
  • (CVPR '24) DreamVideo: Composing Your Dream Videos with Customized Subject and Motion, Wei et al.
  • (ICLR '24) Consistent Video-to-Video Transfer Using Synthetic Dataset, Cheng et al.
  • (Preprint '23) VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models, Xing et al.
  • (ECCV '24) MotionDirector: Motion Customization of Text-to-Video Diffusion Models, Zhao et al.
  • (ICME '24) InstructVid2Vid: Controllable Video Editing with Natural Language Instructions, Qin et al.
  • (Preprint '23) Dreamix: Video Diffusion Models are General Video Editors, Molad et al.

(back to top)

Attention Feature Injection

Inversion-Based Feature Injection

  • (TMLR '24) AnyV2V: A Tuning-Free Framework For Any Video-to-Video Editing Tasks, Ku et al.
  • (ECCV '24) Object-Centric Diffusion: Efficient Video Editing, Kahatapitiya et al.
  • (Preprint '24) UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing, Bai et al.
  • (Preprint '23) Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts, Zhao et al.
  • (Preprint '23) Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models, Wang et al.
  • (ICCV '23) FateZero: Fusing Attentions for Zero-shot Text-based Video Editing, Qi et al.
  • (ACML '23) Edit-A-Video: Single Video Editing with Object-Aware Consistency, Shin et al.
  • (Preprint '23) Video-P2P: Video Editing with Cross-attention Control, Liu et al.

(back to top)

Motion-Based Feature Injection

  • (CVPR '24) FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation, Yang et al.
  • (ICLR '24) FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing, Cong et al.
  • (ICLR '24) TokenFlow: Consistent Diffusion Features for Consistent Video Editing, Geyer et al.

(back to top)

Diffusion Latents Manipulation

Latent Initialization

  • (CVPR '24) A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing, Li et al.
  • (Preprint '23) Video ControlNet: Towards Temporally Consistent Synthetic-to-Real Video Translation Using Conditional Image Diffusion Models, Chu et al.
  • (Preprint '23) Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models, Chen et al.
  • (ICCV '23) Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators, Khachatryan et al.

(back to top)

Latent Transition

  • (ICML '24) FRAG: Frequency Adapting Group for Diffusion Video Editing, Yoon et al.
  • (CVPR '24) GenVideo: One-shot target-image and shape-aware video editing using T2I diffusion models, Harsha et al.
  • (Preprint '24) MotionClone: Training-Free Motion Cloning for Controllable Video Generation, Ling et al.
  • (CVPR '24) RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models, Kara et al.
  • (CVPR '24) Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer, Yatim et al.
  • (Preprint '23) DiffSynth: Latent In-Iteration Deflickering for Realistic Video Synthesis, Duan et al.
  • (SIGGRAPH '23) Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation, Yang et al.
  • (ICLR '23) ControlVideo: Training-free Controllable Text-to-Video Generation, Zhang et al.
  • (ICCV '23) Pix2Video: Video Editing using Image Diffusion, Ceylan et al.

(back to top)

Canonical Representation

  • (Preprint '23) Neural Video Fields Editing, Yang et al.
  • (Preprint '23) DiffusionAtlas: High-Fidelity Consistent Diffusion Video Editing, Chang et al.
  • (ICCV '23) StableVideo: Text-driven Consistency-aware Diffusion Video Editing, Chai et al.
  • (CVPR '24) CoDeF: Content Deformation Fields for Temporally Consistent Video Processing, Ouyang et al.
  • (TMLR '24) VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing, Couairon et al.
  • (CVPR '23) Shape-aware Text-driven Layered Video Editing, Lee et al.

(back to top)

Novel Conditioning

Point-Based Editing

  • (SIGGRAPH '24) MotionCtrl: A Unified and Flexible Motion Controller for Video Generation, Wang et al.
  • (Preprint '23) Drag-A-Video: Non-rigid Video Editing with Point-based Interaction, Teng et al.
  • (ECCV '24) DragVideo: Interactive Drag-style Video Editing, Deng et al.
  • (CVPR '24) VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence, Gu et al.

(back to top)

Pose-Guided Human Action Editing

  • (Preprint '24) StableAnimator: High-Quality Identity-Preserving Human Image Animation, Tu et al.
  • (SIGGRAPH Asia '24) Fashion-VDM: Video Diffusion Model for Virtual Try-On, Karras et al.
  • (Preprint '24) Animate-X: Universal Character Image Animation with Enhanced Motion Representation, Tan et al.
  • (SCIS '24) UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation, Wang et al.
  • (Preprint '24) Zero-shot High-fidelity and Pose-controllable Character Animation, Zhu et al.
  • (Preprint '23) Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation, Hu et al.
  • (Preprint '23) MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model, Xu et al.
  • (ICML '24) MagicPose: Realistic Human Poses and Facial Expressions Retargeting with Identity-aware Diffusion, Chang et al.
  • (CVPR '24) DisCo: Disentangled Control for Realistic Human Dance Generation, Wang et al.
  • (ICCV '23) DreamPose: Fashion Image-to-Video Synthesis via Stable Diffusion, Karras et al.
  • (AAAI '23) Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos, Ma et al.

(back to top)

📜 Change Log

  • [28 Nov 2024] Update the list format to enhance clarity.
