We propose SketchVideo, which aims to achieve sketch-based spatial and motion control for video generation and to support fine-grained editing of real or synthetic videos. Please check our project page and paper for more information.
(Generation examples: input frame | generated video. See the project page for the videos.)
(Editing examples: input sketch | original video | generated video. See the project page for the videos.)
(Editing examples with two keyframe sketches: input sketch 1 | input sketch 2 | original video | generated video. See the project page for the videos.)
- [2025.04.01]: 🔥🔥 Release code and model weights.
- [2025.03.30]: Launch the project page and update the arXiv preprint.
| Model | Resolution | GPU Mem. & Inference Time (A100, DDIM 50 steps) | Checkpoint |
| --- | --- | --- | --- |
| SketchGen | 720x480 | ~21 GB / 95 s | Hugging Face |
| SketchEdit | 720x480 | ~23 GB / 230 s | Hugging Face |
Our method is built on the pretrained CogVideoX-2b model. We add an additional sketch condition network for sketch-based generation and editing.
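The actual condition network design is described in the paper; purely as a rough illustration of the general idea, below is a minimal ControlNet-style condition branch in PyTorch, in which trainable copies of some backbone blocks process the sketch features and feed zero-initialized residuals back into the frozen base model. All names here (`SketchConditionNet`, `zero_projs`, ...) are hypothetical, not the repository's API.

```python
import copy
import torch.nn as nn

class SketchConditionNet(nn.Module):
    """ControlNet-style branch: trainable copies of backbone blocks whose
    zero-initialized outputs are added back into the frozen base model."""

    def __init__(self, base_blocks, hidden_dim):
        super().__init__()
        # Trainable copies of the (frozen) backbone blocks.
        self.blocks = nn.ModuleList(copy.deepcopy(b) for b in base_blocks)
        self.zero_projs = nn.ModuleList(
            nn.Linear(hidden_dim, hidden_dim) for _ in base_blocks
        )
        for proj in self.zero_projs:
            nn.init.zeros_(proj.weight)  # zero init: branch has no effect initially
            nn.init.zeros_(proj.bias)

    def forward(self, hidden_states, sketch_features):
        # Inject encoded sketch features, run the trainable copies, and collect
        # per-block residuals to be added to the frozen backbone's activations.
        h = hidden_states + sketch_features
        residuals = []
        for block, proj in zip(self.blocks, self.zero_projs):
            h = block(h)
            residuals.append(proj(h))
        return residuals
```

Zero-initializing the output projections makes the condition branch a no-op at the start of training, so fine-tuning begins from the unmodified behavior of the pretrained base model.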
Currently, SketchVideo supports generating videos of up to 49 frames at a resolution of 720x480. For generation, input sketches are assumed to have a resolution of 720x480. For editing, the input video is assumed to have 49 frames at a resolution of 720x480.
The inference time can be reduced by using fewer DDIM steps.
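For reference, here is a minimal sketch of running the base CogVideoX-2b model through the `diffusers` `CogVideoXPipeline` at a reduced step count. This uses only the public base model, not our sketch condition network, and the prompt and output path are arbitrary examples:

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load the base CogVideoX-2b model (no sketch conditioning in this snippet).
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
pipe.to("cuda")

video = pipe(
    prompt="a sailboat drifting across a calm lake at sunset",
    num_frames=49,           # clip length supported by SketchVideo
    height=480,
    width=720,               # 720x480, the supported resolution
    num_inference_steps=30,  # fewer than the default 50 steps -> faster inference
).frames[0]
export_to_video(video, "output.mp4", fps=8)
```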
conda create -n sketchvideo python=3.10
conda activate sketchvideo
pip install -r requirements.txt
Note that `diffusers==0.30.1` is required.
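If needed, you can sanity-check the pinned version from Python:

```python
# Verify that the required diffusers version is installed.
import diffusers
assert diffusers.__version__ == "0.30.1", f"found diffusers {diffusers.__version__}"
```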
Download the pretrained SketchGen network [Hugging Face] and the pretrained CogVideoX-2b [Hugging Face] video generation model. Then set `--control_checkpoint_path` and `--cogvideo_checkpoint_path` in the scripts to the corresponding paths.
Generate a video from a single keyframe sketch:
cd generation
sh scripts/test_sketch_gen_single.sh
Generate a video from two keyframe sketches:
cd generation
sh scripts/test_sketch_gen_two.sh
Download the pretrained SketchEdit network [Hugging Face] and the pretrained CogVideoX-2b [Hugging Face] video generation model. Then, for each editing example, modify `config.py` in the `editing/editing_exp` folder: set `controlnet_path` to the SketchEdit weights path, and set `vae_path` and `pipeline_path` to the CogVideoX weights path.
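As an illustration, the relevant lines of an edited `config.py` might look like the following. The variable names come from this README, but the local paths are placeholders; point them at wherever you stored the downloaded weights:

```python
# Hypothetical excerpt of editing/editing_exp/<example>/config.py --
# the paths below are placeholders for your local checkpoint locations.
controlnet_path = "./checkpoints/SketchEdit"    # SketchEdit weights
vae_path = "./checkpoints/CogVideoX-2b/vae"     # CogVideoX VAE weights
pipeline_path = "./checkpoints/CogVideoX-2b"    # CogVideoX pipeline weights
```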
Edit a video based on keyframe sketches:
cd editing
sh scripts/test_sketch_edit.sh
The script covers editing examples based on one or two keyframe sketches.
If you find our code useful, please consider citing our paper:
@inproceedings{Liu2025sketchvideo,
author = {Liu, Feng-Lin and Fu, Hongbo and Wang, Xintao and Ye, Weicai and Wan, Pengfei and Zhang, Di and Gao, Lin},
title = {SketchVideo: Sketch-based Video Generation and Editing},
booktitle = {{IEEE/CVF} Conference on Computer Vision and Pattern Recognition},
publisher = {{IEEE}},
year = {2025},
}
We thank the authors of CogVideoX and ControlNet. Our code introduction is modified from the ToonCrafter template.
Our framework enables interesting sketch-based video generation and editing, but due to the variability of the generative video prior, a successful result is not guaranteed. Trying different random seeds often helps to obtain the best results.
This project strives to impact the domain of AI-driven video generation positively. Users are granted the freedom to create videos using this tool, but they are expected to comply with local laws and utilize it responsibly. The developers do not assume any responsibility for potential misuse by users.