
SAVE: Protagonist Diversification with Structure Agnostic Video Editing (ECCV 2024)

This repository contains the official implementation of SAVE: Protagonist Diversification with Structure Agnostic Video Editing.

Project Website | arXiv: 2312.02503

Teaser

🐱 A cat is roaring ➜ 🐶 A dog is <S_mot> / 🐯 A tiger is <S_mot>

😎 A man is skiing ➜ 🐻 A bear is <S_mot> / 🐭 Mickey-Mouse is <S_mot>

SAVE reframes video editing as a motion inversion problem: it seeks a motion word <S_mot> in the textual embedding space that faithfully represents the motion in a source video. Editing then amounts to isolating the motion from the single source video via <S_mot> and modifying the protagonist accordingly.
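The optimization behind this is textual-inversion-style: a new token embedding is trained against the usual denoising loss on the source video while the pretrained weights stay frozen. The sketch below illustrates that loop for a single motion token using plain 2D Stable Diffusion on individual frames; it is a simplified stand-in, not the repository's code (which also registers an appearance token and temporally inflates the UNet):

```python
# Minimal textual-inversion-style sketch of learning a motion token <s2>.
# Illustrative only: plain 2D Stable Diffusion on single frames, whereas
# the actual method fine-tunes a temporally inflated UNet on video clips.
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline, DDPMScheduler

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
noise_scheduler = DDPMScheduler.from_config(pipe.scheduler.config)

# Register the pseudo motion word and make only its embedding row trainable.
pipe.tokenizer.add_tokens(["<s2>"])
pipe.text_encoder.resize_token_embeddings(len(pipe.tokenizer))
token_id = pipe.tokenizer.convert_tokens_to_ids("<s2>")

pipe.vae.requires_grad_(False)
pipe.unet.requires_grad_(False)
pipe.text_encoder.requires_grad_(False)
embeddings = pipe.text_encoder.get_input_embeddings()
embeddings.weight.requires_grad_(True)
optimizer = torch.optim.AdamW([embeddings.weight], lr=5e-4)

ids = pipe.tokenizer("a cat is <s2>", padding="max_length",
                     max_length=pipe.tokenizer.model_max_length,
                     return_tensors="pt").input_ids

# Dummy stand-in for a source-video frame, shape (B, 3, H, W) in [-1, 1].
frames = torch.rand(1, 3, 512, 512) * 2 - 1

for step in range(100):  # step count is illustrative
    latents = pipe.vae.encode(frames).latent_dist.sample() * 0.18215
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                      (latents.shape[0],))
    noisy = noise_scheduler.add_noise(latents, noise, t)

    text_emb = pipe.text_encoder(ids)[0]
    pred = pipe.unet(noisy, t, encoder_hidden_states=text_emb).sample
    loss = F.mse_loss(pred, noise)
    loss.backward()

    # Zero gradients for every embedding row except the new motion token.
    grad = embeddings.weight.grad
    grad[torch.arange(grad.shape[0]) != token_id] = 0
    optimizer.step()
    optimizer.zero_grad()
```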

Setup

Requirements

pip install -r requirements.txt

Weights

We use Stable Diffusion v1-4 as our base text-to-image model and fine-tune it on a reference video for text-to-video generation. Example video weights are available on Google Drive.
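If you want to prefetch the base checkpoint before training, the public Hugging Face Hub id for Stable Diffusion v1-4 can be downloaded with huggingface_hub (the Hub id is inferred from the stated base model, not taken from this repository):

```python
# Prefetch the Stable Diffusion v1-4 base weights into the local HF cache.
from huggingface_hub import snapshot_download

snapshot_download("CompVis/stable-diffusion-v1-4")
```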

Training

To fine-tune the text-to-image diffusion model on a custom video, run this command:

python run_train.py --config configs/<video-name>-train.yaml

The configuration file <video-name>-train.yaml contains the following arguments (an illustrative config follows the list):

  • output_dir - Directory in which to save the weights.
  • placeholder_tokens - Pseudo-words separated by |, e.g., <s1>|<s2>.
  • initializer_tokens - Initialization words separated by |, e.g., cat|roaring.
  • sentence_component - Use <o> for appearance words and <v> for motion words, e.g., <o>|<v>.
  • num_s1_train_epochs - Number of epochs for appearance pre-registration.
  • exp_localization_weight - Weight for the cross-attention loss (recommended range: 1e-4 to 5e-4).
  • train_data: video_path - Path to the source video.
  • train_data: prompt - Source prompt containing the pseudo-words from placeholder_tokens, e.g., a <s1> cat is <s2>.
  • n_sample_frames - Number of frames to sample from the source video.
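Putting the fields together, an illustrative <video-name>-train.yaml could look like the following; the values, paths, and exact nesting are placeholders inferred from the list above, not copied from the repository:

```yaml
output_dir: outputs/cat-roaring     # directory to save the weights
placeholder_tokens: <s1>|<s2>
initializer_tokens: cat|roaring
sentence_component: <o>|<v>
num_s1_train_epochs: 100            # appearance pre-registration; value illustrative
exp_localization_weight: 2.0e-4     # within the recommended 1e-4 to 5e-4
train_data:
  video_path: data/cat-roaring.mp4
  prompt: a <s1> cat is <s2>
n_sample_frames: 8                  # value illustrative
```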

Video Editing

Once the fine-tuned weights are ready, run this command:

python run_inference.py --config configs/<video-name>-inference.yaml

The configuration file <video-name>-inference.yaml contains the following arguments (an illustrative config follows the list):

  • pretrained_model_path - Directory containing the saved weights.
  • image_path - Path to the source video.
  • placeholder_tokens - Pseudo-words separated by |, e.g., <s1>|<s2>.
  • sentence_component - Use <o> for appearance words and <v> for motion words, e.g., <o>|<v>.
  • prompt - Source prompt containing the pseudo-words from placeholder_tokens, e.g., a <s1> cat is <s2>.
  • prompts - List of source and editing prompts, e.g., [a <s1> cat is <s2>, a dog is <s2>].
  • blend_word - Protagonists in the source and edited videos, e.g., [cat, dog].
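Analogously, an illustrative <video-name>-inference.yaml; again, the values and nesting are placeholders inferred from the list above:

```yaml
pretrained_model_path: outputs/cat-roaring
image_path: data/cat-roaring.mp4
placeholder_tokens: <s1>|<s2>
sentence_component: <o>|<v>
prompt: a <s1> cat is <s2>
prompts:
  - a <s1> cat is <s2>      # source prompt
  - a dog is <s2>           # editing prompt
blend_word: [cat, dog]      # protagonists in source and edited videos
```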

Citation

@inproceedings{song2025save,
  title={Save: Protagonist diversification with structure agnostic video editing},
  author={Song, Yeji and Shin, Wonsik and Lee, Junsoo and Kim, Jeesoo and Kwak, Nojun},
  booktitle={European Conference on Computer Vision},
  pages={41--57},
  year={2025},
  organization={Springer}
}

Acknowledgements

This code builds upon diffusers, Tune-A-Video, and Video-P2P. Thanks to the authors for open-sourcing their work!
