Efstathios Karypidis1,3, Ioannis Kakogeorgiou1, Spyros Gidaris2, Nikos Komodakis1,4,5
1Archimedes/Athena RC 2valeo.ai
3National Technical University of Athens 4University of Crete 5IACM-Forth
This repository contains the official implementation of the paper: Advancing Semantic Future Prediction through Multimodal Visual Sequence Transformers
- News & ToDos
- Installation
- Dataset Preparation
- FUTURIST Training
- Evaluation
- Demo
- Citation
- Acknowledgements
2025-01-14: The arXiv preprint and GitHub repository are released!
- Add new branches with code for training with a VQ-VAE and separate tokens for each modality
The code is tested with Python 3.11 and PyTorch 2.2.0+cu121 on Ubuntu 22.04.5 LTS. Create a new conda environment:
conda create -n futurist python=3.11
conda activate futurist
Install PyTorch, then clone the repository and install the remaining requirements:
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu121
git clone https://github.com/Sta8is/FUTURIST
cd FUTURIST
pip install -r requirements.txt
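Optionally, you can run a quick sanity check to confirm that PyTorch is installed and CUDA is available:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"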
We use the Cityscapes dataset for our experiments. Specifically, we use the leftImg8bit_sequence_trainvaltest
sequences. To extract segmentation maps we use Segmenter, and to extract depth maps we use DepthAnythingV2. You can skip downloading and preprocessing leftImg8bit_sequence_trainvaltest
and simply download the precomputed segmentation maps from here and depth maps from here. In addition, to evaluate FUTURIST, gtFine
needs to be processed with cityscapesScripts (see the example after the directory layout below). Alternatively, you can download the processed dataset from here. The final structure of the dataset should be as follows:
cityscapes
│
├───leftImg8bit_sequence_depthv2
│ ├───train
│ ├───val
├───leftImg8bit_sequence_segmaps_ids
│ ├───train
│ ├───val
├───gtFine
│ ├───train
│ ├───val
│ ├───test
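As a sketch of the gtFine preprocessing step mentioned above (assuming you use the standard cityscapesscripts package and its createTrainIdLabelImgs tool; adapt the dataset path to your setup), the train-ID label images can be generated with:
pip install cityscapesscripts
CITYSCAPES_DATASET=/path/to/cityscapes python -m cityscapesscripts.preparation.createTrainIdLabelImgs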
To train FUTURIST with the default parameters, use the following command:
python train_futurist.py --num_gpus=8 --precision 16-mixed --eval_freq 10 --batch_size 2 --max_epochs 3200 --lr_base 4e-5 --patch_size 16 \
--eval_mode_during_training --evaluate --single_step_sample_train --masking "simple_replace" --seperable_attention --random_horizontal_flip \
--random_crop --use_fc_bias --data_path="/path/to/cityscapes/leftImg8bit_sequence_segmaps_ids" --modality segmaps_depth \
--sequence_length 5 --num_classes 19 --emb_dim 10,10 --accum_iter 4 --w_s 0.85 \
--dst_path "/logdir/futurist" --masking_strategy "par_shared_excl" --modal_fusion "concat"
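With these settings, the effective batch size is batch_size × num_gpus × accum_iter = 2 × 8 × 4 = 64 sequences per optimizer step (assuming --accum_iter denotes gradient-accumulation steps). If you train on fewer GPUs, consider adjusting --accum_iter or --lr_base accordingly.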
You can also download the pre-trained model from here. To evaluate a trained FUTURIST model, use the following command:
python train_futurist.py --num_gpus=4 --precision 16-mixed --eval_freq 10 --batch_size 2 --max_epochs 3200 --lr_base 4e-5 --patch_size 16 \
--eval_mode_during_training --evaluate --single_step_sample_train --masking "simple_replace" --seperable_attention --random_horizontal_flip \
--random_crop --use_fc_bias --data_path="/path/to/cityscapes/leftImg8bit_sequence_segmaps_ids" --modality segmaps_depth \
--sequence_length 5 --num_classes 19 --emb_dim 10,10 --accum_iter 4 --w_s 0.85 \
--dst_path "/logdir/futurist" --masking_strategy "par_shared_excl" --modal_fusion "concat" \
--eval_ckpt_only --ckpt "/path/to/futurist.ckpt"
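Note that the evaluation command reuses the training configuration; the only additions are --eval_ckpt_only and --ckpt, which should point to the downloaded or your own trained checkpoint.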
We provide two quick demos.
- Demo.
If you found FUTURIST useful, please consider starring ⭐ us on GitHub and citing 📚 us in your research!
@article{karypidis2025advancing,
title={Advancing Semantic Future Prediction through Multimodal Visual Sequence Transformers},
author={Karypidis, Efstathios and Kakogeorgiou, Ioannis and Gidaris, Spyros and Komodakis, Nikos},
journal={arXiv preprint arXiv:2501.08303},
year={2025}
}
Our code is partially based on Maskgit-pytorch, DepthAnythingV2, and Segmenter; we thank the authors for their work and open-source code.