🏠 Project Page | Paper | Model | Online Demo
MIDI is a 3D generative model for single-image to compositional 3D scene generation. Unlike existing methods that rely on reconstruction or retrieval, or recent approaches that generate scenes object by object in multiple stages, MIDI extends pre-trained image-to-3D object generation models into multi-instance diffusion models, enabling the simultaneous generation of multiple high-quality 3D instances with accurate spatial relationships and strong generalizability.
- High Quality: It produces diverse 3D scenes of high quality with intricate shapes.
- High Generalizability: It generalizes to real-image and stylized-image inputs, despite being trained only on synthetic data.
- High Efficiency: It generates 3D scenes from segmented instance images, without lengthy steps or time-consuming per-scene optimization.
- [2025-03] Released the model weights, Gradio demo, and inference scripts of MIDI-3D.
Clone the repo first:
git clone https://github.com/VAST-AI-Research/MIDI-3D.git
cd MIDI-3D
(Optional) Create a fresh conda env:
conda create -n midi python=3.10
conda activate midi
Install necessary packages (torch > 2):
# pytorch (select correct CUDA version)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
# other dependencies
pip install -r requirements.txt
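Before moving on, you can quickly check that the CUDA build of PyTorch was installed correctly; a minimal sanity check (not part of the repo) looks like this:

```python
# Minimal sanity check: confirm PyTorch is installed and can see a CUDA GPU.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```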
The following scripts will automatically download the model weights from VAST-AI/MIDI-3D to the local directory pretrained_weights/MIDI-3D.
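If you prefer to fetch the weights ahead of time (for example, on a machine with restricted network access at run time), a sketch using huggingface_hub is shown below; it assumes the checkpoint is hosted as a standard Hugging Face model repo:

```python
# Optional: pre-download the MIDI-3D weights manually.
# The inference scripts are expected to download them automatically otherwise.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="VAST-AI/MIDI-3D",
    local_dir="pretrained_weights/MIDI-3D",
)
```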
python gradio_demo.py
Important!! Please check out our instructional video!
How_to_use_MIDI.mp4
The web demo is also available on Hugging Face Spaces!
If running MIDI from the command line, you first need to obtain a segmentation map of the scene image. We provide a script to run Grounded SAM in scripts/grounding_sam.py. The following example command will save a segmentation map to ./segmentation.png.
python -m scripts.grounding_sam --image assets/example_data/Cartoon-Style/04_rgb.png --labels lamp sofa table dog --output ./
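To sanity-check the result before running MIDI, you can count the segmented instances in the map; the snippet below is a rough helper (not part of the repo) and assumes instances are encoded either as distinct colors or as distinct integer labels:

```python
# Rough check of the segmentation map produced by scripts/grounding_sam.py.
# Assumption: each instance is a distinct color (RGB map) or integer id (single-channel map).
import numpy as np
from PIL import Image

seg = np.array(Image.open("./segmentation.png"))
if seg.ndim == 3:
    # Color-coded map: count unique colors.
    values = np.unique(seg.reshape(-1, seg.shape[-1]), axis=0)
else:
    # Single-channel map: count unique label ids.
    values = np.unique(seg)
print(f"{len(values)} unique values found (instances plus background)")
```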
Then you can run MIDI on the RGB image and segmentation map using our provided inference script scripts/inference_midi.py. The following command will save the generated 3D scene as output.glb in the output directory.
python -m scripts.inference_midi --rgb assets/example_data/Cartoon-Style/00_rgb.png --seg assets/example_data/Cartoon-Style/00_seg.png --output-dir "./"
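To quickly inspect the generated scene without opening a 3D editor, you can load the .glb with trimesh (an extra dependency, not required by MIDI itself); a minimal sketch:

```python
# Load and inspect the generated scene. Requires: pip install trimesh
import trimesh

scene = trimesh.load("output.glb")
if isinstance(scene, trimesh.Scene):
    # A compositional scene typically contains one geometry per generated instance.
    print("instances:", list(scene.geometry.keys()))
    print("scene bounds:\n", scene.bounds)
else:
    print("vertices:", scene.vertices.shape, "faces:", scene.faces.shape)
# scene.show()  # opens an interactive viewer if pyglet is installed
```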
Important!!!
- We recommend using the interactive demo to get a segmentation map of moderate granularity.
- If instances in your image are too close to the image border, please add --do-image-padding to the MIDI running command.
@article{huang2024midi,
title={MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation},
author={Huang, Zehuan and Guo, Yuanchen and An, Xingqiao and Yang, Yunhan and Li, Yangguang and Zou, Zixin and Liang, Ding and Liu, Xihui and Cao, Yanpei and Sheng, Lu},
journal={arXiv preprint arXiv:2412.03558},
year={2024}
}