In recent years, there has been a surge of interest in open-vocabulary 3D scene reconstruction facilitated by visual language models (VLMs), which showcase remarkable capabilities in open-set retrieval. However, existing methods face some limitations: they either focus on learning point-wise features, resulting in blurry semantic understanding, or solely tackle object-level reconstruction, thereby overlooking the intricate details of the object's interior. To address these challenges, we introduce OpenObj, an innovative approach to building open-vocabulary object-level Neural Radiance Fields (NeRF) with fine-grained understanding. In essence, OpenObj establishes a robust framework for efficient and watertight scene modeling and comprehension at the object level. Moreover, we incorporate part-level features into the neural fields, enabling a nuanced representation of object interiors. This approach captures object-level instances while maintaining a fine-grained understanding. The results on multiple datasets demonstrate that OpenObj achieves superior performance in zero-shot semantic segmentation and retrieval tasks. Additionally, OpenObj supports real-world robotics tasks at multiple scales, including global movement and local manipulation.
Use conda to set up the required environment. To avoid dependency issues, we recommend following the instructions below.
conda env create -f environment.yml
Follow the instructions to install the CropFormer model and download the pretrained weights CropFormer_hornet_3x.
Follow the instructions to install the TAP model and download the pretrained weights here.
pip install -U sentence-transformers
Download the pretrained weights:
git clone https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
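As a quick sanity check that the weights were downloaded correctly, the following minimal sketch (the local path and the example captions are just placeholders) loads the cloned model and encodes a few object captions:

```python
from sentence_transformers import SentenceTransformer

# Load the locally cloned all-MiniLM-L6-v2 weights.
# The path assumes you cloned the repository into the current directory.
model = SentenceTransformer("./all-MiniLM-L6-v2")

# Encode a few placeholder object captions into 384-d embeddings.
captions = ["a wooden chair", "a blue sofa", "a potted plant"]
embeddings = model.encode(captions, normalize_embeddings=True)
print(embeddings.shape)  # (3, 384)
```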
git clone https://github.com/BIT-DYN/OpenObj
cd OpenObj
OpenObj has been validated on Replica (the same sequences as used by vMap) and ScanNet. Please download the following datasets.
- Replica Demo - Replica Room 0 only for faster experimentation.
- Replica - All Pre-generated Replica sequences.
- ScanNet - Official ScanNet sequences.
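The commands below use /data/dyn/object/vmap/room_0/imap/00/ as the example sequence root; substitute your own download location. As a hedged sanity check, the small snippet below verifies the sub-folders and files that the later commands reference (inferred from those command lines; your vMap-format download may contain additional files):

```python
import os
from glob import glob

# Example sequence root used throughout this README; substitute your own path.
root = "/data/dyn/object/vmap/room_0/imap/00"

# Items referenced by the commands below.
for name in ["rgb", "depth", "semantic_class", "traj_w_c.txt"]:
    path = os.path.join(root, name)
    print(("ok      " if os.path.exists(path) else "MISSING "), path)

print("rgb frames:", len(glob(os.path.join(root, "rgb", "*.png"))))
```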
Run the following command to identify and comprehend object instances from the color images.
cd maskclustering
python3 mask_gen.py --input /data/dyn/object/vmap/room_0/imap/00/rgb/*.png --input_depth /data/dyn/object/vmap/room_0/imap/00/depth/*.png --output results/room_0/mask/ --opts MODEL.WEIGHTS CropFormer_hornet_3x_03823a.pth
You can see a visualization of the results in the results/vis folder.
Run the following command to ensure consistent object association across frames.
python3 mask_graph.py --config_file ./configs/room_0.yaml --input_mask results/room_0/mask/mask_init_all.pkl --input_depth /data/dyn/object/vmap/room_0/imap/00/depth/*.png --input_pose /data/dyn/object/vmap/room_0/imap/00/traj_w_c.txt --output_graph results/room_0/mask/graph/ --input_rgb /data/dyn/object/vmap/room_0/imap/00/rgb/*.png --output_dir /data/dyn/object/vmap/room_0/imap/00/ --input_semantic /data/dyn/object/vmap/room_0/imap/00/semantic_class/*.png
You can see a visualization of the results in the results/graph folder.
This will also generate some folders (class_our/, instance_our/) and files (object_clipfeat.pkl, object_capfeat.pkl, object_caption.pkl) in the data directory, which are needed for the subsequent steps.
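Conceptually, this step associates per-frame instance masks by lifting them into 3D with the depth images and camera poses and linking masks whose 3D footprints overlap; connected masks are merged into a single object node of the graph. The sketch below only illustrates that idea and is not the actual mask_graph.py implementation (the function names, the voxel-overlap criterion, and the intrinsics handling are all assumptions):

```python
import numpy as np

def backproject(mask, depth, K, T_wc):
    """Lift the pixels of one 2D instance mask into world-frame 3D points.

    mask  : (H, W) boolean instance mask
    depth : (H, W) metric depth image
    K     : (3, 3) camera intrinsics
    T_wc  : (4, 4) camera-to-world pose (one entry of traj_w_c.txt)
    """
    v, u = np.nonzero(mask & (depth > 0))
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=0)  # (4, N) homogeneous
    return (T_wc @ pts_cam)[:3].T                           # (N, 3) world points

def overlap_ratio(pts_a, pts_b, voxel=0.05):
    """Fraction of A's occupied voxels that B also occupies."""
    va = set(map(tuple, np.floor(pts_a / voxel).astype(int)))
    vb = set(map(tuple, np.floor(pts_b / voxel).astype(int)))
    return len(va & vb) / max(len(va), 1)

# Masks from different frames whose 3D footprints overlap strongly are treated
# as observations of the same object and merged into one node of the graph.
```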
Run the following command to distinguish parts and extract their visual features.
cd ../partlevel
python sam_clip_dir.py --input_image /data/dyn/object/vmap/room_0/imap/00/rgb/*.png --output_dir /data/dyn/object/vmap/room_0/imap/00/partlevel --down_sample 5
This will generate a folder (partlevel/) in the data directory, which is needed for the subsequent steps.
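Internally, this step pairs a SAM-style segmenter with CLIP: each part proposal is cropped from the image and encoded into a CLIP feature. The snippet below is only a simplified sketch of that idea (it uses bounding boxes instead of real SAM masks, and the ViT-B/32 backbone is an assumption, not necessarily what sam_clip_dir.py uses):

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def part_features(image_path, boxes):
    """Encode each part proposal (given here as xyxy boxes for brevity;
    OpenObj uses SAM-style masks) into a unit-norm CLIP feature."""
    image = Image.open(image_path).convert("RGB")
    crops = torch.stack([preprocess(image.crop(box)) for box in boxes]).to(device)
    with torch.no_grad():
        feats = model.encode_image(crops)
    return feats / feats.norm(dim=-1, keepdim=True)

# Example: two hypothetical part proposals in one frame.
# feats = part_features("rgb/000000.png", [(10, 20, 120, 200), (150, 40, 300, 260)])
```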
Run the following command to train the NeRFs of all objects in a vectorized manner.
cd ../nerf
python train.py --config ./configs/Replica/room_0.json --logdir results/room_0
This will generate a folder (ckpt/) in the result directory containing the network parameters for all objects.
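Like vMap, OpenObj trains many small per-object NeRF MLPs in parallel by stacking their weights and running a single batched forward pass. The actual implementation lives in train.py; the sketch below only shows the general vectorization pattern with torch.func (the TinyNeRF architecture and all shapes are made-up placeholders):

```python
import torch
from torch import nn
from torch.func import stack_module_state, functional_call

class TinyNeRF(nn.Module):
    """A deliberately tiny per-object field: 3D point -> (density, rgb)."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),
        )

    def forward(self, x):
        return self.net(x)

num_objects, pts_per_object = 8, 1024
models = [TinyNeRF() for _ in range(num_objects)]
params, buffers = stack_module_state(models)     # stack per-object weights along dim 0

base = TinyNeRF().to("meta")                     # stateless template module
def fmodel(p, b, x):
    return functional_call(base, (p, b), (x,))

# One vmapped call evaluates every object's field on its own batch of points.
pts = torch.randn(num_objects, pts_per_object, 3)
out = torch.vmap(fmodel)(params, buffers, pts)   # (num_objects, pts_per_object, 4)
print(out.shape)
```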
Run the following command to generate the visualization files.
cd ../nerf
python gen_map_vis.py --scene_name room_0 --dataset_name Replica
You can interact with the scene using our visualization script.
cd ../nerf
python vis_interaction.py --scene_name room_0 --dataset_name Replica --is_partcolor
Then, in the Open3D visualizer window, you can use the following key callbacks to change the visualization.
- Press C to toggle the ceiling.
- Press S to color the meshes by object class.
- Press R to color the meshes by RGB.
- Press I to color the meshes by object instance ID.
- Press O to color the meshes by part-level feature.
- Press F and type an object text query and a number in the terminal; the meshes will be colored by their similarity to the query.
- Press P and type an object text query, a number, and a part text query in the terminal; the meshes will be colored by their similarity to the query.
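Under the hood, the F and P queries color the mesh by the cosine similarity between a text embedding and the features stored on the mesh. The snippet below is only a generic sketch of this kind of similarity coloring, not the actual vis_interaction.py code (the CLIP backbone, the colormap, and the per-vertex feature layout are assumptions):

```python
import torch, clip
import open3d as o3d
import matplotlib

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def color_by_query(mesh, vertex_feats, query):
    """Color an Open3D mesh by similarity between a text query and per-vertex
    features (vertex_feats: (V, D) unit-norm numpy array)."""
    with torch.no_grad():
        text = model.encode_text(clip.tokenize([query]).to(device))
    text = (text / text.norm(dim=-1, keepdim=True)).float().cpu().numpy()[0]
    sim = vertex_feats @ text                                  # cosine similarity per vertex
    sim = (sim - sim.min()) / (sim.max() - sim.min() + 1e-8)   # normalize to [0, 1]
    colors = matplotlib.colormaps["turbo"](sim)[:, :3]         # similarity heat map -> RGB
    mesh.vertex_colors = o3d.utility.Vector3dVector(colors)
    return mesh
```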
If you find our work helpful, please cite:
@article{openobj,
title={OpenObj: Open-Vocabulary Object-Level Neural Radiance Fields with Fine-Grained Understanding},
author={Deng, Yinan and Wang, Jiahui and Zhao, Jingyu and Dou, Jianyu and Yang, Yi and Yue, Yufeng},
journal={arXiv preprint arXiv:2406.08009},
year={2024}
}
We would like to express our gratitude to the open-source project vMap and its contributors. Their valuable work has greatly contributed to the development of our codebase.