A PyTorch-adapted implementation of the video-to-command model described in the following papers:
"Sequence to Sequence – Video to Text".
"Translating Videos to Commands for Robotic Manipulation with Deep Recurrent Neural Networks".
"V2CNet: A Deep Learning Framework to Translate Videos to Commands for Robotic Manipulation".
"Watch and Act: Learning Robotic Manipulation From Visual Demonstration".
This model was replicated by Thanh Tuan.
- Using CNNs and RNNs alone is not enough to understand the captured actions and the objects being interacted with. Following Watch and Act, a new model with several additional methods was proposed for the video-understanding problem.
- You first create a new Anaconda environment:
  conda create -n c2v python=3.12.4
- Activate the new environment using:
  conda activate c2v
- Install all required libraries with:
  pip install -r requirements.txt
- The video2command model is an Encoder-Decoder neural network that learns to generate a short sentence which can be used to command a robot to perform various manipulation tasks (a minimal sketch appears after this list). The architecture of the network is shown below:
- The left-side branch uses Mask-RCNN to filter out unimportant and irrelevant objects in the scene by subtracting two mask frames (the current frame's mask and the next frame's mask) and then feeding the difference through a CNN to extract a new feature. This feature complements the model by providing a visual change map that focuses solely on the altered objects (a sketch of this branch appears after this list).
- In most practical imitation-testing cases, the robot's vision comes from a fixed camera. Based on this, we can assume that the object the robot has to handle occupies less than 0.55 or 0.6 of the captured image. The robot interacts with the object that has the highest detection confidence score and the highest probability of being a pickable object (see the selection-heuristic sketch after this list).
- A new branch was added to run simultaneously with the translation branch. Its main function is action classification; it is included in the encoder-decoder sketch below.
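The points above can be condensed into a small PyTorch sketch. This is not the repository's actual network, only an illustration with assumed sizes (feat_dim, hidden, vocab_size, num_actions are hypothetical defaults): an LSTM encoder over per-frame CNN features, an LSTM decoder that generates the command word by word, and the auxiliary action-classification head from the last bullet.

```python
import torch
import torch.nn as nn

class Video2Command(nn.Module):
    """Illustrative sketch, not the repository's exact architecture."""

    def __init__(self, feat_dim=2048, hidden=512, vocab_size=5000, num_actions=46):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)   # video encoder
        self.embed = nn.Embedding(vocab_size, hidden)                 # word embeddings
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)      # command decoder
        self.word_out = nn.Linear(hidden, vocab_size)                 # translation branch
        self.action_out = nn.Linear(hidden, num_actions)              # classification branch

    def forward(self, feats, captions):
        # feats:    (B, T_frames, feat_dim) pre-extracted CNN features
        # captions: (B, T_words) token ids of the target command
        _, (h, c) = self.encoder(feats)          # summarize the whole clip
        action_logits = self.action_out(h[-1])   # one action class per clip
        emb = self.embed(captions)               # (B, T_words, hidden)
        dec_out, _ = self.decoder(emb, (h, c))   # decode conditioned on the clip
        word_logits = self.word_out(dec_out)     # (B, T_words, vocab_size)
        return word_logits, action_logits
```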
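A rough sketch of the mask-difference branch, assuming the Mask-RCNN detections have already been rendered into single-channel binary masks (mask_t and mask_next, as float tensors) and using an arbitrary small CNN rather than the repository's exact layers:

```python
import torch
import torch.nn as nn

class ChangeMapEncoder(nn.Module):
    """Subtract two mask frames and encode the difference with a small CNN."""

    def __init__(self, out_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, out_dim)

    def forward(self, mask_t, mask_next):
        # mask_t, mask_next: (B, 1, H, W) binary maps of detected objects
        change = (mask_next - mask_t).abs()   # keep only regions that changed
        x = self.cnn(change).flatten(1)       # (B, 32)
        return self.fc(x)                     # visual change feature
```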
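The fixed-camera selection rule can be written as a short filter over the detector output. The field names (boxes, scores, pickable_probs) and the default threshold are assumptions for illustration, not the repository's exact logic:

```python
def select_target(boxes, scores, pickable_probs, img_w, img_h, max_ratio=0.6):
    """Pick the detection the robot should interact with.

    boxes:          list of (x1, y1, x2, y2) boxes from the detector
    scores:         detection confidence per box
    pickable_probs: probability that each object is pickable (assumed given)
    max_ratio:      boxes larger than ~0.55-0.6 of the image are skipped
    """
    best_box, best_value = None, -1.0
    for box, score, p_pick in zip(boxes, scores, pickable_probs):
        x1, y1, x2, y2 = box
        if max((x2 - x1) / img_w, (y2 - y1) / img_h) > max_ratio:
            continue                   # too large: likely table or background
        value = score * p_pick         # rank by confidence and pickability
        if value > best_value:
            best_box, best_value = box, value
    return best_box
```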
To repeat the video2command experiment:
- Clone the repository.
- Download the IIT-V2C dataset, extract it, and set up the directory path as datasets/IIT-V2C.
- For CNN features, two options are provided:
  - Use the pre-extracted ResNet50 features provided by the original author.
  - Perform feature extraction yourself. First run avi2frames.py under the folder experiments/experiment_IIT-V2C to convert all videos into images. Download the *.pth weights for ResNet50 converted from Caffe, then run extract_features.py under the same folder (minimal sketches of both steps appear below).
- Download the " " for the Mask-RCNN pretrained checkpoint.
- Note that the author's pre-extracted features seem to have better quality and lead to roughly 1~2% higher metric scores.
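For reference, the first step (video to frames) boils down to something like the following OpenCV loop; this is only a sketch, not the actual avi2frames.py:

```python
import os
import cv2

def video_to_frames(video_path, out_dir):
    """Dump every frame of a clip as a numbered image file."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break                      # end of video
        cv2.imwrite(os.path.join(out_dir, f"frame_{idx:06d}.png"), frame)
        idx += 1
    cap.release()
```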
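The second step extracts one 2048-d ResNet50 feature per frame. The repository uses ResNet50 weights converted from Caffe; the sketch below simply loads torchvision's ImageNet weights for illustration:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# ResNet50 truncated after global average pooling -> 2048-d feature per frame.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_feature(frame_path):
    img = preprocess(Image.open(frame_path).convert("RGB")).unsqueeze(0)
    return backbone(img).flatten(1)    # shape (1, 2048)
```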
- To begin training, run train_iit-v2c.py.
- NOTE: You need more than 100GB of free space for this process if you choose to train from scratch with the IIT-V2C dataset.
- For evaluation, first run evaluate_iit-v2c.py to generate predictions from all saved checkpoints. Then run cocoeval_iit-v2c.py to calculate scores for the predictions.
- In case Java is not installed in your environment, open your terminal and run the following lines:
  - For Ubuntu/Debian:
    sudo apt update
    sudo apt install default-jre
  - For macOS:
    brew update
    brew install openjdk
  - If Java is installed but not in your PATH, you can add it. On Linux or macOS, you can add the following lines to your ~/.bashrc or ~/.zshrc file:
    export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java))))
    export PATH=$JAVA_HOME/bin:$PATH
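Java is required because the METEOR scorer in the coco-caption toolkit (pycocoevalcap) launches a Java process. Purely as an illustration of that scoring API, with toy gts/res dictionaries (the exact wiring inside cocoeval_iit-v2c.py may differ):

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# clip id -> list of reference / predicted commands (toy example data)
gts = {"clip_0": ["right hand pick up the cup"]}
res = {"clip_0": ["right hand picks the cup"]}

scorers = [
    (Bleu(4), ["Bleu_1", "Bleu_2", "Bleu_3", "Bleu_4"]),
    (Meteor(), "METEOR"),   # this scorer is the one that needs Java
    (Rouge(), "ROUGE_L"),
    (Cider(), "CIDEr"),
]
for scorer, name in scorers:
    score, _ = scorer.compute_score(gts, res)
    print(name, score)
```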
- For practical testing, go to the PracticalTesting folder and run video->frame.py, video extraction.py, and evaluation.py step by step; don't forget to change your directory paths.
If you have any questions or comments, please send an email to [email protected]