Real-world Image Classification with Distributed PyTorch Training
A robust foundation model for visual inference and classification tasks on real-world images. The model combines a ResNet50 backbone with Bayesian inference for uncertainty estimation, distributed training over PyTorch's NCCL backend, and experiment logging through Weights & Biases.
- Distributed Training: Multi-GPU training across devices using PyTorch's NCCL backend.
- Bayesian Inference: Real-time uncertainty estimation for robust model outputs.
- Efficient Data Handling: Supports large-scale real-world image datasets.
- Automatic Logging: Logs training metrics and model checkpoints with Weights & Biases.
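The distributed-training feature can be sketched roughly as follows. `setup_distributed` is a hypothetical helper (not the repository's actual code), and the `gloo` fallback is only so the snippet also runs on CPU-only machines; the project itself targets NCCL on GPUs:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_distributed() -> int:
    """Initialize the default process group and return this process's rank."""
    # torchrun sets these variables; default to a single-process group otherwise.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")
    # NCCL is used for multi-GPU training; gloo is a CPU-only fallback.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)
    return dist.get_rank()

if __name__ == "__main__":
    rank = setup_distributed()
    model = torch.nn.Linear(8, 2)  # stand-in for the real network
    ddp_model = DDP(model)         # gradients are synchronized across processes
    print(f"rank {rank} of {dist.get_world_size()} initialized")
    dist.destroy_process_group()
```

Under `torchrun`, one such process is launched per GPU and DDP averages gradients across them during the backward pass.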
- Python 3.8+
- CUDA 11.0+ for GPU-based training
- PyTorch >= 1.10
- NCCL backend for distributed training
- Weights & Biases for logging metrics
Install dependencies with:
pip install -r requirements.txt
1. Clone the repository:
   `git clone https://github.com/gaga1313/vfm.git`
   `cd visual-foundation-model`
2. Set up your dataset: place your dataset under `data/dataset/`. The dataset should contain images and corresponding labels.
3. Environment setup: create an `.env` file with your Weights & Biases credentials if you’d like to enable logging.
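An `.env` file for logging might look like the fragment below; `WANDB_API_KEY`, `WANDB_PROJECT`, and `WANDB_ENTITY` are environment variables Weights & Biases reads, and the values shown are placeholders to replace with your own:

```
WANDB_API_KEY=<your_api_key>
WANDB_PROJECT=visual-foundation-model
WANDB_ENTITY=<your_team_or_username>
```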
1. Initialize Distributed Training
Ensure your GPUs are correctly set up and launch training with:
python -m torch.distributed.run --nproc_per_node=<num_gpus> train.py
Our model uses the ResNet50 architecture, customized with:
- Bayesian Inference Layer: Enables uncertainty estimation for each prediction.
- Metric Logger: Logs losses, accuracies, and Bayesian confidence.
- Multi-Process DataLoader: Optimized for distributed data loading in multi-GPU environments.
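The Bayesian inference layer is described only at a high level above, so the sketch below shows one common way such per-prediction uncertainty is produced, Monte Carlo dropout. The helper name `mc_dropout_predict` and the tiny stand-in network are illustrative, not the repository's actual code:

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model: nn.Module, x: torch.Tensor, n_samples: int = 20):
    """Run several stochastic forward passes; return mean prediction and spread."""
    model.train()  # keep dropout active at inference time
    with torch.no_grad():
        samples = torch.stack([model(x).softmax(dim=-1) for _ in range(n_samples)])
    mean = samples.mean(dim=0)        # averaged class probabilities
    uncertainty = samples.std(dim=0)  # per-class spread across passes
    return mean, uncertainty

# Tiny stand-in for a ResNet50 head with dropout.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(32, 10))
x = torch.randn(4, 16)
mean, unc = mc_dropout_predict(model, x)
print(mean.shape, unc.shape)  # torch.Size([4, 10]) torch.Size([4, 10])
```

High standard deviation across the stochastic passes flags inputs the model is uncertain about.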
The training loop is distributed across available GPUs using DistributedSampler. Metrics are logged for every epoch via Weights & Biases.
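A minimal sketch of how `DistributedSampler` splits an epoch across processes. `num_replicas` and `rank` are passed explicitly here so the snippet runs without an initialized process group; in the real training loop they come from the distributed environment:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# 100 random samples stand in for the real image dataset.
dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 10, (100,)))

# Each of the two simulated processes sees a disjoint half of the data.
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True)
loader = DataLoader(dataset, batch_size=10, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)  # reshuffle differently each epoch
    for images, labels in loader:
        pass  # forward/backward pass and metric logging would go here

print(len(sampler))  # 50: this replica's share of the 100 samples
```

Calling `set_epoch` each epoch is what makes the shuffle order differ between epochs while staying consistent across replicas.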
Run the training script:
torchrun --nproc_per_node=<num_gpus> main.py --mode train
Flags:
- `--nproc_per_node`: Number of GPUs
- `--batch_size`: Batch size for each process
- `--epochs`: Total epochs
- `--log_interval`: Logging frequency
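The script-level flags above could be wired up with `argparse` roughly as below. The defaults are illustrative, and `--nproc_per_node` is consumed by `torchrun` itself, so it is not parsed by the script:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Distributed training entry point")
    parser.add_argument("--mode", choices=["train", "test"], default="train")
    parser.add_argument("--batch_size", type=int, default=32,
                        help="batch size for each process")
    parser.add_argument("--epochs", type=int, default=10, help="total epochs")
    parser.add_argument("--log_interval", type=int, default=100,
                        help="logging frequency in steps")
    return parser

args = build_parser().parse_args(["--mode", "train", "--batch_size", "64"])
print(args.batch_size, args.epochs)  # 64 10
```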
Evaluate the trained model on a separate test dataset. To run the test evaluation, use:
torchrun --nproc_per_node=<num_gpus> main.py --mode test
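Internally, test-mode evaluation follows the usual PyTorch pattern sketched below; the tiny linear model and random tensors stand in for the real checkpoint and test set:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model: nn.Module, loader: DataLoader) -> float:
    """Return top-1 accuracy over the loader."""
    model.eval()  # disable dropout and batch-norm updates
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, labels in loader:
            preds = model(inputs).argmax(dim=-1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total

# Stand-ins for the trained model and held-out test split.
model = nn.Linear(16, 10)
data = TensorDataset(torch.randn(64, 16), torch.randint(0, 10, (64,)))
acc = evaluate(model, DataLoader(data, batch_size=16))
print(f"accuracy: {acc:.3f}")
```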
Sample results and logged metrics can be found in the `results/` directory. Metrics such as accuracy, loss, and Bayesian confidence are logged for detailed analysis.
Example Weights & Biases Dashboard (replace `<your_wandb_project_link>`):
This project is licensed under the MIT License - see the LICENSE file for details.
Feel free to submit issues or pull requests. For major changes, please open an issue first to discuss what you would like to change.
Happy Training! 💪