DTU Machine Learning Operations
- Lukas Korinek (s246710)
- Frederik Sartov Olsen (s204118)
- Konstantinos-Athanasios Papagoras PhD (s230068)
- Yessin Moakher (s250283)
- Stiina Salumets (s250088)
The primary goal of this project is to classify pneumonia using chest X-ray images.
We will use the PyTorch Image Models (timm) framework for this project.
This project classifies chest X-ray images into two categories, Pneumonia and Normal, using the Kaggle dataset "Chest X-Ray Images (Pneumonia)". The dataset contains 5,863 pediatric chest X-rays from Guangzhou Women and Children’s Medical Center, all captured as part of routine clinical care and graded by expert physicians. In the initial stages, we will use a subset of the dataset to verify that the pipeline runs end to end before scaling up to the full dataset. The dataset is already split into training, validation, and testing folders, so it can be used without a custom split. We chose it because it is simple, well suited to beginner-level image classification in healthcare, and feasible to implement within a short timeframe.
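Working on a subset first could look like the following minimal sketch. It assumes each split folder contains one subfolder per class (for example NORMAL/ and PNEUMONIA/) holding .jpeg files; the function name `sample_subset`, the seed, and the file extension are illustrative assumptions, not the project's actual preprocessing code.

```python
import random
from pathlib import Path


def sample_subset(split_dir: Path, percentage: float, seed: int = 42) -> dict[str, list[Path]]:
    """Pick a reproducible fraction of image files from each class folder."""
    if not 0.0 < percentage <= 1.0:
        raise ValueError("percentage must be in the interval (0, 1]")

    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    subset: dict[str, list[Path]] = {}
    for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
        files = sorted(class_dir.glob("*.jpeg"))  # assumed extension
        # Keep at least one file per non-empty class so no label disappears.
        keep = max(1, round(len(files) * percentage)) if files else 0
        subset[class_dir.name] = sorted(rng.sample(files, keep))
    return subset
```

Sampling per class keeps the class ratio of the subset close to that of the full split, which matters when the two classes are not equally represented.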
For this project, we will begin with a baseline model using a simple convolutional neural network (CNN) to establish a reference performance. We will then leverage pre-trained models from PyTorch’s image models framework, such as ResNet50, VGG16, and DenseNet, to improve classification accuracy. These models will be fine-tuned for our specific task by adapting the final layers to classify X-ray images into two categories: Pneumonia and Normal. We will use torchvision for accessing pre-trained models and data augmentation, and torch.optim for optimization, evaluating the models based on accuracy, precision, recall, and F1-score.
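The four evaluation metrics named above can be computed from the confusion counts, as in this minimal sketch. It assumes labels are encoded as 1 for Pneumonia (the positive class) and 0 for Normal; the function name `binary_metrics` and that encoding are illustrative assumptions, not part of the project code.

```python
def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Compute accuracy, precision, recall and F1 with label 1 as the positive class."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # pneumonia correctly detected
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # normal correctly detected
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # normal flagged as pneumonia
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # pneumonia missed

    # Fall back to 0.0 when a denominator is zero (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

In a screening setting, recall on the Pneumonia class is usually the number to watch, since a false negative is a missed case. A library implementation such as scikit-learn's `precision_recall_fscore_support` would normally replace this hand-written version.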
The tasks.py file contains the project's automation tasks. It uses the invoke library.
- Install invoke: pip install invoke
- Create a .env file with the required environment variables (e.g., WANDB_API_KEY).
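A script that depends on the .env file could check it up front, as in this minimal sketch. It assumes the file holds plain KEY=VALUE lines; the helper names `load_env_file` and `require_env` are hypothetical and do not come from tasks.py. The python-dotenv package offers a fuller implementation of the same idea.

```python
import os
from pathlib import Path


def load_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines from a .env file, skipping blanks and comments."""
    values: dict[str, str] = {}
    for raw_line in path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")  # drop optional quotes
    return values


def require_env(names: list[str], env_file: Path = Path(".env")) -> None:
    """Export values from the .env file and fail early if a required name is missing."""
    if env_file.exists():
        for key, value in load_env_file(env_file).items():
            # Variables already set in the real environment take precedence.
            os.environ.setdefault(key, value)
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
```

Calling `require_env(["WANDB_API_KEY"])` at the start of a run surfaces a missing key immediately instead of partway through training.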
- Create Environment
  invoke create-environment
  Creates a new Conda environment for the project. Activate it before installing requirements.
- Install Requirements
  invoke requirements
  Installs the project dependencies from requirements.txt and the local pip configuration.
- Install Development Requirements
  invoke dev-requirements
  Installs development dependencies.
- Preprocess Data
  invoke preprocess-data --percentage=<float>
  Preprocesses raw data and stores it in the processed directory. Use the --percentage argument to specify the fraction of data to process (default is 1.0).
- Train Model
  invoke train
  Executes the model training script.
- Run Tests
  invoke test
  Runs tests using pytest and generates a coverage report.
- Test Coverage Report
  invoke test-coverage
  Runs tests and displays a detailed coverage report.
- Build Docker Image
  invoke docker-build --progress=<plain|auto>
  Builds the Docker image for the project using the specified Dockerfile.
- Run Docker Training
  invoke docker-train
  Runs the training process in a Docker container. Requires WANDB_API_KEY in the environment.
- Run W&B Sweep
  invoke wandb-sweep --config-path=<path>
  Creates a Weights & Biases sweep from the specified config file and programmatically starts the agent. The default configuration path is configs/sweep.yaml. You can specify a different path using the --config-path argument if needed.
- Format Code with Ruff
  invoke ruff-format
  Formats the project files using ruff.
- Build Documentation
  invoke build-docs
  Builds the project documentation using mkdocs.
- Serve Documentation
  invoke serve-docs
  Serves the project documentation locally for preview.
- Ensure all required tools and dependencies are properly installed before running tasks.
- For additional details, refer to the tasks.py source code.
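The tasks listed above can be chained into a single run, as in this minimal sketch. The order shown (requirements, preprocessing, training, tests) is an assumption about a typical workflow, and the helper names `build_workflow` and `run_workflow` are hypothetical; only the invoke commands themselves come from the task list. Environment creation is left out because the new environment has to be activated manually first.

```python
import shlex
import subprocess


def build_workflow(percentage: float = 1.0) -> list[list[str]]:
    """Return the invoke commands for a setup-to-test run, in an assumed order."""
    if not 0.0 < percentage <= 1.0:
        raise ValueError("percentage must be in the interval (0, 1]")
    return [
        ["invoke", "requirements"],
        ["invoke", "preprocess-data", f"--percentage={percentage}"],
        ["invoke", "train"],
        ["invoke", "test"],
    ]


def run_workflow(commands: list[list[str]], dry_run: bool = True) -> list[str]:
    """Print each command; execute it only when dry_run is False."""
    printed: list[str] = []
    for command in commands:
        line = shlex.join(command)
        printed.append(line)
        print(line)
        if not dry_run:
            # check=True stops the run at the first failing task.
            subprocess.run(command, check=True)
    return printed
```

Running `run_workflow(build_workflow(0.1))` only prints the commands; passing `dry_run=False` executes them and requires invoke and the project dependencies to be installed.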
The directory structure of the project looks like this:
├── .dvc/                     # Data version control
├── .github/                  # GitHub actions and dependabot
│   ├── dependabot.yaml
│   └── workflows/
│       └── tests.yaml
├── configs/                  # Configuration files
├── data/                     # Data directory
│   ├── processed
│   └── raw
├── dockerfiles/              # Dockerfiles
├── docs/                     # Documentation
├── models/                   # Trained models
├── notebooks/                # Jupyter notebooks
├── reports/                  # Reports
│   └── figures/
├── src/                      # Source code
│   └── mlops_project/
├── tests/                    # Tests
├── .dvcignore
├── .gitignore
├── .pre-commit-config.yaml
├── LICENSE
├── pyproject.toml            # Python project file
├── README.md                 # Project README
├── requirements.txt          # Project requirements
├── requirements_dev.txt      # Development requirements
└── tasks.py                  # Project tasks