Anomaly Detection using Evolutionary Learning and Ensemble Scoring

Objective

This project applies evolutionary learning to fraud detection on the IEEE-CIS Fraud Detection dataset. Using DEAP, it creates and optimizes a diverse population of classifiers to detect fraudulent transactions. The goal is an ensemble scoring system that combines predictions from multiple classifiers for robust, adaptive fraud detection.

Key Features & Benefits

Diverse Classifier Population: Maintains a heterogeneous mix of classifiers (e.g., Random Forest, SVM, XGBoost) through mutation-based evolutionary optimization.
Ensemble Scoring: Combines predictions using weighted voting or confidence aggregation, dynamically optimized via a secondary DEAP process.
Adaptive Optimization: Continuously evolves hyperparameters and classifier weights to improve detection accuracy.
Real-World Applicability: Designed for large-scale, production-ready fraud detection pipelines.
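
As an illustration of the mutation-based diversity described above, here is a minimal, dependency-free sketch. The search space, hyperparameter ranges, and helper names (`SEARCH_SPACE`, `random_individual`, `mutate`) are hypothetical and not taken from this repository; in a DEAP setup, a function like `mutate` would typically be registered as the mutation operator on a toolbox.

```python
import random

# Hypothetical search space: the classifier families named in this README,
# each with illustrative (assumed) hyperparameter ranges.
SEARCH_SPACE = {
    "random_forest": {"n_estimators": (50, 500), "max_depth": (2, 20)},
    "svm": {"C": (0.01, 100.0), "gamma": (1e-4, 1.0)},
    "xgboost": {"n_estimators": (50, 500), "learning_rate": (0.01, 0.3)},
}


def sample_value(low, high, rng):
    """Draw an int for integer ranges and a float otherwise."""
    if isinstance(low, int) and isinstance(high, int):
        return rng.randint(low, high)
    return rng.uniform(low, high)


def random_individual(rng):
    """Create one candidate: a classifier family plus sampled hyperparameters."""
    family = rng.choice(sorted(SEARCH_SPACE))
    params = {
        name: sample_value(low, high, rng)
        for name, (low, high) in SEARCH_SPACE[family].items()
    }
    return {"family": family, "params": params}


def mutate(individual, rng, replace_prob=0.2):
    """Return a mutated copy of an individual.

    With probability `replace_prob` the individual is replaced by a fresh
    random one (which may change its family); otherwise a single
    hyperparameter is resampled within its range.
    """
    if rng.random() < replace_prob:
        return random_individual(rng)
    child = {"family": individual["family"], "params": dict(individual["params"])}
    name = rng.choice(sorted(child["params"]))
    low, high = SEARCH_SPACE[child["family"]][name]
    child["params"][name] = sample_value(low, high, rng)
    return child


rng = random.Random(0)
population = [random_individual(rng) for _ in range(12)]
offspring = [mutate(individual, rng) for individual in population]
```

Allowing a mutation to replace the whole individual is one simple way to keep several classifier families in play, so the population does not collapse onto a single model type.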

Implementation

Tech Stack

TensorFlow Extended (TFX): Defines and manages the ML pipeline, covering data ingestion, preprocessing, training, and evaluation.
Airflow: Orchestrates the workflow and schedules pipeline tasks.
MySQL: Stores pipeline metadata and task state.
Docker Compose: Runs the Airflow and MySQL services as containers, giving each a consistent, isolated environment.
DEAP: Performs evolutionary optimization for hyperparameters, classifier selection, and ensemble scoring.
Dataset: IEEE-CIS Fraud Detection data, containing transaction and identity features with fraud labels.

Pipeline Overview

Data Ingestion: Load transaction and identity data using TFX ExampleGen.
Preprocessing: Apply feature engineering and data transformations using the TFX Transform component.
Classifier Optimization:
- Use DEAP to optimize a population of classifiers, maintaining diversity through mutation.
- Evaluate individual classifiers based on fraud detection performance.
Ensemble Scoring:
- Aggregate predictions from classifiers to produce fraud scores.
- Use DEAP again to optimize weights for combining classifiers.
Model Training: Train the best-performing ensemble and evaluate its fraud detection performance.
Deployment: Export the trained model for deployment in production.

Why This Approach?

Fraud detection demands a system capable of adapting to evolving patterns and edge cases. By leveraging evolutionary optimization with DEAP, this project aims to ensure:

Robustness: Diverse classifiers mitigate the risk of overfitting to specific fraud patterns.
Accuracy: Optimized ensemble scoring improves fraud detection metrics such as F1-score and AUC.
Scalability: The pipeline supports large datasets and adapts to changing fraud behavior.
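
For reference, the two metrics named above can be computed from labels and ensemble scores as in the following dependency-free sketch. The function names, the 0.5 decision threshold, and the sample data are assumptions for illustration; a production pipeline would more likely rely on an established metrics library.

```python
def f1_score(labels, predictions):
    """Harmonic mean of precision and recall for the positive (fraud) class."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(labels, scores):
    """AUC as the chance that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Illustrative data only: six transactions with ensemble fraud scores.
labels = [1, 0, 0, 1, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.4]
predictions = [1 if s >= 0.5 else 0 for s in scores]

f1 = f1_score(labels, predictions)
auc = roc_auc(labels, scores)
```

On this toy data the F1-score is 2/3 (two true positives, one false positive, one false negative) and the AUC is 8/9 (eight of the nine positive-negative pairs are ranked correctly). F1 depends on the chosen threshold, while AUC depends only on how the scores rank transactions.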

"Card denied. Would you like to try another one?"

hutchisonkim/tfx_deap_anomalies
