This repository presents a comprehensive framework for fraud detection in financial transactions using both supervised and semi-supervised learning techniques. The dataset contains transaction details with labels indicating fraudulent or legitimate transactions.

- Exploratory Data Analysis (EDA):
  - In-depth analysis of numerical and categorical features.
  - Temporal and categorical trend visualizations.
  - Identification of data imbalances and patterns.
- Supervised Learning:
  - Tabular models using engineered features.
  - Oversampling techniques (SMOTE) for handling class imbalance.
  - Comparison of models with and without feature engineering.
- Semi-Supervised Learning:
  - Graph-based models (Graph Attention Network (GAT) and GraphSAGE).
  - Hybrid loss combining classification and graph reconstruction objectives.
  - Alternative approaches: Autoencoders, Isolation Forest, Local Outlier Factor, and Gaussian Mixture Models.
- Visualization:
  - Temporal analysis of fraudulent transactions.
  - Network graph visualizations showing source-destination-agent relationships.
  - Distribution plots for transaction amounts and fraud probabilities.

The dataset consists of financial transaction records with the following columns:

- `date`: Transaction timestamp in milliseconds.
- `user`: Unique user identifier.
- `source_prefix`, `source_postfix`: Attributes related to the transaction's source.
- `dest_prefix`, `dest_postfix`: Attributes related to the transaction's destination.
- `agent`: Categorical variable indicating the handler of the transaction.
- `amount`: Transaction value in the smallest monetary unit.
- `status`: Indicates the transaction's result (e.g., success or fail).
- `label`: Binary variable indicating fraud (`1`) or non-fraud (`0`).

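
The column conventions above can be decoded along these lines. This is only a sketch: it assumes `date` counts milliseconds since the Unix epoch and that the currency has 100 minor units per major unit, neither of which is stated in the dataset description. The helper names `parse_date_ms` and `to_major_units` are hypothetical.

```python
from datetime import datetime, timezone


def parse_date_ms(date_ms: int) -> datetime:
    """Convert a millisecond timestamp to a timezone-aware UTC datetime.

    Assumption: `date` is milliseconds since the Unix epoch.
    """
    return datetime.fromtimestamp(date_ms / 1000, tz=timezone.utc)


def to_major_units(amount: int, minor_per_major: int = 100) -> float:
    """Convert an amount in the smallest monetary unit to major units.

    Assumption: 100 minor units per major unit; adjust for the real currency.
    """
    return amount / minor_per_major


# Made-up values for illustration only.
timestamp = parse_date_ms(1_700_000_000_000)
print(timestamp.isoformat(), timestamp.hour, to_major_units(12345))
```

The hour of day extracted this way is the kind of input a temporal feature would need, although the time zone of the original records is unknown.
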
- 106,036 rows, 10 columns.
- No missing values.
- Highly imbalanced, with ~3.5% fraudulent transactions.
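
To illustrate why this imbalance matters, the sketch below uses only the figures quoted above (106,036 rows, roughly 3.5% fraud). The fraud count is therefore an estimate, and the inverse-frequency class weights are a common heuristic shown for illustration, not necessarily what this repository uses.

```python
# Figures from the dataset summary; the fraud share is only given as "~3.5%",
# so the derived counts are approximate.
n_rows = 106_036
fraud_share = 0.035
n_fraud = round(n_rows * fraud_share)
n_legit = n_rows - n_fraud

# A trivial baseline that labels every transaction as legitimate.
baseline_accuracy = n_legit / n_rows  # looks strong
baseline_recall = 0 / n_fraud         # but it catches no fraud at all

# Inverse-frequency class weights: n_samples / (n_classes * class_count).
weight_fraud = n_rows / (2 * n_fraud)
weight_legit = n_rows / (2 * n_legit)

print(f"estimated fraud rows: {n_fraud}")
print(f"all-legit baseline: accuracy={baseline_accuracy:.3f}, recall={baseline_recall:.1f}")
print(f"class weights: fraud={weight_fraud:.2f}, legit={weight_legit:.2f}")
```

Because the baseline reaches about 96.5% accuracy while detecting nothing, precision, recall, and F1 are more informative than accuracy on this data.
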
- Insights into data distribution, trends, and relationships.
- Fraud vs. non-fraud analysis for key features like `amount`, `agent`, and `status`.
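
A fraud vs. non-fraud comparison for a categorical column such as `agent` or `status` can be sketched as a per-category fraud rate. The function name `fraud_rate_by` and the sample rows are made up for illustration; only the column names come from the dataset description.

```python
from collections import defaultdict


def fraud_rate_by(records, key):
    """Return {category: (n_transactions, fraud_rate)} for one column."""
    counts = defaultdict(lambda: [0, 0])  # category -> [total, fraud]
    for row in records:
        bucket = counts[row[key]]
        bucket[0] += 1
        bucket[1] += row["label"]
    return {cat: (total, fraud / total) for cat, (total, fraud) in counts.items()}


# Made-up rows; real values come from the dataset.
sample = [
    {"agent": "A", "status": "success", "label": 0},
    {"agent": "A", "status": "fail", "label": 1},
    {"agent": "B", "status": "success", "label": 0},
    {"agent": "B", "status": "success", "label": 0},
]

rates_by_agent = fraud_rate_by(sample, "agent")
rates_by_status = fraud_rate_by(sample, "status")
print(rates_by_agent)
print(rates_by_status)
```

Reporting the transaction count next to each rate helps avoid over-reading categories that contain only a handful of rows.
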
- Models: Tabular Transformers, Neural Networks.
- Feature Engineering:
  - Transaction-specific features: `is_high_risk_pair`, `fraud_rate`.
  - Temporal features: `is_night_high_risk`.
- Oversampling: SMOTE for handling class imbalance.
- Evaluation Metrics: Precision, Recall, F1-Score.
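
The oversampling step can be illustrated with a simplified, dependency-free sketch of the core SMOTE idea: each synthetic point is an interpolation between a minority sample and one of its nearest minority neighbours. This is not the repository's implementation (which is not shown here), and the function name `smote_sketch` and its parameters are hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by neighbour interpolation."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # k nearest neighbours of x among the other minority samples.
        neighbours = sorted(
            (p for j, p in enumerate(minority) if j != i),
            key=lambda p: math.dist(x, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


# Made-up numeric feature vectors for three fraudulent transactions.
fraud_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_sketch(fraud_points, n_new=10, k=2)
print(len(new_points), new_points[0])
```

Oversampling of this kind is normally applied to the training split only, so that validation and test metrics reflect the original class balance. Interpolation is also only meaningful for numeric features, not for raw categorical codes.
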
- Graph-Based Models:
  - Graph Attention Networks (GAT).
  - GraphSAGE.
  - Hybrid loss combining node classification and reconstruction.
  - Note: Only 15% of the data is labeled for training.
- Other Methods:
  - Autoencoders.
  - Isolation Forest.
  - Local Outlier Factor.
  - Gaussian Mixture Models.
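
One way to read the hybrid loss is as a masked classification term plus a weighted reconstruction term. The sketch below rests on several assumptions that the description above does not confirm: the graph is undirected, the reconstruction target is the adjacency matrix (scored by the sigmoid of embedding dot products, as in graph autoencoders), and the weight `lam` is a hypothetical hyperparameter. The logits and embeddings stand in for the outputs of a GAT or GraphSAGE encoder.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce(p, y, eps=1e-12):
    """Binary cross-entropy for one predicted probability p and target y."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def hybrid_loss(logits, embeddings, labels, labeled_mask, edges, lam=0.5):
    """Return (total, classification, reconstruction) losses."""
    n = len(logits)
    labeled = [i for i in range(n) if labeled_mask[i]]
    if not labeled:
        raise ValueError("at least one labeled node is required")

    # Supervised term: only nodes whose label is visible contribute.
    cls = sum(bce(sigmoid(logits[i]), labels[i]) for i in labeled) / len(labeled)

    # Unsupervised term: reconstruct edges from embedding dot products,
    # averaged over all unordered node pairs.
    edge_set = {frozenset(e) for e in edges}
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    rec = sum(
        bce(
            sigmoid(sum(a * b for a, b in zip(embeddings[i], embeddings[j]))),
            1.0 if frozenset((i, j)) in edge_set else 0.0,
        )
        for i, j in pairs
    ) / len(pairs)

    return cls + lam * rec, cls, rec


# Made-up toy graph: four nodes, two edges, labels visible for two nodes.
logits = [2.0, -1.0, 0.5, -2.0]
embeddings = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.2], [-0.8, 0.0]]
labels = [1, 0, 0, 0]
labeled_mask = [True, False, False, True]
edges = [(0, 1), (2, 3)]

total, cls_loss, rec_loss = hybrid_loss(logits, embeddings, labels, labeled_mask, edges)
print(f"total={total:.4f} classification={cls_loss:.4f} reconstruction={rec_loss:.4f}")
```

The mask is what makes the setup semi-supervised: unlabeled nodes never enter the classification term, but every node still shapes the embeddings through the reconstruction term.
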
- Temporal distribution of fraud cases.
- Network graphs of fraudulent transactions.
- Feature importance and correlation heatmaps.

| Method | Precision | Recall | F1-Score |
|---|---|---|---|
| Supervised (Engineered) | 0.87 | 0.84 | 0.85 |
| Supervised (SMOTE) | 0.79 | 0.91 | 0.85 |
| GAT (15% labeled data) | 0.82 | 0.88 | 0.85 |
| GraphSAGE (15% labeled data) | 0.80 | 0.85 | 0.82 |

Note: Results may vary based on parameter tuning and dataset changes.
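
As a quick consistency check, the F1-Score column can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The numbers below are copied from the results; the helper name `f1_from` is hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as listed in the results.
reported = {
    "Supervised (Engineered)": (0.87, 0.84, 0.85),
    "Supervised (SMOTE)": (0.79, 0.91, 0.85),
    "GAT (15% labeled data)": (0.82, 0.88, 0.85),
    "GraphSAGE (15% labeled data)": (0.80, 0.85, 0.82),
}

for method, (p, r, f1) in reported.items():
    print(f"{method}: computed F1 = {f1_from(p, r):.3f}, reported = {f1:.2f}")
```

Each recomputed value agrees with the reported one to two decimal places. The rows also show the usual trade-off: SMOTE raises recall at the cost of precision, while the F1-Score stays about the same.
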
This project is licensed under the MIT License. See the LICENSE file for details.
- Dataset Source: Provided during an interview task.
- Referenced Papers: