Data Science Course

Multiple projects covering the entire data science lifecycle, including web scraping, data cleaning, preprocessing, exploratory data analysis (EDA), data visualization, and clustering, classification, and regression models built with various machine learning methods.

CA0: Data Scraping

Overview

This project focuses on web scraping and introductory data analysis using Ethereum blockchain transaction data from Etherscan.io.

Tasks

  1. Data Collection

    • Using web scraping techniques, we collect transaction data from Etherscan, focusing on transactions from the last 10 blocks.
  2. Data Analysis

    • Load the Data: Import the transaction data into a pandas DataFrame.
    • Data Cleaning: Clean the data by converting data types, removing irrelevant information, and handling duplicates.
    • Statistical Analysis: Calculate the mean and standard deviation of the population. Plot histograms, normal distribution plots, box plots, and violin plots for transaction values and fees.
    • Visualization: Create visual representations to aid in the analysis of transaction values.
  3. Data Sampling

    • Simple Random Sampling (SRS): Randomly select a subset of data.
    • Stratified Sampling: Divide the data into strata based on transaction value and randomly select samples from each stratum.
    • Comparison: Compare the mean and standard deviation of the samples with the population statistics (a minimal sketch follows this list).
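For the sampling step above, the following is a minimal sketch of how SRS and stratified sampling could be compared against the population statistics with pandas; the `value_eth` column name, the number of strata, and the sample size are placeholders rather than the exact choices made in the notebook.

```python
import pandas as pd

def compare_samples(df: pd.DataFrame, n: int = 200, seed: int = 42) -> pd.DataFrame:
    """Compare SRS and stratified samples of transaction values with the population."""
    # Simple Random Sampling: pick n rows uniformly at random.
    srs = df.sample(n=n, random_state=seed)

    # Stratified sampling: bin transaction values into quartile strata,
    # then draw proportionally from each stratum.
    strata = pd.qcut(df["value_eth"], q=4, duplicates="drop")
    stratified = df.groupby(strata, observed=True).sample(frac=n / len(df), random_state=seed)

    # Put population and sample statistics side by side for comparison.
    return pd.DataFrame(
        {
            "population": [df["value_eth"].mean(), df["value_eth"].std()],
            "SRS": [srs["value_eth"].mean(), srs["value_eth"].std()],
            "stratified": [stratified["value_eth"].mean(), stratified["value_eth"].std()],
        },
        index=["mean", "std"],
    )
```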

CA1: Statistics

Overview

In this project, we get acquainted with and implement tools for statistical analysis such as Monte Carlo simulation, the Central Limit Theorem (CLT), and the t-test.

Tasks

  1. Monte Carlo Simulation

    • Pi Calculation: Estimate Pi by generating random points within a square and counting those inside an inscribed circle. Repeat with different point counts and analyze the results (see the sketch after this list).
    • Mensch Game: Simulate a simplified version of the Mensch game to calculate the probability of winning for each player.
  2. Central Limit Theorem (CLT)

    • Select three different probability distributions.
    • Generate random samples, calculate their means, and plot histograms overlaid with expected normal distributions.
    • Repeat for increasing sample sizes and observe the changes.
  3. Hypothesis Testing

    • Unfair Coin

      • Simulate a biased coin. Perform hypothesis testing to determine fairness using confidence interval and p-value approaches with sample sizes of 30, 100, and 1000.
    • T-Test

      • Calculate t-statistic and degrees of freedom for two groups.
      • Determine p-value and report results using manual calculations and the SciPy library.
    • Job Placement

      • Test if working alongside studying affects grades using a job placement dataset. Perform hypothesis tests manually and with SciPy, then compare results.
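As an illustration of the Monte Carlo and t-test parts above, a minimal sketch follows; the point counts and the two placeholder groups are arbitrary and not the data used in the assignment.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def estimate_pi(n_points: int) -> float:
    # Draw points uniformly in the unit square and count those inside the quarter circle.
    x, y = rng.random(n_points), rng.random(n_points)
    inside = (x**2 + y**2) <= 1.0
    return 4 * inside.mean()

for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"n={n:>9}: pi ≈ {estimate_pi(n):.5f}")

# Two-sample t-test, to be compared against the manual calculation in the notebook.
group_a = rng.normal(loc=70, scale=10, size=30)   # placeholder data
group_b = rng.normal(loc=73, scale=10, size=30)
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```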

CA2: EDA

Overview

In this assignment, we investigate open-ended questions that ask us to think creatively and critically while performing EDA on the provided datasets.

Tasks

  1. The provided dataset contains information about the passengers of the sunken ship ‘RMS Lusitania’. In this task, we performed preprocessing and common EDA steps, such as running queries and plotting, using NumPy, Matplotlib, and the mighty pandas.
  2. This dataset focuses on data scientist salaries across different regions from 2020 to 2024. We performed data cleaning and further preprocessing, then applied different techniques to extract useful insights from the data (a minimal sketch follows).
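The following is a minimal sketch of the kind of pandas queries used in these tasks; the file name and the column names (`job_title`, `salary_in_usd`, `work_year`) are assumptions about the salaries dataset, not its exact schema.

```python
import pandas as pd

# Placeholder file and column names for the data-scientist-salaries task.
df = pd.read_csv("salaries.csv")
df = df.drop_duplicates().dropna(subset=["salary_in_usd"])

# Example query: median salary per job title and year, top 10 rows.
summary = (
    df.groupby(["work_year", "job_title"])["salary_in_usd"]
      .median()
      .sort_values(ascending=False)
      .head(10)
)
print(summary)
```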

CA3: PySpark

Overview

In this assignment, we work with PySpark, the Python API for Apache Spark. As a first step, we do some warm-up exercises to learn how to use it. Next, we work on a large Spotify parquet dataset to gain insights from it using the methods we have learned so far. The Spotify dataset contains more than 1.5M records with around 25 features, on which we performed multiple preprocessing, data cleaning, and feature engineering steps to prepare it for EDA.
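A minimal sketch of the kind of PySpark pipeline described above; the parquet path and column names (`artist`, `danceability`, `energy`) are placeholders and may not match the actual Spotify schema.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("spotify-eda").getOrCreate()

# Load the parquet dataset and drop obviously broken rows (placeholder columns).
tracks = spark.read.parquet("spotify.parquet").dropna(subset=["artist", "danceability"])

# Example aggregation: average audio features per artist, for artists with many tracks.
per_artist = (
    tracks.groupBy("artist")
          .agg(F.count("*").alias("n_tracks"),
               F.avg("danceability").alias("avg_danceability"),
               F.avg("energy").alias("avg_energy"))
          .filter(F.col("n_tracks") >= 50)
          .orderBy(F.col("avg_danceability").desc())
)
per_artist.show(10, truncate=False)
```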

Some of our results:

Relation of Each Pair of Features

CA4: Modeling

Overview

In this assignment, we explore various loss functions and apply gradient descent methods to optimize these functions. We work with the Diabetes dataset from the scikit-learn library. This dataset consists of medical diagnostic measurements from numerous patients and is specifically designed to study diabetes progression. We use these data points to predict the quantitative measure of disease progression one year after baseline, thus practicing the application of regression analysis in a medical context.

Tasks

  1. Functions’ Implementation

    • Implementing the following functions from scratch (a sketch follows the task list):
      • Mean Squared Error (MSE)
      • Mean Absolute Error (MAE)
      • Root Mean Squared Error (RMSE)
      • Coefficient of Determination (R² Score)
  2. Building and Training the Linear Regression Model

  3. Model Evaluation

  4. Ordinary Least Squares
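A minimal sketch of the from-scratch metrics from task 1 above, checked against scikit-learn on the Diabetes dataset; the gradient-descent training loop itself is omitted and an OLS fit is used only to have predictions to score.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def mse(y_true, y_pred):  return np.mean((y_true - y_pred) ** 2)
def mae(y_true, y_pred):  return np.mean(np.abs(y_true - y_pred))
def rmse(y_true, y_pred): return np.sqrt(mse(y_true, y_pred))
def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

X, y = load_diabetes(return_X_y=True)
y_pred = LinearRegression().fit(X, y).predict(X)   # OLS baseline, just to exercise the metrics

print(f"MSE={mse(y, y_pred):.2f}  MAE={mae(y, y_pred):.2f}  RMSE={rmse(y, y_pred):.2f}")
print(f"R2 (ours)={r2(y, y_pred):.4f}  R2 (sklearn)={r2_score(y, y_pred):.4f}")
```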

CA5: Feature Engineering

Overview

In this assignment, we first apply feature engineering techniques to a football-related dataset to analyze the likelihood of scoring a goal from a shot. Next, we delve further into regression and cross-validation by implementing multivariate regression and k-fold cross-validation from scratch, applying them to a preprocessed car dataset, and comparing our outcomes with those obtained using Python's built-in libraries.

Tasks

  1. Preprocessing, Feature Engineering, and Model Evaluation on a Football Dataset

Before Feature Engineering

After Feature Engineering

  2. Multivariate Regression Implementation: We implement multivariate regression from scratch and use the gradient descent algorithm to update the weights. We also plot the accuracy across different random states for a more robust verification.

  3. Manual K-Fold Cross-Validation Implementation: We implement k-fold cross-validation from scratch. As in the previous section, we use the gradient descent algorithm to adjust the weights (see the sketch after this list).

  4. Comparison with Built-in Python Libraries
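For the manual k-fold step above, the following is a minimal sketch assuming `X` and `y` are NumPy arrays of standardized features and targets; the learning rate, epoch count, and fold count are arbitrary placeholders.

```python
import numpy as np

def gd_linear_regression(X, y, lr=0.01, epochs=500):
    # Multivariate linear regression trained with batch gradient descent.
    X = np.c_[np.ones(len(X)), X]          # prepend a bias column
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = 2 / len(X) * X.T @ (X @ w - y)
        w -= lr * grad
    return w

def k_fold_rmse(X, y, k=5, seed=0):
    # Manual k-fold: shuffle indices, split into k folds, train on k-1, test on the rest.
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = gd_linear_regression(X[train], y[train])
        preds = np.c_[np.ones(len(test)), X[test]] @ w
        scores.append(np.sqrt(np.mean((y[test] - preds) ** 2)))
    return np.mean(scores), np.std(scores)

# Example usage (X_scaled, y are placeholders): mean_rmse, rmse_std = k_fold_rmse(X_scaled, y)
```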

CA6: Clustering

Overview

In this assignment, we delve into dimensionality reduction and unsupervised learning tasks. We work with the database from an article called "Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records", with more than 200k items and 50 features.

Tasks

  1. Preprocess
  2. Dimensionality Reduction with PCA

  3. Unsupervised Learning (a minimal sketch follows this list)
    • K-Means
    • DBSCAN
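A minimal sketch of the dimensionality-reduction and clustering steps above; the random matrix stands in for the preprocessed numeric feature matrix, and the cluster count and DBSCAN parameters are placeholders rather than the values used in the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, DBSCAN

# Placeholder stand-in for the preprocessed, numeric feature matrix.
X = np.random.default_rng(0).normal(size=(1000, 20))
X_scaled = StandardScaler().fit_transform(X)

# Keep enough principal components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(f"{pca.n_components_} components explain 95% of the variance")

# K-Means with a placeholder cluster count (chosen e.g. via the elbow method).
kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_reduced)

# DBSCAN with placeholder eps / min_samples (tuned e.g. with a k-distance plot).
dbscan_labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X_reduced)
```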

CA7: Large Language Model

Overview

In this assignment, we work with an IMDB review dataset and try to train a model to classify reviews as positive or negative automatically. First, we use different methods to expand the labeled data for training and extract features from the sentences; then we train and evaluate our classifier models. We used common models such as Decision Tree, Logistic Regression, Gaussian Naive Bayes, Gradient Boosting, Random Forest, and SVM with different kernels to propagate the labels. Finally, we used the Phi-3 LLM to generate labels and compared it with the traditional methods.

Tasks

  1. EDA

  2. Feature Engineering

  3. Labeling with Traditional Methods (a minimal sketch follows this list)
  • K-Means

  • KNN from scratch

  • Label Propagation

  • Self Training

  4. Labeling using LLM
  • Chain of Thought

  • Labeling Test Data with LLM
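A minimal sketch of the label-expansion idea from task 3 above, using TF-IDF features and scikit-learn's self-training wrapper; the file name, column names, base classifier, and threshold are assumptions, and `-1` marks unlabeled reviews as scikit-learn expects.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Placeholder schema: a "review" text column and a "label" column where
# unlabeled rows are marked with -1 (the convention scikit-learn expects).
df = pd.read_csv("imdb_reviews.csv")
X = TfidfVectorizer(max_features=20_000, stop_words="english").fit_transform(df["review"])
y = df["label"].to_numpy()

# Self-training: fit the base classifier on the labeled subset, then iteratively
# add its most confident predictions on the unlabeled rows as pseudo-labels.
self_training = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)
self_training.fit(X, y)

pseudo_labels = self_training.transduction_   # labels assigned to every row
print(f"{np.sum(y == -1)} unlabeled rows, "
      f"{np.sum((y == -1) & (pseudo_labels != -1))} received pseudo-labels")
```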

Project - Phase 0 - Data Retrieval:

In this phase, we gathered housing data through the Realtor.com API and encountered several challenges:

  • Rate Limitations: The API's rate limit restricted the number of requests we could make. To overcome this, we utilized multiple systems and IP addresses to distribute the load.
  • Duplicated Data: Initially, our method for sorting and receiving data led to significant duplication. We refined our approach by implementing better sorting mechanisms to ensure the retrieval of unique listings.

After addressing these issues, we successfully collected 43,000 unique housing records, each with 25 features.
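As one way the collected batches could be merged and de-duplicated with pandas (not necessarily the sorting-based fix described above), here is a minimal sketch; the file pattern and the `property_id` / `fetched_at` columns are assumptions about the Realtor.com payload.

```python
import glob
import pandas as pd

# Merge the JSON batches collected by the different machines / IPs (placeholder file pattern).
batches = [pd.read_json(path) for path in glob.glob("raw/listings_*.json")]
listings = pd.concat(batches, ignore_index=True)

# Drop duplicate listings, keeping the most recently fetched copy.
listings = (
    listings.sort_values("fetched_at")
            .drop_duplicates(subset="property_id", keep="last")
            .reset_index(drop=True)
)
print(f"{len(listings)} unique listings")
```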

Project - Phase 1 - Preprocess and EDA:

This phase was dedicated to data cleaning, feature engineering and EDAs.

  • Data Cleaning: The data cleaning process was meticulous, focusing on handling missing values contextually for each column:

    • Location-based Imputation: For certain features, we employed K-Nearest Neighbors (KNN) to impute missing values based on data from neighboring cities (a sketch follows this list).
    • Statistical Methods: For other columns, we used the average or median of the data's distribution, ensuring the imputed values were appropriate for their context.
    • Tags: We used the data in the tags column to fill in some of the other columns.
  • Feature Engineering: We enhanced the dataset by performing feature engineering:

    • Exploding Tag Columns: Some columns contained lists of tags or categories. We exploded these columns using one-hot encoding so that they better represented the data's structure, and we also added new columns based on those tags.
  • EDA: We performed a variety of EDAs on the cleaned data.
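A minimal sketch of the KNN-based imputation and the tag exploding mentioned above; the file name and column names (`latitude`, `longitude`, `sqft`, `year_built`, `tags`) are placeholders for the real schema, and in practice the raw tags field would first be parsed into a list.

```python
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.read_csv("housing.csv")   # placeholder file name

# KNN imputation on numeric features, so missing values are filled from similar
# (e.g. nearby or similarly sized) listings. Column names are assumptions.
numeric_cols = ["latitude", "longitude", "sqft", "year_built"]
df[numeric_cols] = KNNImputer(n_neighbors=5).fit_transform(df[numeric_cols])

# Explode the list-valued "tags" column into one-hot indicator columns.
mlb = MultiLabelBinarizer()
tag_dummies = pd.DataFrame(
    mlb.fit_transform(df["tags"].apply(lambda t: t if isinstance(t, list) else [])),
    columns=mlb.classes_,
    index=df.index,
)
df = pd.concat([df.drop(columns="tags"), tag_dummies], axis=1)
```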

Project - Phase 2 - Prediction:

In this phase, after performing some more feature engineering, we used classic ML methods and a neural network to predict house prices.

  • Feature Engineering

Log Transform Price

  • Dimensionality Reduction: To understand the data's variance and simplify our model, we applied Principal Component Analysis (PCA), as sketched below:

    • Variance Explained: We assessed how much variance was captured by the first two principal components; it was around 20%.
    • Feature Requirement: We determined the number of principal components needed to cover 95% of the dataset's variance: 32 dimensions.

  • Neural Network: A deep learning model was trained to capture complex patterns in the data.

  • Classic ML: We used these models to predict the target:
    • Decision Tree
    • Gradient Boosting
    • Linear Regression
    • Random Forest
    • K-Nearest Neighbors (KNN)
    • XGBoost
    • Support Vector Machine (SVM)
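For the dimensionality-reduction step above, a minimal sketch of how the 95%-variance threshold could be found; the random matrix below is only a stand-in for the preprocessed feature matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder stand-in for the preprocessed numeric feature matrix.
X_train = np.random.default_rng(0).normal(size=(1000, 60))

pca = PCA().fit(StandardScaler().fit_transform(X_train))
cumulative = np.cumsum(pca.explained_variance_ratio_)

print(f"first two components explain {cumulative[1]:.1%} of the variance")
print(f"components needed for 95%: {np.argmax(cumulative >= 0.95) + 1}")
```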


Contributors
