Elite Data Science Primer

Exploratory Data Analysis ( EDA ) | Data Cleaning | Feature Engineering | Data Preprocessing

1. Exploratory Data Analysis

Extract Information and Capture Meaningful Insights from the Dataset is Exploratory Data Analysis

Know about the Data Set | Statistical Information | Summary of Data

Hint for Data Cleaning
Idea for Feature Engineering
Data Visualization to Discover Hidden Patterns and Relations | Spot Anomalies | Detect Outlier ( Five Number Summary | IQR )
Scatter Plot, Histogram and Boxplot for Continuous Variables ( Outliers, Correlation, Measure of Spread and Central Tendency )
Bar Plot to know about Categorical Variables | Histogram to understand Distribution of Data | Heatmap ( Correlation )

Ask Questions to Data

How many Observations ? ( Rows )
How many Features ? ( Columns )
Data Types of the Features ? ( Categorical or Numeric )
Target Variable in the Data Set ?
Whether Data set is Balanced or Imbalanced ?
Do the Values on the Right Scale ? or needs ( Transformation | Normalization | Standardization )
Identify Missing Data | Reason for Missing Data
Plot Numerical Distributions ( Detect Outliers, Boxplot, Scatter, Histogram )
Plot Categorical Data ( Bar Chart | Frequency of Data | Combine Sparse Classes )
Find Correlations ( Heatmaps )
Analysis : Univariate ( One Variable ) | Bivariate ( Two Variables ) | Multivariate ( More than Two Variables )

2. Data Cleaning

Better Data > Fancy Algorithm

What ? | Why ? | How ?

Remove Irrelevant | Incorrect | Incomplete | Improper | Duplicate | Inconsistent | Outdated | Insecure | Unformatted Data ( Filter )
Fix Typo Error ( Inconsistent Capitalization ) Mislabelled Classes ( 'home' and 'Home' is Same )
Filter Outliers by Setting Filter or Replace with some Relevant Value ( Mean, Median or Mode )
Convert Data from One Format to Standard Format
If Data is Stored in Multiple CSV Files, Combine CSV Data into Single Data Frame
Fill | Impute | Drop | Handle Missing Data ( Add 'Missing' label in Categorical, Set Flag as 0 in Continuous )
Split | Merge | Extract Column ( s ) or Row ( s )
Create New Relevant Feature from Existing One

Benefits of Data Cleaning

Improve Data Quality
Save Time and Increase Overall Productivity
Improve Decision Making
Boost Business
Minimize Compliance ( Laws | Requirements | Rules | Regulations | Standard | Policies | Governance | Transparency ) Risk

3. Feature Engineering

In EDA we only Identify and in Feature Engineering and Data Cleaning we Actually Fix the Problems Identified

Feature Engineering is Performed after complete EDA
Make Data Ready for Machine Learning Algorithms
Use Domain Knowledge of Data to Create Features and Variables to use in Machine Learning
Improve Machine Learning Model Performance
Remove Unused Features
Handle Date and Time Features
Handle Missing Data
Handle Categorical Data ( Categorical to Numeric | Grouping Sparse Data | Create Dummy )

Handling Missing Data | Encoding Categorical Variables | Variable Transformation | Create New Features

1. Handling Missing Data

Drop ( Particular Row | Entire Columns )
Impute ( Mean | Median | Mode ) ( Univariate )
Flag ( Continuous : 1 or 0 | Categorical : 'Missing' )
Predict from Existing Data ( Multivariate )

2. Categorical Encoding

Label Encoding ( Ordinal )
One Hot Encoding | Dummy Encoding ( Nominal )

3. Variable Transformation

Create Normal Distribution
Logarithmic ( log ( x ) ) | Exponential ( x ⁿ ) | Square root ( sqrt ( x ) ) | Reciprocal ( 1 / x )

4. Discretization

Sorting Value of Variables in Bins or Intervals ( Buckets )
Equal Width Discretization
Equal Frequency Discretization
Decision Tree Descretization ( Tree Splits Equally )

5. Handle Outliers

Visualize by using Boxplots, Histogram and Bar Graphs
Filter or Trim Data set
Remove Outliers
Setup a Filter and Trim Data Set
Change | Replace the Outlier

6. Feature Scaling

Standardization ( x - mean ( x ) ) / std ( x )
Min Max Rescaling ( 0 - 1 )
Maximum Absolute Scaling : Dividing Each value with Maximum Value ( 0 - 1 )
Robust Scaling : Dividing by IQR ( Q3 - Q1 )
Mean Normalization : ( x - mean ) / ( max - min )

7. Date and Time Engineering

Parse Date and Time so that we can extract Year, Month, Day, Week and Perform any Operation.

8. Feature Creation

Create New Meaningful Feature from the Existing Features. ( Aggregating | Arithmetic Calculation )

9. Aggregation of Transaction Data

Total of Sales
Profit or Loss of the Day
Total Amount Credited or Debited for the Day

Machine Learning Pipeline

Help to Automate Machine Learning Workflows.
Pipeline Chain together Multiple Steps to Train the Model, Output of each Step is used as Input to the Next Step.
Pipelines are Cyclic, contains Iterative Steps to Continuously Improve the Accuracy of Algorithm and makes Model Scalable.
The Learning Algorithm Finds Patterns in the Training Data that Maps the Features to the Target and Create a ML Model.

1. Data

Data Collection
Data Preprocessing
Exploratory Data Analysis
Data Preprocessing

2. Data Preparation

Data Cleaning
Feature Engineering ( Deal with Numerical Data + Handle Categorical Data )
Data Visualization ( Find Hidden Patterns )
Feature Extraction | Feature Selection
Transforation, Normalization or Standardization

3. Data Modeling

Split the Data into Train Set and Test Set
Train the Model on Train Set
Tune the Model on Validate Set
Evaluate the Model on Test Set

4. Deployment

Integrate with Application or Website
Fine Tuning
Real Life Prediction or Classifications

Small Pipelines

make_pipeline( Imputer, Column Transformer, Estimator ) | Estimator : Classifier or Regressor.
make_column_transformer() : We can apply different Encoder on different Column Individually based on Requirement )
pipeline can be directly used for Training using fit() and it can even Predict using predict()

4. Data Preprocessing

A process of preparing the raw data and making it suitable for analytics, insights and machine learning model.
First crucial step while creating a machine learning model.

Back to Questions

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Primer Steps.md

Primer Steps.md

Elite Data Science Primer

Exploratory Data Analysis ( EDA ) | Data Cleaning | Feature Engineering | Data Preprocessing

1. Exploratory Data Analysis

Know about the Data Set | Statistical Information | Summary of Data

Ask Questions to Data

2. Data Cleaning

What ? | Why ? | How ?

Benefits of Data Cleaning

3. Feature Engineering

1. Handling Missing Data

2. Categorical Encoding

3. Variable Transformation

4. Discretization

5. Handle Outliers

6. Feature Scaling

7. Date and Time Engineering

8. Feature Creation

9. Aggregation of Transaction Data

Machine Learning Pipeline

1. Data

2. Data Preparation

3. Data Modeling

4. Deployment

Small Pipelines

4. Data Preprocessing

Files

Primer Steps.md

Latest commit

History

Primer Steps.md

File metadata and controls

Elite Data Science Primer

Exploratory Data Analysis ( EDA ) | Data Cleaning | Feature Engineering | Data Preprocessing

1. Exploratory Data Analysis

Know about the Data Set | Statistical Information | Summary of Data

Ask Questions to Data

2. Data Cleaning

What ? | Why ? | How ?

Benefits of Data Cleaning

3. Feature Engineering

1. Handling Missing Data

2. Categorical Encoding

3. Variable Transformation

4. Discretization

5. Handle Outliers

6. Feature Scaling

7. Date and Time Engineering

8. Feature Creation

9. Aggregation of Transaction Data

Machine Learning Pipeline

1. Data

2. Data Preparation

3. Data Modeling

4. Deployment

Small Pipelines

4. Data Preprocessing