Exploratory Data Analysis ( EDA ) : Extract Information and Capture Meaningful Insights from the Dataset
- Hints for Data Cleaning
- Ideas for Feature Engineering
- Data Visualization to Discover Hidden Patterns and Relations | Spot Anomalies | Detect Outliers ( Five Number Summary | IQR )
- Scatter Plot, Histogram and Boxplot for Continuous Variables ( Outliers, Correlation, Measure of Spread and Central Tendency )
- Bar Plot to know about Categorical Variables | Histogram to understand Distribution of Data | Heatmap ( Correlation )
- How many Observations ? ( Rows )
- How many Features ? ( Columns )
- Data Types of the Features ? ( Categorical or Numeric )
- Target Variable in the Data Set ?
- Whether Data set is Balanced or Imbalanced ?
- Are the Values on the Right Scale ? or do they need ( Transformation | Normalization | Standardization )
- Identify Missing Data | Reason for Missing Data
- Plot Numerical Distributions ( Detect Outliers : Boxplot, Scatter, Histogram )
- Plot Categorical Data ( Bar Chart | Frequency of Data | Combine Sparse Classes )
- Find Correlations ( Heatmaps )
- Analysis : Univariate ( One Variable ) | Bivariate ( Two Variables ) | Multivariate ( More than Two Variables )
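A minimal pandas / seaborn sketch of the checks above ( the file name data.csv and the target column are assumptions ) :

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical dataset: any CSV with a 'target' column will do
df = pd.read_csv("data.csv")

print(df.shape)        # how many observations (rows) and features (columns)?
print(df.dtypes)       # data types: categorical (object) or numeric
print(df.isnull().sum())                          # missing data per column
print(df["target"].value_counts(normalize=True))  # balanced or imbalanced?
print(df.describe())   # five-number summary plus mean/std for numeric columns

# Numerical distributions: histograms and boxplots to spot outliers
df.hist(figsize=(10, 8))
df.plot(kind="box", figsize=(10, 4))

# Categorical data: bar chart of class frequencies
df["target"].value_counts().plot(kind="bar")

# Correlations between numeric features as a heatmap
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()
```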
Better Data > Fancy Algorithm
- Remove Irrelevant | Incorrect | Incomplete | Improper | Duplicate | Inconsistent | Outdated | Insecure | Unformatted Data ( Filter )
- Fix Typos ( Inconsistent Capitalization ) and Mislabelled Classes ( 'home' and 'Home' are the Same )
- Filter Outliers by Setting Filter or Replace with some Relevant Value ( Mean, Median or Mode )
- Convert Data from One Format to Standard Format
- If Data is Stored in Multiple CSV Files, Combine CSV Data into Single Data Frame
- Fill | Impute | Drop | Handle Missing Data ( Add 'Missing' label in Categorical, Set Flag as 0 in Continuous )
- Split | Merge | Extract Column ( s ) or Row ( s )
- Create New Relevant Features from Existing Ones
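A small pandas sketch of these cleaning steps ( the data/*.csv files and the city, price and area columns are made up for illustration ) :

```python
import glob
import pandas as pd

# Combine multiple hypothetical CSV files into a single DataFrame
df = pd.concat((pd.read_csv(f) for f in glob.glob("data/*.csv")),
               ignore_index=True)

# Remove duplicate and obviously incorrect rows (filter)
df = df.drop_duplicates()
df = df[df["price"] > 0]

# Fix inconsistent capitalization so 'home' and 'Home' become the same class
df["city"] = df["city"].str.strip().str.lower()

# Handle missing data: 'Missing' label for categorical, 0 flag for continuous
df["city"] = df["city"].fillna("Missing")
df["price"] = df["price"].fillna(0)

# Create a new relevant feature from existing ones
df["price_per_sqft"] = df["price"] / df["area"]
```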
- Improve Data Quality
- Save Time and Increase Overall Productivity
- Improve Decision Making
- Boost Business
- Minimize Compliance ( Laws | Requirements | Rules | Regulations | Standard | Policies | Governance | Transparency ) Risk
In EDA we only Identify Problems ; in Feature Engineering and Data Cleaning we Actually Fix the Problems Identified
- Feature Engineering is Performed after EDA is Complete
- Make Data Ready for Machine Learning Algorithms
- Use Domain Knowledge of Data to Create Features and Variables to use in Machine Learning
- Improve Machine Learning Model Performance
- Remove Unused Features
- Handle Date and Time Features
- Handle Missing Data
- Handle Categorical Data ( Categorical to Numeric | Grouping Sparse Data | Create Dummy )
Handling Missing Data | Encoding Categorical Variables | Variable Transformation | Create New Features
- Drop ( Particular Row | Entire Columns )
- Impute ( Mean | Median | Mode ) ( Univariate )
- Flag ( Continuous : 1 or 0 | Categorical : 'Missing' )
- Predict from Existing Data ( Multivariate )
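A hedged sketch of the four options using pandas and scikit-learn ( the age, salary and city columns are invented ) :

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical data with missing values
df = pd.DataFrame({"age": [25, None, 40, 31],
                   "salary": [30_000, 42_000, 55_000, 38_000],
                   "city": ["Pune", None, "Delhi", "Pune"]})

# Drop: a particular row or an entire column
dropped_rows = df.dropna()               # drop rows with any missing value
dropped_col = df.drop(columns=["age"])   # drop an entire column

# Impute (univariate): mean / median / mode
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])

# Flag: 'Missing' label for categorical, 0 for continuous
flagged = df.copy()
flagged["city"] = flagged["city"].fillna("Missing")
flagged["age"] = flagged["age"].fillna(0)

# Predict from existing data (multivariate): KNN imputation on numeric columns
numeric = df[["age", "salary"]]
df[numeric.columns] = KNNImputer(n_neighbors=2).fit_transform(numeric)
```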
- Label Encoding ( Ordinal )
- One Hot Encoding | Dummy Encoding ( Nominal )
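A short sketch of both encodings ( the size and colour columns are made up ; OrdinalEncoder stands in for label encoding of ordered categories ) :

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"size": ["small", "large", "medium"],
                   "colour": ["red", "green", "red"]})

# Label / ordinal encoding for ordered (ordinal) categories
ord_enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_encoded"] = ord_enc.fit_transform(df[["size"]]).ravel()

# One-hot / dummy encoding for unordered (nominal) categories
dummies = pd.get_dummies(df["colour"], prefix="colour", drop_first=True)
df = pd.concat([df, dummies], axis=1)
print(df)
```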
- Create Normal Distribution
- Logarithmic ( log( x ) ) | Exponential ( x^n ) | Square Root ( sqrt( x ) ) | Reciprocal ( 1 / x )
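A numpy sketch of these transformations on an illustrative, skewed income column :

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [20_000, 35_000, 50_000, 120_000, 900_000]})

df["log_income"] = np.log(df["income"])          # logarithmic: log(x)
df["sqrt_income"] = np.sqrt(df["income"])        # square root: sqrt(x)
df["reciprocal_income"] = 1 / df["income"]       # reciprocal: 1 / x
df["income_squared"] = np.power(df["income"], 2) # exponential / power: x^n
```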
- Sorting Value of Variables in Bins or Intervals ( Buckets )
- Equal Width Discretization
- Equal Frequency Discretization
- Decision Tree Discretization ( Bin Edges Chosen by the Tree's Splits )
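A pandas / scikit-learn sketch of the three binning strategies ( the age and target columns are assumptions ) :

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({"age": [18, 22, 25, 31, 40, 45, 52, 60, 67, 75],
                   "target": [0, 0, 0, 1, 1, 1, 0, 1, 1, 1]})

# Equal width discretization: bins covering equal ranges of 'age'
df["age_equal_width"] = pd.cut(df["age"], bins=3, labels=False)

# Equal frequency discretization: bins with (roughly) equal counts
df["age_equal_freq"] = pd.qcut(df["age"], q=3, labels=False)

# Decision tree discretization: bin edges chosen by a shallow tree fit on the target
tree = DecisionTreeClassifier(max_depth=2)
tree.fit(df[["age"]], df["target"])
df["age_tree_bin"] = tree.apply(df[["age"]])   # leaf index acts as the bin label
```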
- Visualize by using Boxplots, Histogram and Bar Graphs
- Remove Outliers : Setup a Filter and Trim the Data Set
- Change | Replace the Outlier with a Relevant Value ( Mean | Median | Mode )
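A short sketch of trimming or replacing outliers with the IQR fence from the Five Number Summary ( the price column is illustrative ) :

```python
import pandas as pd

df = pd.DataFrame({"price": [100, 120, 110, 95, 105, 2_000]})

# IQR fence: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as outliers
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 1: set up a filter and trim the dataset
trimmed = df[df["price"].between(lower, upper)]

# Option 2: change / replace the outlier with a relevant value (here the median)
replaced = df.copy()
outlier_mask = ~replaced["price"].between(lower, upper)
replaced.loc[outlier_mask, "price"] = replaced["price"].median()
```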
- Standardization : ( x - mean( x ) ) / std( x )
- Min Max Rescaling ( 0 - 1 )
- Maximum Absolute Scaling : Dividing Each value with Maximum Value ( 0 - 1 )
- Robust Scaling : Dividing by IQR ( Q3 - Q1 )
- Mean Normalization : ( x - mean ) / ( max - min )
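The same scalings expressed with scikit-learn and plain pandas ( the column x is illustrative ; mean normalization has no dedicated scikit-learn scaler, so it is computed by hand ) :

```python
import pandas as pd
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler,
                                   RobustScaler, StandardScaler)

df = pd.DataFrame({"x": [2.0, 4.0, 6.0, 8.0, 100.0]})

df["standardized"] = StandardScaler().fit_transform(df[["x"]]).ravel()  # (x - mean) / std
df["min_max"] = MinMaxScaler().fit_transform(df[["x"]]).ravel()         # rescale to 0 - 1
df["max_abs"] = MaxAbsScaler().fit_transform(df[["x"]]).ravel()         # divide by max |x|
df["robust"] = RobustScaler().fit_transform(df[["x"]]).ravel()          # (x - median) / IQR

# Mean normalization: (x - mean) / (max - min)
df["mean_norm"] = (df["x"] - df["x"].mean()) / (df["x"].max() - df["x"].min())
```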
- Parse Date and Time so that we can extract Year, Month, Day, Week and Perform any Operation.
- Create New Meaningful Feature from the Existing Features. ( Aggregating | Arithmetic Calculation )
- Total of Sales
- Profit or Loss of the Day
- Total Amount Credited or Debited for the Day
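A pandas sketch of parsing dates and aggregating daily features ( the sales columns are made up ) :

```python
import pandas as pd

sales = pd.DataFrame({
    "date": ["2023-01-01", "2023-01-01", "2023-01-02"],
    "amount": [100.0, 250.0, 80.0],
    "cost": [60.0, 200.0, 90.0],
})

# Parse date and time so year, month, day and week can be extracted
sales["date"] = pd.to_datetime(sales["date"])
sales["year"] = sales["date"].dt.year
sales["month"] = sales["date"].dt.month
sales["day"] = sales["date"].dt.day
sales["week"] = sales["date"].dt.isocalendar().week

# Arithmetic calculation: profit or loss per row
sales["profit"] = sales["amount"] - sales["cost"]

# Aggregating: total sales and total profit/loss for the day
daily = sales.groupby("date").agg(total_sales=("amount", "sum"),
                                  daily_profit=("profit", "sum"))
print(daily)
```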
- Help to Automate Machine Learning Workflows.
- A Pipeline Chains together Multiple Steps to Train the Model ; the Output of each Step is used as Input to the Next Step.
- Pipelines are Cyclic ; they contain Iterative Steps to Continuously Improve the Accuracy of the Algorithm and make the Model Scalable.
- The Learning Algorithm Finds Patterns in the Training Data that Map the Features to the Target and Creates an ML Model.
- Data Collection
- Exploratory Data Analysis
- Data Preprocessing
- Data Cleaning
- Feature Engineering ( Deal with Numerical Data + Handle Categorical Data )
- Data Visualization ( Find Hidden Patterns )
- Feature Extraction | Feature Selection
- Transformation, Normalization or Standardization
- Split the Data into Train Set and Test Set
- Train the Model on Train Set
- Tune the Model on Validate Set
- Evaluate the Model on Test Set
- Integrate with Application or Website
- Fine Tuning
- Real Life Prediction or Classifications
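A compact scikit-learn sketch of the split / train / tune / evaluate steps, using a built-in toy dataset so it stays self-contained ( the choice of LogisticRegression and the C values are illustrative ) :

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Split the data into train, validation and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

# Train candidate models on the train set and tune (pick the best) on the validation set
best_model, best_score = None, -1.0
for c in (0.1, 1.0, 10.0):                      # illustrative hyperparameter values
    model = LogisticRegression(C=c, max_iter=5000).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_model, best_score = model, score

# Evaluate the chosen model once on the held-out test set
print("test accuracy:", best_model.score(X_test, y_test))
```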
- make_pipeline( Imputer, Column Transformer, Estimator ) | Estimator : Classifier or Regressor
- make_column_transformer() : Apply a Different Encoder on each Column Individually based on Requirement
- The Pipeline can be directly used for Training using fit() and it can even Predict using predict()
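A minimal sketch of such a pipeline ( the age and city columns and the choice of LogisticRegression as the estimator are assumptions ) :

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training data
X = pd.DataFrame({"age": [25, None, 40, 31, 52, 47],
                  "city": ["Pune", "Delhi", None, "Pune", "Delhi", "Pune"]})
y = [0, 1, 0, 1, 1, 0]

# Apply a different transformer to each column individually
preprocess = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), ["age"]),
    (make_pipeline(SimpleImputer(strategy="most_frequent"),
                   OneHotEncoder(handle_unknown="ignore")), ["city"]),
)

# Chain preprocessing and the estimator; the output of each step feeds the next
pipe = make_pipeline(preprocess, LogisticRegression())

# The whole pipeline is trained with fit() and used with predict()
pipe.fit(X, y)
print(pipe.predict(X.head(2)))
```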
- A process of preparing the raw data and making it suitable for analytics, insights and machine learning models.
- The first crucial step while creating a machine learning model.