Core: Programming Skills + Maths and Statistics + Subject Matter Expertise
Maths and Statistics: Linear Algebra + Probability + Bayesian Statistics + Calculus
Machine Learning Engineering ( MLE ) | Machine Learning Operations ( MLOps )
Focuses on building products/services that are highly scalable and highly performant. | Focuses on delivering products/services in production and ensuring service quality is always maintained.
Focuses on providing permanent fixes in response to any incident/bug. | Focuses on ensuring the product/service is up and running.
Not an end-user-facing role. | An end-user-facing role that requires strong communication skills.
Outliers : Extreme Value Analysis | DBSCAN | 5 Number Summary ( see the sketch below ) | Algorithms robust to outliers ( KNN & Random Forest )
Imbalanced : Up & Down Sampling | F1 Score | Stratified K Fold Cross Validation | Random Forest ( class_weight )
Overfitting : Apply Regularization | Apply Ensembles | Apply Cross Validation | Feature Selection
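A minimal sketch of 5-number-summary ( IQR ) outlier detection; the data values are made up for illustration:

```python
import numpy as np

# Hypothetical sample data; any 1-D numeric array works.
data = np.array([10, 12, 11, 13, 12, 95, 11, 10, 14, 12])

# 5 number summary uses Q1, median, Q3 (plus min and max).
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Tukey's rule: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
print(outliers)  # -> [95]
```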
Time Complexity of Counting the Occurrences of Characters in a String : O(n)
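A minimal sketch of that O(n) count, a single pass over the string:

```python
from collections import Counter

def char_counts(s: str) -> dict:
    # One pass over the string: O(n) time, O(k) space
    # for k distinct characters.
    counts = {}
    for ch in s:
        counts[ch] = counts.get(ch, 0) + 1
    return counts

print(char_counts("banana"))  # {'b': 1, 'a': 3, 'n': 2}
print(Counter("banana"))      # same result via the standard library
```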
Log is the inverse of the exponent.
e.g. Base investment : ₹5, and a 5-times return : ₹125, so Log5 125 = 3 ( years ).
Log5 5³ = 3, i.e. 3 * Log5 5 = 3 * 1 = 3, since Log5 5 = 1.
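A quick check of the same arithmetic in Python:

```python
import math

# log base 5 of 125 is 3, because 5 ** 3 == 125.
print(math.log(125, 5))  # ≈ 3.0 ( subject to floating-point rounding )
print(5 ** 3)            # 125
```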
A testing framework for Python that simplifies the process of writing and executing tests.
It provides an easy-to-use and expressive syntax for creating test cases, running tests, and reporting the results.
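The framework isn't named here; the description matches pytest, so here is a minimal sketch assuming pytest ( the module and function names are hypothetical ):

```python
# test_math.py -- run with `pytest test_math.py`.

def add(a, b):
    return a + b

def test_add():
    # pytest discovers functions prefixed with `test_` and
    # reports plain `assert` failures with helpful diffs.
    assert add(2, 3) == 5
```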
Model = Algorithm ( Parameters ) + Data
Data Pipeline ( Where and how the data are collected, transformed and loaded )
A set of actions that extract data from various sources, transform it into the proper format, and load it for processing.
An automated process :
- Select columns from a database.
- Merge columns from two or more tables.
- Subset rows ( sample ).
- Handle missing data.
- Load them into another database.
The first time, the process is complicated, but if you do it right you only have to do it once.
To achieve automation you need to think, plan, and write it in simple language, keeping it reproducible.
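A minimal pandas sketch of those pipeline steps; the file names, columns, and SQLite target are hypothetical:

```python
import sqlite3
import pandas as pd

# Extract: hypothetical CSV sources standing in for "various sources".
orders = pd.read_csv("orders.csv")        # assumed columns: order_id, customer_id, amount
customers = pd.read_csv("customers.csv")  # assumed columns: customer_id, region

# Transform: select columns, merge tables, sample rows, handle missing data.
orders = orders[["order_id", "customer_id", "amount"]]  # select columns
merged = orders.merge(customers, on="customer_id")      # merge two tables
sample = merged.sample(frac=0.1, random_state=42)       # subset rows
sample = sample.dropna(subset=["amount"])               # handle missing data

# Load: write the result into another database ( SQLite here ).
with sqlite3.connect("analytics.db") as conn:
    sample.to_sql("clean_orders", conn, if_exists="replace", index=False)
```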
A storage repository where data is stored in its natural | raw format without applying any transformation.
A data warehouse uses a files-and-folders structure; a data lake uses a flat architecture.
We try to make our model more accurate by tuning and tweaking the parameters, but we cannot make a 100% accurate model.
Prediction and classification models can never be error free.
Y = f ( x ) + e
Y : Response Variable | Dependent Variable
x : Independent variable
e : Irreducible error ( Even if we make a 100% accurate estimate of f ( x ), our model can't be error free; this is known as the irreducible error )
A function that takes the weighted sum of all the inputs from the previous layer, adds a bias, and generates the output for the next layer.
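A minimal NumPy sketch of that computation for one neuron; the sigmoid activation and the input/weight values are my own assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical inputs from the previous layer, with weights and a bias.
inputs = np.array([0.5, -1.2, 3.0])
weights = np.array([0.4, 0.1, -0.6])
bias = 0.2

z = np.dot(weights, inputs) + bias  # weighted sum of inputs plus bias
output = sigmoid(z)                 # activation -> output for the next layer
print(output)
```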
Hyperparameter Optimization
Finding the ideal set of hyperparameters for a prediction algorithm to achieve optimum performance.
Parameter | Hyperparameter
Automatically learned while training. | Manually tuned by the developer to guide the training.
Weights and biases are the model parameters. | Learning rate, depth of tree, class weights.
Internal configuration variables of the model. | External configuration variables of the model.
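A minimal tuning sketch with scikit-learn's GridSearchCV; the dataset and grid values are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Hyperparameters ( external ): tree depth, class weights.
grid = {"max_depth": [2, 4, 8], "class_weight": [None, "balanced"]}

search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=5)
search.fit(X, y)  # parameters ( internal ) are learned during each fit

print(search.best_params_)
```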
Data Warehouse | Data Lake
Structured + Pre-processed | Unstructured + Semi-Structured + Structured + Raw
Organized before storing | Organized before using
Business professionals, Analysts, BI and Visualizations | Data Scientists, Analytics and AI
DBMS | RDBMS
Stores data in the form of files | Stores data in the form of tables
Hierarchical arrangement of data | Rows and columns ( Tables )
Manages data in a computer | Maintains relationships of tables in a database
Classification | Clustering
Needs prior knowledge of data | No prior knowledge of data
Classifies new samples into known classes | Suggests groups based on patterns in data
Decision Tree | K Means
Labelled samples | Unlabelled samples
LDA | PCA
Linear Discriminant Analysis | Principal Component Analysis
Supervised | Unsupervised
K Means | K Nearest Neighbors ( KNN )
Unsupervised | Supervised
K : Number of clusters | K : Number of nearest neighbors
Determines the distance of each data point to the cluster centroids and assigns each point to the closest centroid | Calculates the distance between the new data point and its K nearest neighbours
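A minimal scikit-learn sketch contrasting the two; the Iris dataset and the values of K are assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# K Means ( unsupervised ): K = number of clusters, labels are ignored.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:5])

# KNN ( supervised ): K = number of neighbours, labels are required.
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict(X[:5]))
```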
Variance ( s² ) | Standard Deviation ( s )
Distance between the data points in the dataset | Distance of a data point from the mean of the dataset
Variance | Covariance
Magnitude | Magnitude and Direction
How data points vary from their mean | How data points vary with respect to each other
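A minimal NumPy sketch of all three quantities; the data values are made up:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

print(np.var(x, ddof=1))   # sample variance s² ( squared units )
print(np.std(x, ddof=1))   # sample standard deviation s ( same units as x )
print(np.cov(x, y)[0, 1])  # covariance: its sign gives the direction of co-movement
```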
Which Algorithm Generates the Best Model ? Judge candidates on accuracy and latency.
Accuracy : How do they handle data of different sizes ? | How do they handle the complexity of feature relationships ? | How do they handle messy data ( Missing Data + Outliers ) ?
Latency : How long will it take to train the model ? | How long will it take to predict the dependent variables ?
The correlation of a data point with a delayed copy of itself.
e.g. the temperature today vs the temperature yesterday or tomorrow.
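A minimal pandas sketch of lag-1 autocorrelation; the temperature values are made up:

```python
import pandas as pd

# Hypothetical daily temperatures.
temps = pd.Series([21.0, 22.5, 23.0, 21.5, 20.0, 22.0, 23.5])

# Correlation of the series with itself shifted by one day ( lag-1 ).
print(temps.autocorr(lag=1))
```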
A phenomenon in which at least two independent variables are linearly correlated ( one can be predicted from the other ).
Cross Join | Cartesian Product
Generates the paired combination of each row of the first table with each row of the second table.
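In SQL this is `SELECT * FROM a CROSS JOIN b`; a minimal pandas equivalent ( the table contents are made up, and `how="cross"` needs pandas >= 1.2 ):

```python
import pandas as pd

sizes = pd.DataFrame({"size": ["S", "M"]})
colors = pd.DataFrame({"color": ["red", "blue"]})

# Cross join / Cartesian product: every row of the first table
# paired with every row of the second ( 2 x 2 = 4 rows ).
pairs = sizes.merge(colors, how="cross")
print(pairs)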
Explore ( EDA ) and clean ( Data Cleaning ) the data.
Split data into train + validation + test sets.
Train an initial model and evaluate it.
Tune hyperparameters + cross-validation ( assurance of accuracy ).
Evaluate on the validation set ( performance ).
Evaluate on the test set ( prediction ).
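A minimal end-to-end sketch of that workflow in scikit-learn; the dataset, model, and split sizes are assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Split: train + validation + test ( 60 / 20 / 20 here ).
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Initial model + cross-validation on the training set ( assurance of accuracy ).
model = LogisticRegression(max_iter=5000)
print(cross_val_score(model, X_train, y_train, cv=5).mean())

model.fit(X_train, y_train)
print(model.score(X_val, y_val))    # tune against the validation set ( performance )
print(model.score(X_test, y_test))  # final held-out test evaluation ( prediction )
```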