Core: Programming Skills + Maths and Statistics + Subject Matter Expertise
Maths and Statistics: Linear Algebra + Probability + Bayesian Statistics + Calculus
Machine Learning Engineering ( MLE ) | Machine Learning Operations ( MLOps )
Focuses on building products/services that are highly scalable and highly performant. | Focuses on delivering products/services in production and ensuring service quality is always maintained.
Focuses on providing permanent fixes in response to any incident/bug. | Focuses on ensuring the product/service is up and running.
Not an end-user-facing role. | An end-user-facing role that requires strong communication skills.
Outliers : Extreme Value Analysis | DBSCAN | 5 Number Summary ( see the sketch below ) | Algorithms robust to outliers ( KNN & Random Forest )
Imbalanced : Up & Down Sampling | F1 Score | Stratified K Fold Cross Validation | Random Forest ( class_weight )
Overfitting : Apply Regularization | Apply Ensembles | Apply Cross Validation | Feature Selection
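A minimal sketch of 5-number-summary ( IQR ) outlier detection; the data values are made up for illustration:

```python
import numpy as np

# Hypothetical sample data; any 1-D numeric array works.
data = np.array([10, 12, 11, 13, 12, 95, 11, 10, 14, 12])

# 5 number summary uses Q1, median, Q3 (plus min and max).
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Tukey's rule: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
print(outliers)  # -> [95]
```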
Time Complexity of Counting the Occurrences of Characters in a String : O(n)
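A minimal sketch of that O(n) count, a single pass over the string:

```python
from collections import Counter

def char_counts(s: str) -> dict:
    # One pass over the string: O(n) time, O(k) space
    # for k distinct characters.
    counts = {}
    for ch in s:
        counts[ch] = counts.get(ch, 0) + 1
    return counts

print(char_counts("banana"))  # {'b': 1, 'a': 3, 'n': 2}
print(Counter("banana"))      # same result via the standard library
```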
Log is the inverse of the exponent.
e.g. Base investment : ₹5, and a 5-times return : ₹125, so Log5 125 = 3 ( years ).
Log5 5³ = 3, i.e. 3 * Log5 5 = 3 * 1 = 3, since Log5 5 = 1.
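A quick check of the same arithmetic in Python:

```python
import math

# log base 5 of 125 is 3, because 5 ** 3 == 125.
print(math.log(125, 5))  # ≈ 3.0 ( subject to floating-point rounding )
print(5 ** 3)            # 125
```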
A testing framework for Python that simplifies the process of writing and executing tests.
It provides an easy-to-use and expressive syntax for creating test cases, running tests, and reporting the results.
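The framework isn't named here; the description matches pytest, so here is a minimal sketch assuming pytest ( the module and function names are hypothetical ):

```python
# test_math.py -- run with `pytest test_math.py`.

def add(a, b):
    return a + b

def test_add():
    # pytest discovers functions prefixed with `test_` and
    # reports plain `assert` failures with helpful diffs.
    assert add(2, 3) == 5
```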
Model = Algorithm ( Parameters ) + Data
Data Pipeline ( Where and how the data are collected, transformed and loaded )
A set of actions that extract data from various sources, transform it into the proper format, and load it for processing.
An automated process :
- Select columns from a database.
- Merge columns from two or more tables.
- Subset rows ( sample ).
- Handle missing data.
- Load them into another database.
The first time, the process is complicated, but if you do it right you only have to do it once.
To achieve automation you need to think, plan, and write it in simple language, keeping it reproducible.
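A minimal pandas sketch of those pipeline steps; the file names, columns, and SQLite target are hypothetical:

```python
import sqlite3
import pandas as pd

# Extract: hypothetical CSV sources standing in for "various sources".
orders = pd.read_csv("orders.csv")        # assumed columns: order_id, customer_id, amount
customers = pd.read_csv("customers.csv")  # assumed columns: customer_id, region

# Transform: select columns, merge tables, sample rows, handle missing data.
orders = orders[["order_id", "customer_id", "amount"]]  # select columns
merged = orders.merge(customers, on="customer_id")      # merge two tables
sample = merged.sample(frac=0.1, random_state=42)       # subset rows
sample = sample.dropna(subset=["amount"])               # handle missing data

# Load: write the result into another database ( SQLite here ).
with sqlite3.connect("analytics.db") as conn:
    sample.to_sql("clean_orders", conn, if_exists="replace", index=False)
```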
A storage repository where data is stored in its natural | raw format without applying any transformation.
A data warehouse uses a files-and-folders structure; a data lake uses a flat architecture.
We try to make our model more accurate by tuning and tweaking the parameters, but we cannot make a 100% accurate model.
Prediction and classification models can never be error free.
Y = f ( x ) + e
Y : Response Variable | Dependent Variable
x : Independent variable
e : Irreducible error ( Even if we make a 100% accurate estimate of f ( x ), our model can't be error free; this is known as the irreducible error )
A function that takes the weighted sum of all the inputs from the previous layer, adds a bias, and generates the output for the next layer.
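A minimal NumPy sketch of that computation for one neuron; the sigmoid activation and the input/weight values are my own assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical inputs from the previous layer, with weights and a bias.
inputs = np.array([0.5, -1.2, 3.0])
weights = np.array([0.4, 0.1, -0.6])
bias = 0.2

z = np.dot(weights, inputs) + bias  # weighted sum of inputs plus bias
output = sigmoid(z)                 # activation -> output for the next layer
print(output)
```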
Hyperparameter Optimization
Finding the ideal set of hyperparameters for a prediction algorithm to achieve optimum performance.
Parameter | Hyperparameter
Automatically learned while training. | Manually tuned by the developer to guide the training.
Weights and biases are the model parameters. | Learning rate, depth of tree, class weights.
Internal configuration variables of the model. | External configuration variables of the model.
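A minimal tuning sketch with scikit-learn's GridSearchCV; the dataset and grid values are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Hyperparameters ( external ): tree depth, class weights.
grid = {"max_depth": [2, 4, 8], "class_weight": [None, "balanced"]}

search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=5)
search.fit(X, y)  # parameters ( internal ) are learned during each fit

print(search.best_params_)
```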
Data Warehouse | Data Lake
Structured + Pre-processed | Unstructured + Semi-Structured + Structured + Raw
Organized before storing | Organized before using
Business professionals, Analysts, BI and Visualizations | Data Scientists, Analytics and AI
DBMS | RDBMS
Stores data in the form of files | Stores data in the form of tables
Hierarchical arrangement of data | Rows and columns ( Tables )
Manages data in a computer | Maintains relationships of tables in a database
Classification | Clustering
Needs prior knowledge of data | No prior knowledge of data
Classifies new samples into known classes | Suggests groups based on patterns in data
Decision Tree | K Means
Labelled samples | Unlabelled samples
LDA | PCA
Linear Discriminant Analysis | Principal Component Analysis
Supervised | Unsupervised
K Means | K Nearest Neighbors ( KNN )
Unsupervised | Supervised
K : Number of clusters | K : Number of nearest neighbors
Determines the distance of each data point to the cluster centroids and assigns each point to the closest centroid | Calculates the distance between the new data point and its K nearest neighbours
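A minimal scikit-learn sketch contrasting the two; the Iris dataset and the values of K are assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# K Means ( unsupervised ): K = number of clusters, labels are ignored.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:5])

# KNN ( supervised ): K = number of neighbours, labels are required.
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict(X[:5]))
```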
Variance ( s² ) | Standard Deviation ( s )
Distance between the data points in the dataset | Distance of a data point from the mean of the dataset
Variance | Covariance
Magnitude | Magnitude and Direction
How data points vary from their mean | How data points vary with respect to each other
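A minimal NumPy sketch of all three quantities; the data values are made up:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

print(np.var(x, ddof=1))   # sample variance s² ( squared units )
print(np.std(x, ddof=1))   # sample standard deviation s ( same units as x )
print(np.cov(x, y)[0, 1])  # covariance: its sign gives the direction of co-movement
```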
Which Algorithm Generates the Best Model ? Judge candidates on accuracy and latency.
Accuracy : How do they handle data of different sizes ? | How do they handle the complexity of feature relationships ? | How do they handle messy data ( Missing Data + Outliers ) ?
Latency : How long will it take to train the model ? | How long will it take to predict the dependent variables ?
The correlation of a data point with a delayed copy of itself.
e.g. the temperature today vs the temperature yesterday or tomorrow.
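A minimal pandas sketch of lag-1 autocorrelation; the temperature values are made up:

```python
import pandas as pd

# Hypothetical daily temperatures.
temps = pd.Series([21.0, 22.5, 23.0, 21.5, 20.0, 22.0, 23.5])

# Correlation of the series with itself shifted by one day ( lag-1 ).
print(temps.autocorr(lag=1))
```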
A phenomenon in which at least two independent variables are linearly correlated ( one can be predicted from the other ).
Cross Join | Cartesian Product
Generates the paired combination of each row of the first table with each row of the second table.
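In SQL this is `SELECT * FROM a CROSS JOIN b`; a minimal pandas equivalent ( the table contents are made up, and `how="cross"` needs pandas >= 1.2 ):

```python
import pandas as pd

sizes = pd.DataFrame({"size": ["S", "M"]})
colors = pd.DataFrame({"color": ["red", "blue"]})

# Cross join / Cartesian product: every row of the first table
# paired with every row of the second ( 2 x 2 = 4 rows ).
pairs = sizes.merge(colors, how="cross")
print(pairs)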
Explore ( EDA ) and clean ( Data Cleaning ) the data.
Split data into train + validation + test sets.
Train an initial model and evaluate it.
Tune hyperparameters + cross-validation ( assurance of accuracy ).
Evaluate on the validation set ( performance ).
Evaluate on the test set ( prediction ).
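A minimal end-to-end sketch of that workflow in scikit-learn; the dataset, model, and split sizes are assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Split: train + validation + test ( 60 / 20 / 20 here ).
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Initial model + cross-validation on the training set ( assurance of accuracy ).
model = LogisticRegression(max_iter=5000)
print(cross_val_score(model, X_train, y_train, cv=5).mean())

model.fit(X_train, y_train)
print(model.score(X_val, y_val))    # tune against the validation set ( performance )
print(model.score(X_test, y_test))  # final held-out test evaluation ( prediction )
```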