

## Example 1: Model validation of the assumptions of linear regression in the Fama-French 3-Factor Model

1. Checking multicollinearity of the features (independent variables) w/ a correlation matrix

2. Checking linearity w/ scatter plots

3. Checking independence of residuals w/ the autocorrelation function (ACF) and the Durbin-Watson (D-W) test

4. Checking normality of residuals w/ a histogram

5. Checking homoscedasticity (equal variance) of residuals w/ a scatter plot of residuals vs. fitted values
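
A minimal sketch of these five checks, assuming a DataFrame `ff_data` holding the three factors and an excess-return column (the column names are hypothetical, and synthetic data stands in for real factor returns):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Synthetic stand-in for real Fama-French factor data (hypothetical names)
rng = np.random.default_rng(0)
ff_data = pd.DataFrame(rng.normal(size=(250, 3)), columns=["Mkt-RF", "SMB", "HML"])
ff_data["Excess_Return"] = (0.9 * ff_data["Mkt-RF"] + 0.3 * ff_data["SMB"]
                            + rng.normal(scale=0.5, size=250))

factors = ff_data[["Mkt-RF", "SMB", "HML"]]
y = ff_data["Excess_Return"]

# 1. Multicollinearity: pairwise correlations among the factors
print(factors.corr())

# 2. Linearity: scatter plots of each factor against the response
pd.plotting.scatter_matrix(ff_data, figsize=(8, 8))

# Fit the 3-factor regression and pull the residuals
X = sm.add_constant(factors)
model = sm.OLS(y, X).fit()
resid = model.resid

# 3. Independence of residuals: ACF plot and D-W statistic (values near 2 are good)
sm.graphics.tsa.plot_acf(resid, lags=20)
print("Durbin-Watson:", durbin_watson(resid))

# 4. Normality of residuals: histogram
resid.hist(bins=30)

# 5. Homoscedasticity: residuals vs. fitted values should show no fan/funnel pattern
plt.figure()
plt.scatter(model.fittedvalues, resid)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```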




Consequences of violating these assumptions:

1. Multicollinearity = redundancy: it becomes difficult for the model to tell which feature is actually contributing to predicting the target

2. Non-linearity: the model won't capture the relationship closely, leading to large fitting errors

3. Autocorrelation in residuals: the model is missing something important; check for an omitted feature

4. Non-normality of residuals: statistical tests that assume normally distributed residuals won't hold; apply transformations to the features

5. Heteroscedasticity (no equal variance) of residuals: less precise estimates and unreliable standard errors
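
Beyond the correlation matrix, one common numeric check for consequence 1 is the variance inflation factor (VIF); a short sketch with statsmodels, reusing the `factors` frame from the sketch above:

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF above roughly 5-10 is a common rule of thumb for problematic
# multicollinearity; the value for the constant term can be ignored.
X_vif = sm.add_constant(factors)
for i, name in enumerate(X_vif.columns):
    print(name, variance_inflation_factor(X_vif.values, i))
```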



## Example 2: Model validation and tuning in Random Forest regression on continuous data

1. Get the data

2. Define the target (y) and features (X)

3. Split the data into training and testing set (validation if required)

4. Initialize a model, set parameters, and fit the training set | X_train, y_train

5. Predict on X_test

6. Evaluate accuracy or error metrics on y_test | e.g., R squared

7. Bias-Variance trade-off check | Balancing underfitting and overfitting

8. Iterate to tune the model (from step 4)

9. Cross Validation | if model not generalizing well

10. Selecting the best model w/ Hyperparameter tuning
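
A minimal end-to-end sketch of this workflow with scikit-learn (the CSV file name and the `target` column are hypothetical placeholders):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import r2_score

# 1-2. Get the data, define the target (y) and features (X)
df = pd.read_csv("data.csv")          # hypothetical file
y = df["target"]                      # hypothetical target column
X = df.drop(columns=["target"])

# 3. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 4. Initialize a model, set parameters, and fit on the training set
rf = RandomForestRegressor(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

# 5-6. Predict on X_test and score against y_test
y_pred = rf.predict(X_test)
print("Test R^2:", r2_score(y_test, y_pred))

# 7. Bias-variance check: compare training vs. testing performance
print("Train R^2:", rf.score(X_train, y_train))

# 9. Cross-validation if the model is not generalizing well
print("CV R^2:", cross_val_score(rf, X, y, cv=5, scoring="r2").mean())

# 10. Select the best model w/ hyperparameter tuning
grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
    cv=5, scoring="r2",
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```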




A few details:

### Bias-Variance trade-off


Bias = failing to find the relationship b/w the data and the response = ERROR due to OVERLY SIMPLISTIC models (underfitting)

Variance = following the training data too closely = ERROR due to OVERLY COMPLEX models (overfitting) that are SENSITIVE TO FLUCTUATIONS (noise) in the training data


High Bias + Low Variance: Underfitting (simpler models)
Low Bias + High Variance: Overfitting (complex models)

Training error high = Underfitting

Testing error >> Training error = Overfitting
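
A quick way to apply these rules of thumb, reusing the fitted `rf` and the splits from Example 2 (the thresholds below are arbitrary illustrations, not standard cutoffs):

```python
# Compare train vs. test R^2 to diagnose under/overfitting (higher is better)
train_r2 = rf.score(X_train, y_train)
test_r2 = rf.score(X_test, y_test)

if train_r2 < 0.5:                  # threshold chosen only for illustration
    print("Low training score -> likely underfitting")
elif train_r2 - test_r2 > 0.15:     # gap threshold chosen only for illustration
    print("Train score >> test score -> likely overfitting")
else:
    print("Reasonable bias-variance balance")
```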


### Cross Validation - an efficient method to find the balance

(Cross-validation diagram by sharpsightlabs.com)

Split the data into distinct subsets; each subset is used once as the test set while the remaining subsets form the training set. Results from all splits are averaged.


Why use?
  • Better generalization: helps when a model is not generalizing well (generalization = performing well on new, unseen data, not just the training data)
  • Reliable evaluation
  • Efficient use of data (when data is limited)

Types:

  1. K-fold cross-validation | run w/ sklearn's cross_val_score
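
A short sketch of k-fold scoring with `cross_val_score` (5 folds here), reusing `rf`, `X`, and `y` from Example 2:

```python
from sklearn.model_selection import cross_val_score

# Each of the 5 folds is held out once as the test set; results are averaged
scores = cross_val_score(rf, X, y, cv=5, scoring="r2")
print("Per-fold R^2:", scores)
print("Mean R^2:", scores.mean())
```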


  2. Leave-one-out cross-validation (LOOCV)

Use when data is limited, but note it is computationally expensive.
Each data point is used once as the test set, i.e., one fold per observation: `cv = X.shape[0]`
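
A minimal LOOCV sketch, again reusing `rf`, `X`, and `y` from Example 2 (R squared is undefined on a single-point test set, so MSE is scored instead):

```python
from sklearn.model_selection import cross_val_score

# cv = number of samples gives one fold per observation (LOOCV);
# score negative MSE because R^2 cannot be computed on one test point
scores = cross_val_score(rf, X, y, cv=X.shape[0], scoring="neg_mean_squared_error")
print("LOOCV MSE:", -scores.mean())
```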