5. Checking Homoscedasticity (equal variance) of residuals with a scatter plot of residuals vs. fitted values; the spread of residuals should stay roughly constant across all fitted values
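A minimal sketch of this check, assuming scikit-learn and matplotlib; the data is synthetic and all names (model, fitted, residuals) are illustrative. Look for an even band of points around zero; a funnel shape suggests heteroscedasticity.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)

model = LinearRegression().fit(X, y)
fitted = model.predict(X)        # fitted (predicted) values
residuals = y - fitted           # residual = actual - predicted

plt.scatter(fitted, residuals, alpha=0.5)
plt.axhline(0, color="red", linestyle="--")  # zero reference line
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs. Fitted")
plt.show()
```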
Consequences:
1. Multicollinearity = redundancy = it becomes difficult for the model to determine which feature is actually contributing to predicting the target (coefficient estimates become unstable); see the VIF sketch after this list
4. Non-normality of residuals = the normality assumption behind hypothesis tests and confidence intervals on the coefficients won't hold. Apply transformations to the features (or the target, e.g., a log transform).
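A minimal sketch of a multicollinearity check using the variance inflation factor (VIF) from statsmodels; the feature matrix here is synthetic, with x2 deliberately built as a near-copy of x1. A common rule of thumb flags VIF above roughly 5 to 10 as problematic.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Synthetic features: x2 is nearly a copy of x1 (redundant), x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = pd.DataFrame({"x1": x1,
                  "x2": 0.9 * x1 + rng.normal(scale=0.1, size=200),
                  "x3": rng.normal(size=200)})

Xc = add_constant(X)  # VIF is computed with an intercept column included
vif = pd.Series([variance_inflation_factor(Xc.values, i) for i in range(Xc.shape[1])],
                index=Xc.columns)
print(vif)  # ignore the 'const' row; x1 and x2 should be large, x3 near 1
```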
A Few Details:
Bias = failing to capture the relationship between the features and the response = ERROR due to OVERLY SIMPLISTIC models (underfitting)
Variance = following training data too closely = ERROR due to OVERLY COMPLEX models (overfitting) that are SENSITIVE TO FLUCTUATIONS (noise) in the training data
High Bias + Low Variance: Underfitting (simpler models)
Low Bias + High Variance: Overfitting (complex models)
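A minimal sketch illustrating both regimes on synthetic 1-D data (all names are illustrative): a degree-1 polynomial underfits (high error on both splits), while a degree-15 polynomial overfits (low training error, high test error).

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Noisy sine curve as the true relationship
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 15):  # 1 = high bias (underfit), 15 = high variance (overfit)
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_train, y_train)
    print(f"degree={degree:2d}  "
          f"train MSE={mean_squared_error(y_train, model.predict(X_train)):.3f}  "
          f"test MSE={mean_squared_error(y_test, model.predict(X_test)):.3f}")
```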
Cross-Validation (CV):
Splitting data into distinct subsets (folds). Each subset is used exactly once as the test set while the remaining folds form the training set; results from all splits are averaged.
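A minimal sketch of that split-and-average loop using scikit-learn's KFold (synthetic data, illustrative names):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))  # R^2 on the held-out fold

print(np.mean(scores))  # average over all 5 splits
```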
Why use?
- Better Generalization: CV gives a more realistic estimate of whether the model generalizes well (generalization refers to a model's ability to perform well on new, unseen data, not just the data it was trained on)
- Reliable Evaluation: the score doesn't hinge on one lucky or unlucky train/test split
- Efficient use of data: every observation is used for both training and testing (useful when data is limited)
Types:
- K-Fold: the standard scheme; scikit-learn's cross_val_score runs it for you (cv=5 by default)
- Leave-One-Out Cross-Validation (LOOCV)
Use when data is limited; computationally expensive, since the model is refit once per sample
Each single data point is used as the test set exactly once
cv = X.shape[0] (one fold per sample)
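A minimal sketch of both options (synthetic data, illustrative names); cv=X.shape[0] and passing a LeaveOneOut splitter are equivalent here:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, LeaveOneOut

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)

# Standard k-fold CV via cross_val_score (5 folds)
kfold_scores = cross_val_score(LinearRegression(), X, y, cv=5)
print("5-fold mean R^2:", kfold_scores.mean())

# LOOCV: one fold per sample; cv=X.shape[0] would behave the same way.
# R^2 is undefined on a single held-out point, so score with MSE instead.
loo_scores = cross_val_score(LinearRegression(), X, y, cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error")
print("LOOCV mean MSE:", -loo_scores.mean())
```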