- A statistical method used to estimate the relationship between a dependent variable and one or more independent variables.
- Independent Variables (x) | Features Matrix (An array of numbers, one or more rows, one or more columns)
- Dependent Variable (Y) | Target Vector (A list of numbers, can be in a single row or column)
- m (Slope: The rate of change of Y with respect to x) and c (Intercept: The value of Y when x is 0)
- Residual | Error: The difference between actual and predicted values.
- A straight line can represent the linear relationship between two variables.
- Predict a continuous numeric dependent variable based on one or more independent variables.
- Predict a best-fit line (finding a regression line that best fits the data) with the least errors or residuals.
- Learning a linear regression model means estimating the values of the regression coefficients (slope and intercept)
- Linear regression is sensitive to overfitting and outliers.
- These issues can be mitigated using dimensionality reduction, regularization, standardization and cross-validation.
- Linear regression predicts a target vector's value based on the feature matrix's value.
- The parameters m and c are learnt by the algorithm based on the data point pairs of (x, y)
- Linear regression also relies on a few statistical assumptions.
- There are also a few metrics to evaluate how well the model has learnt from the data.
- y = m * x + c (m and c are also called regression coefficients)
- Indicates how much the dependent variable will change if an independent variable changes by one unit.
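A minimal sketch of learning m and c from (x, y) pairs, assuming scikit-learn is available; the experience/salary numbers below are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical (x, y) pairs: years of experience vs. salary
x = np.array([[1], [2], [3], [4], [5]])             # feature matrix: one column, one row per sample
y = np.array([30000, 35000, 41000, 44000, 50000])   # target vector

model = LinearRegression().fit(x, y)
m, c = model.coef_[0], model.intercept_             # learnt regression coefficients
print(f"slope m = {m:.2f}, intercept c = {c:.2f}")
print("prediction for x = 6:", model.predict([[6]])[0])  # y = m * 6 + c
```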
Data visualization is one of the best ways to check the dataset's relationship, distribution and variance.
- Scatter Plots:
- Plot the independent variable(s) on the x-axis and the dependent variable on the y-axis.
- The pattern of the data points can reveal the direction and strength of the relationship.
- A positive slope suggests a positive relationship (as x increases, y increases)
- A negative slope suggests a negative relationship (as x increases, y decreases)
- A random scatter with no visible pattern suggests little or no linear relationship.
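A quick way to eyeball direction and strength with a scatter plot, assuming matplotlib; the arrays below are placeholder data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data: replace with your own feature and target values
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1])

plt.scatter(x, y)                         # each point is one (x, y) observation
plt.xlabel("Independent variable (x)")
plt.ylabel("Dependent variable (y)")
plt.title("Upward-sloping pattern -> positive linear relationship")
plt.show()
```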
- Correlation Coefficient (r):
- The strength and direction of the linear relationship between independent and dependent variables.
- It ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation), with 0 indicating no linear correlation.
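A small sketch for computing r with NumPy; the data is illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 12.1])

r = np.corrcoef(x, y)[0, 1]    # Pearson correlation coefficient
print(f"r = {r:.3f}")          # close to +1 -> strong positive linear relationship
```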
- Regression Analysis:
- By fitting a model to your data, you can estimate the effect of changes in the independent variable(s) on the dependent variable.
- The model's coefficients indicate how much the dependent variable changes for a one-unit change in an independent variable.
- Residual Analysis:
- The difference between the actual and predicted values of the dependent variable can reveal potential issues with the model.
- Randomly scattered residuals suggest a good fit, while patterns in the residuals indicate potential problems like non-linearity or outliers.
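A rough residual-plot sketch, assuming scikit-learn and matplotlib; the synthetic data below stands in for real observations.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic data: a linear trend plus random noise
rng = np.random.default_rng(0)
x = np.arange(1, 21, dtype=float).reshape(-1, 1)
y = 3 * x.ravel() + 5 + rng.normal(0, 2, size=20)

model = LinearRegression().fit(x, y)
residuals = y - model.predict(x)          # actual - predicted

plt.scatter(model.predict(x), residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.title("Random scatter around zero suggests a good fit")
plt.show()
```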
- R squared | Adjusted R Squared:
- Train the model with different feature subsets and evaluate their performance (e.g. R squared) on a validation set.
- The subset with the best performance is chosen (a sketch follows below).
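A rough sketch of choosing a feature subset by validation performance, assuming scikit-learn; the features and target below are synthetic.

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: only the first two of three features actually drive the target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=100)

best_score, best_subset = -np.inf, None
for k in range(1, X.shape[1] + 1):
    for subset in combinations(range(X.shape[1]), k):
        score = cross_val_score(LinearRegression(), X[:, subset], y,
                                cv=5, scoring="r2").mean()
        if score > best_score:
            best_score, best_subset = score, subset

print("best feature subset:", best_subset, "| mean R^2:", round(best_score, 3))
```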
- The point where the regression line intersects the Y-axis.
- Value of Y when x = 0 (the m * x term contributes nothing).
Residual | Error | e | Noise = Actual - Predicted
- Only one independent variable and one dependent variable (Continuous Numeric)
- Use statistics to estimate the coefficients i.e. slope (m) and intercept (c).
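The same coefficients can be estimated directly from the data's statistics; a minimal sketch using the usual least-squares formulas on illustrative numbers.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.0, 4.1, 6.1, 7.9, 10.2])

# Least-squares estimates: m = cov(x, y) / var(x), c = mean(y) - m * mean(x)
m = np.cov(x, y, bias=True)[0, 1] / np.var(x)
c = y.mean() - m * x.mean()
print(f"m = {m:.3f}, c = {c:.3f}")
```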
- More than one independent variable and only one dependent variable (Continuous Numeric)
- Consider features that have a good correlation with the dependent variable but not with each other (no multicollinearity).
- Multicollinearity: one independent feature can be predicted almost completely from another independent feature.
- Linearity: The relationship between the independent variable (x) and the dependent variable (y) is linear.
- Independence: The data points should be independent.
- Normality: The errors/residuals of the data points should be normally distributed.
- No Multicollinearity: The independent variables should not be highly correlated.
- Homoscedasticity: The variance of the residuals should remain constant across all values of the independent variable(s).
- Quantile-Quantile (Q-Q) Plot: The quantiles of the residuals should lie close to the reference line, indicating they are normally distributed (a sketch of these checks follows below).
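A rough sketch of two common assumption checks (normality of residuals via a Q-Q plot, constant variance via a residuals-vs-fitted plot), assuming scipy, scikit-learn and matplotlib; the data is synthetic.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.linear_model import LinearRegression

# Synthetic data: linear trend with normally distributed noise
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100).reshape(-1, 1)
y = 2.5 * x.ravel() + 4 + rng.normal(0, 1.5, size=100)

model = LinearRegression().fit(x, y)
residuals = y - model.predict(x)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
stats.probplot(residuals, dist="norm", plot=ax1)   # Q-Q plot: points near the line -> normal residuals
ax1.set_title("Normality check (Q-Q plot)")
ax2.scatter(model.predict(x), residuals)           # roughly constant spread -> homoscedasticity
ax2.axhline(0, linestyle="--")
ax2.set_title("Homoscedasticity check")
plt.show()
```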
- Covariance measures how much two variables change together (Direction of the linear relationship)
- Positive Covariance: Two variables move in the same direction.
- Negative Covariance: Two variables move in opposite directions.
- Correlation is a standardized version of covariance, that measures the strength and direction of a linear relationship.
- Measures how closely two variables are related and how well one variable can predict the other.
- Varies between -1 (Perfect Negative Correlation) to +1 (Perfect Positive Correlation)
Absolute value of r | Strength of Correlation |
---|---|
0.0 | No Correlation |
0.1 - 0.3 | Little Correlation |
0.3 - 0.5 | Medium Correlation |
0.5 - 0.7 | High Correlation |
0.7 - 1.0 | Very High Correlation |
- Covariance reveals only the direction in which two variables change together, not the strength of the relationship.
- Correlation determines the strength and direction of a linear relationship between two variables.
- Linear Relationship: A relationship between variables where a change in one is associated with a proportional change in the other
- Both covariance and correlation measure the relationship and the dependency between two variables.
- Unlike Covariance, Correlation values are standardized.
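A small sketch contrasting the two measures with NumPy on made-up numbers.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([10, 20, 31, 39, 52], dtype=float)

cov_xy = np.cov(x, y)[0, 1]           # direction only; magnitude depends on the units of x and y
corr_xy = np.corrcoef(x, y)[0, 1]     # standardized: always between -1 and +1
print(f"covariance = {cov_xy:.2f}, correlation = {corr_xy:.3f}")
```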
- A measure of how much the variance of the regression coefficient is inflated due to collinearity between the independent variables.
- VIF = 1 indicates that there is no collinearity between the independent variables.
- VIF > 10 indicates that there is high collinearity between the independent variables.
- We should consider removing the independent variable to reduce the VIF.
- VIF does not tell us which independent variable should be removed.
- VIF can be higher when there are many independent features or when features are on high or very different scales.
VIF = 1 / (1 - R2)
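A sketch of computing VIF per feature, assuming statsmodels and pandas; the column names and data are hypothetical, with "age" and "experience" deliberately correlated.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Hypothetical features: "experience" is almost a linear function of "age"
rng = np.random.default_rng(0)
age = rng.normal(40, 10, 200)
experience = age - 22 + rng.normal(0, 1, 200)
hours = rng.normal(40, 5, 200)
X = add_constant(pd.DataFrame({"age": age, "experience": experience, "hours": hours}))

for i, col in enumerate(X.columns):
    if col != "const":
        print(col, round(variance_inflation_factor(X.values, i), 2))  # VIF = 1 / (1 - R2)
```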
Correlation Coefficient (r) | Relationship (Numerical measure of the strength of a linear relationship) |
---|---|
0 | No Correlation between two variables |
1 | Perfect Positive Correlation (Directly proportional, as one variable increases, the other also increases) |
-1 | Perfect Negative Correlation (Indirectly proportional, as one variable increases, the other decreases) |
- Compare the means of two separate groups to test whether there is a significant difference between them.
- Measure of the strength and direction of the relationship between two variables.
- Null Hypothesis (H0): There is no difference in the mean.
- Alternate Hypothesis (H1): There is a difference in the mean.
- A p-value is calculated; if the p-value > 0.05, we fail to reject the Null Hypothesis (H0).
- One sample t-test: This compares the mean of one group to a specific hypothesized value.
- Two sample t-test: This compares the means of two independent groups.
- Assumptions: The data is normally distributed and the variances of the two groups are similar.
Confidence Level | Significance Threshold (p-value) |
---|---|
95% | 0.05 |
99% | 0.01 |
99.9% | 0.001 |
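A minimal two-sample t-test sketch, assuming scipy; the two groups are fabricated for illustration.

```python
import numpy as np
from scipy import stats

# Fabricated samples from two independent groups
group_a = np.array([23.1, 25.3, 24.8, 26.0, 25.5, 24.2, 23.9])
group_b = np.array([27.4, 28.1, 26.9, 29.0, 28.3, 27.7, 28.6])

t_stat, p_value = stats.ttest_ind(group_a, group_b)   # two-sample (independent) t-test
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: the group means differ significantly")
else:
    print("Fail to reject H0: no significant difference detected")
```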
- Generally occurs when there is a high correlation between two or more independent variables.
- One independent variable can be used to predict the other. This creates redundant information.
- The regression equation becomes unstable and creates confusion when interpreting the coefficients.
- Observations (rows) and features (columns) should be independent.
- Remove one of the correlated features to prevent multicollinearity and make the regression stable.
- e.g. Experience vs Salary, Height vs Weight, Age of Car vs Car Price.
- Tolerance | T = 1 - R2 (T < 0.1 | There is multicollinearity)
- Variance Inflation Factor | VIF = 1 / (1 - R2) (VIF > 10 | There is multicollinearity)
- Correlation does not imply causation (relationship between cause and its effect)
- One variable affects another variable (Temperature affects ice cream sales | Ice cream sales are higher in summer)
- A strong correlation between two variables does not necessarily mean that one causes the other.
- There may be other factors influencing the relationship.
- Rescale independent features using standardization or normalization for more reliable predictions (a sketch follows below).
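A minimal rescaling sketch, assuming scikit-learn; the feature matrix below is hypothetical (e.g. age in years and income in dollars, on very different scales).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales
X = np.array([[25, 40000],
              [32, 60000],
              [47, 82000],
              [51, 120000]], dtype=float)

standardized = StandardScaler().fit_transform(X)   # each column: mean 0, std 1
normalized = MinMaxScaler().fit_transform(X)       # each column rescaled to [0, 1]
print(standardized.round(2))
print(normalized.round(2))
```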
- The difference between the actual value and the predicted value.
- Error should be as low as possible (Complete removal of error is impossible)
- Positive Residual: The actual value is above the regression line.
- Negative Residual: The actual value is below the regression line.
Making predictions outside the range of the observed data (extrapolation) is unreliable, especially in the presence of outliers.