Multicollinearity: Estimation and Elimination
Abstract
Multiple regression fits a model to predict a dependent (Y) variable from two or more independent (X) variables. If the model fits the data well, the overall R2 value will be high and the corresponding P value will be low. In addition to the overall P value, multiple regression also reports an individual P value for each independent variable. A low individual P value means that that particular independent variable significantly improves the fit of the model. It is calculated by comparing the goodness of fit of the entire model to the goodness of fit when that independent variable is omitted. If the fit is much worse when the variable is omitted, its P value will be low, indicating that the variable makes a significant contribution to the model.
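As a minimal illustrative sketch (not taken from the paper, and using simulated data), the overall and individual P values described above can be obtained with an ordinary least squares fit; the individual P value for a variable agrees with a partial F test comparing the full model to the model that omits that variable:

```python
# Sketch with simulated data: overall vs. individual P values in multiple regression.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)   # Y depends on both X variables

X_full = sm.add_constant(np.column_stack([x1, x2]))
full = sm.OLS(y, X_full).fit()

print("Overall R^2:", full.rsquared)             # goodness of fit of the whole model
print("Overall P value:", full.f_pvalue)         # P value of the overall F test
print("Individual P values:", full.pvalues[1:])  # one P value per X variable

# The individual P value for x2 compares the full model with the model
# that omits x2 (a partial F test).
X_reduced = sm.add_constant(x1)
reduced = sm.OLS(y, X_reduced).fit()
f_stat, p_value, df_diff = full.compare_f_test(reduced)
print("P value for adding x2:", p_value)
```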
In some cases, multiple regression results may seem paradoxical: even though the overall P value is very low, all of the individual P values are high. This means that the model fits the data well even though no single X variable has a statistically significant impact on predicting Y. This happens when the independent variables are highly correlated with one another. In that case, neither variable may contribute significantly to the model once the other is included, yet together they contribute a great deal; if both were removed from the model, the fit would be much worse. So the overall model fits the data well, but neither X variable makes a significant contribution when it is added after the other. When this happens, the X variables are said to be collinear and the results exhibit multicollinearity.
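A small simulated sketch (again assumed for illustration, not from the paper) reproduces this paradox with two nearly identical predictors, and also computes variance inflation factors (VIF), a common diagnostic for collinearity:

```python
# Sketch with simulated data: low overall P value, high individual P values,
# and VIF as a collinearity diagnostic.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)        # x2 is almost identical to x1
y = x1 + x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

print("Overall P value:", fit.f_pvalue)          # very low: the model fits well
print("Individual P values:", fit.pvalues[1:])   # both can be high: neither X adds
                                                 # much once the other is included

# VIF for each predictor; values far above roughly 5-10 signal collinearity
for i in (1, 2):
    print("VIF:", variance_inflation_factor(X, i))
```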
The best solution is to understand the cause of the multicollinearity and remove it. This paper presents methods for identifying and eliminating multicollinearity in order to obtain a best-fit model.