Multicollinearity arises when explanatory variables in a regression are not independent of one another. When I was taught applied econometrics, I was told not to worry about multicollinearity because the problem is very rare and, if it resulted from a misspecification, my software would warn me about it.

However, it seems that my teachers were too optimistic and that it is still worth warning people about the problem and teaching them how to identify it. Indeed, no one seems to have noticed the issue in a paper by Ingrid Rohde and Kirsten Rohde, which was published this year in the Journal of Risk and Uncertainty:

Rohde, I. M., & Rohde, K. I. (2015). Managing social risks–tradeoffs between risks and inequalities. Journal of Risk and Uncertainty, 51(2), 103-124. http://link.springer.com/article/10.1007/s11166-015-9224-5?view=classic

The authors look at a problem that I am also working on, namely how people react to social risks. Social risks are risks that affect me and another person at the same time but possibly in different ways. One issue that I find interesting is that risks that lead to ex-post equality are particularly risky at the social level, while risks that lead to ex-post inequality are less so. If people dislike ex-post inequality, then they have to accept higher levels of collective risk. But do people care about collective risk? That is the question I investigate in my working paper:

Gaudeul, Alexia, (2016), Social preferences under risk: Minimizing collective risk vs. reducing ex-post inequality, CEGE Discussion Papers No 283, University of Göttingen, Department of Economics, http://EconPapers.repec.org/RePEc:zbw:cegedp:283.

I can illustrate the negative relation between those two aspects of risk in the following graph, where my payoffs are in red and the payoffs of the other person are in blue. There are two possible futures depending on a random event. One future and its resulting payoffs is on top, the other future is on the bottom. This first graph shows a situation with a high level of collective risk where payoffs are equal ex-post.

The second graph below shows a safe social situation (overall wealth does not vary) where the distribution of payoffs changes depending on the random event. Note that the risk for me (blue) is the same in both graphs.

What Rohde and Rohde tried to do is to look at people's choices between social lotteries, index the lotteries by their level of collective risk and ex-post inequality, and determine what drives those choices. The issue is that collective risk and ex-post inequality are mathematically related, so the regressions in Rohde and Rohde suffer from multicollinearity.

To illustrate, if I take the authors’ indexes of collective risk and ex-post inequality and other indexes of the properties of their lotteries (Table 2), I find that

*ex-post inequality = -0.1 × collective risk - 0.1 × ex-ante inequality + 1.1 × individual risk*

with an R² of *97%*.
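The kind of check described above, regressing one index on the others and reading off the R², can be sketched as follows. This is only an illustration on synthetic data: the sample size, the standard-normal regressors and the noise level are assumptions, and only the three coefficients are taken from the equation reported in this post. It does not use the authors' data and is not meant to reproduce the 97% figure.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_squared(y, columns):
    """R^2 from an OLS regression of y on the given columns plus an intercept."""
    rows = [[1.0] + list(vals) for vals in zip(*columns)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)  # normal equations: (X'X) beta = X'y
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-ins for the lottery indexes (assumed, not the authors' data).
rng = random.Random(0)
n = 200
collective = [rng.gauss(0, 1) for _ in range(n)]
ex_ante = [rng.gauss(0, 1) for _ in range(n)]
individual = [rng.gauss(0, 1) for _ in range(n)]
# Coefficients from the post's equation; the noise scale 0.2 is an assumption.
ex_post = [
    -0.1 * c - 0.1 * a + 1.1 * i + rng.gauss(0, 0.2)
    for c, a, i in zip(collective, ex_ante, individual)
]

r2 = r_squared(ex_post, [collective, ex_ante, individual])
print(f"R^2 of ex-post inequality on the other indexes: {r2:.3f}")
```

An R² close to 1 in such an auxiliary regression means that the left-hand variable carries almost no information that is not already in the other regressors.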

When I checked the authors’ regressions with their data, which they sent me, I found that the variance inflation factor (VIF) in their main regression (Table 6) was 11.45, which is very high: a common rule of thumb treats values above 10 as a sign of serious multicollinearity.
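For readers unfamiliar with the VIF: for a given regressor it equals 1 / (1 - R²), where R² comes from regressing that regressor on all the others. The small sketch below converts between the two. The function names are made up for illustration, and reading 11.45 as the VIF of a single regressor (rather than, say, an average over regressors) is an assumption.

```python
def vif_from_r2(r2):
    """Variance inflation factor implied by an auxiliary-regression R^2."""
    if not 0.0 <= r2 < 1.0:
        raise ValueError("R^2 must lie in [0, 1); R^2 = 1 is perfect collinearity")
    return 1.0 / (1.0 - r2)


def r2_from_vif(vif):
    """Auxiliary-regression R^2 implied by a variance inflation factor."""
    if vif < 1.0:
        raise ValueError("a VIF cannot be smaller than 1")
    return 1.0 - 1.0 / vif


# An auxiliary R^2 of 97% corresponds to a VIF of roughly 33.
print(round(vif_from_r2(0.97), 1))
# A VIF of 11.45 corresponds to an auxiliary R^2 of roughly 0.91.
print(round(r2_from_vif(11.45), 3))
```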

Other indications that something was wrong were the change in the sign of the parameter on ex-post inequality in their regressions depending on the specification (Table 6, γ_{post}=0.680, then -0.295, then -0.093) and the authors’ inability to distinguish between their hypotheses in a non-parametric way (page 117).

Finally, and most significantly in my opinion, the authors’ conclusion on page 119, which is that people like ex-post inequality and higher social risk, flies in the face of common sense — and worse yet, contradicts my findings ☺ I think I would have missed the econometric issues with their paper if the authors’ conclusion had been less preposterous.

In my own paper on the topic, I look at the same issue but rather than vainly trying to distinguish those two aspects of social risk econometrically, I underline ex-post inequality in one condition, and I underline collective risk in the other, by changing the visual representation of payoffs. In the graphs below, I show the same social lottery, but in the first case payoffs are shown side-by-side, so the focus is on ex-post inequality, while in the second case payoffs are shown added-up, so the focus is on collective risk.

Differences in choices across subjects who were presented with one or the other representation allowed me to conclude that people care most about ex-post inequality and only very little about social risk. Beyond those results, I also found that only a minority of subjects consider the risk carried by the other subject in their decision. However, the preferences of that minority are sufficiently one-sided to have a significant overall effect against ex-post inequality in outcomes.

More details in my paper: http://EconPapers.repec.org/RePEc:zbw:cegedp:283

Alexia, your opening paragraph is misleading. Collinearity (the “multi” is redundant) is ubiquitous in econometric applications. We don’t, though, need to worry about it in the following fundamental sense: standard errors (and other measures of sampling variability) fully reflect the collinearity. There remains an issue: there is less information per observation if the covariates are highly related, but this is essentially the same issue as just having a smaller sample size, and thus neither of these problems is a challenge to inference in the sense that model misspecification is a challenge to inference.

I think perhaps what you have in mind in your opening paragraph is *perfect* collinearity, which means the covariates are linearly dependent, and in turn that the model is not identified by the sample, and your software will indeed warn you about it (Stata, for example, will just drop variables until the model is identified).
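Perfect collinearity can be made concrete with a rank check on the design matrix. The sketch below uses made-up integer data in which the third regressor is exactly the sum of the first two, and exact rational arithmetic so that the rank is not blurred by rounding. Dropping the redundant column restores full column rank, which is in the spirit of what the comment describes Stata doing, though this is not how any particular package implements it.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a matrix via exact row reduction over the rationals."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    found = 0  # number of pivots found so far
    for col in range(n_cols):
        pivot = next((r for r in range(found, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue  # this column depends linearly on earlier ones
        m[found], m[pivot] = m[pivot], m[found]
        for r in range(found + 1, n_rows):
            f = m[r][col] / m[found][col]
            m[r] = [a - f * b for a, b in zip(m[r], m[found])]
        found += 1
    return found


# Hypothetical data: columns are intercept, x1, x2 and x3 = x1 + x2.
pairs = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
design = [[1, x1, x2, x1 + x2] for x1, x2 in pairs]
reduced = [row[:3] for row in design]  # same matrix without x3

print(matrix_rank(design), "of", len(design[0]), "columns")    # rank deficient
print(matrix_rank(reduced), "of", len(reduced[0]), "columns")  # full column rank
```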

I am not quite sure the terminology is as settled as you seem to think. In my understanding, collinearity is between two variables, while multi-collinearity is when three or more variables are in a linear relation with each other. As for anything less than perfect collinearity not being an issue, this is true only in the limit, if you have a very large sample. I must say I have found many different and partially conflicting accounts of the issue. My choice of terms corresponds to what I have seen most commonly used in the field.

Another issue I do not mention in the blog post is that the variables used in the regression are an arbitrary choice of measures of an underlying risk concept, with the relation between the underlying concepts not being clearly defined.

Alexia, putting aside the semantic issue, what I said about (multi)collinearity is in fact very well-settled and you can find versions of the results I outlined in any standard textbook on econometrics or regression analysis.

“Econometrics texts devote many pages to the problem of multicollinearity in multiple regression, but they say little about the closely analogous problem of small sample size,” Arthur Goldberger noted many years ago, continuing, “Perhaps the imbalance is attributable to the lack of an exotic polysyllabic name for ‘small sample size.’ If so, we can remove that impediment by introducing the term ‘micronumerosity.’”

His point, of course, is the same as the one I made above. Collinearity is not an issue, even in finite samples, except in the same sense as small samples themselves are an issue: there is just less information present in the data, but our measures of sampling variability correctly reflect that lack of information.
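The parallel between collinearity and small samples can be read off the textbook expression for the sampling variance of an OLS slope. It is given here under the usual assumptions of a linear model with an intercept and homoskedastic errors; the symbols are introduced for this note and do not come from the thread.

```latex
\operatorname{Var}\!\left(\hat{\beta}_j \mid X\right)
  = \frac{\sigma^2}{SST_j \,\bigl(1 - R_j^2\bigr)},
\qquad
SST_j = \sum_{i=1}^{n} \bigl(x_{ij} - \bar{x}_j\bigr)^2
```

Here σ² is the error variance and R_j² is the R² from regressing regressor j on the other regressors, so that 1 / (1 - R_j²) is the variance inflation factor. A high R_j² and a small SST_j (which is what a small n produces) enter the denominator in the same way: both raise the variance, and the reported standard errors reflect either one.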

Collinearity causes two problems, one statistical and one mathematical. The statistical problem is that the variance of the estimates is large. The mathematical one is that the normal equations (the equations that are being solved to find the estimates) become ill-conditioned. That is to say, the estimates become unstable – the slightest change in the input data causes large changes in the solution set delivered as output. Good software (Mathematica, Matlab, SAS, SPSS, R) will warn you of this but some software (e.g. Excel) won’t. At the point of perfect collinearity your software will tell you that you are trying to invert a singular matrix, which is not possible.
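The ill-conditioning point can be seen in a toy example. The 2×2 system below stands in for the normal equations (X'X)b = X'y with two nearly identical regressors; the numbers are invented for illustration and do not come from any dataset. Changing one entry of the right-hand side in the fourth decimal place moves the solution from about (1, 1) to about (0, 2).

```python
def solve_2x2(a, b):
    """Solve a 2x2 linear system a x = b with Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    if det == 0:
        raise ValueError("singular matrix: the regressors are perfectly collinear")
    x0 = (b[0] * a[1][1] - a[0][1] * b[1]) / det
    x1 = (a[0][0] * b[1] - b[0] * a[1][0]) / det
    return x0, x1


# Hypothetical X'X for two almost perfectly collinear regressors.
xtx = [[1.0, 1.0], [1.0, 1.0001]]

before = solve_2x2(xtx, [2.0, 2.0001])  # original right-hand side X'y
after = solve_2x2(xtx, [2.0, 2.0002])   # one entry changed by 0.0001

print("estimates before:", [round(v, 6) for v in before])
print("estimates after: ", [round(v, 6) for v in after])
```

With exactly collinear regressors the determinant is zero and the function raises an error, which mirrors the "singular matrix" message mentioned above.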