Dummy variable (statistics)

Dummy variables are dichotomous variables derived from a more complex variable.

A dichotomous variable is the simplest form of data: it takes only two values. For example, colour could be coded as Black = 0 and White = 1.

It may be necessary to dummy code variables in order to meet the assumptions of some analyses.

A common scenario is to dummy code a categorical variable for use as a predictor in multiple linear regression (MLR).

For example, we may have data about participants' religion, with each participant coded as follows:

A categorical or nominal variable with three categories

This is a categorical variable which would be inappropriate to use in this format as a predictor in MLR. However, this variable could be represented using a series of three dichotomous variables (coded as 0 or 1), as follows:

Dummy coding for a categorical variable with three categories
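The dummy coding described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the sample data and the helper name `dummy_code` are hypothetical and chosen only for this example.

```python
# Hypothetical sample of participants' religion (illustrative data only).
religion = ["Christian", "Muslim", "Atheist", "Christian", "Atheist"]

# The three categories of the original variable.
categories = ["Christian", "Muslim", "Atheist"]


def dummy_code(values, categories):
    """Return one 0/1 list per category: 1 where the value matches, else 0."""
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}


dummies = dummy_code(religion, categories)
for cat in categories:
    print(cat, dummies[cat])
```

Each participant scores 1 on exactly one of the three dummy variables and 0 on the other two.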

There is some redundancy in this dummy coding. For instance, if we know that someone is not Christian and not Muslim, then they are Atheist.

So, we only need to use two of the three dummy-coded variables as predictors. More generally, the number of dummy-coded variables needed is one less than the number of categories (k - 1, where k is the original number of categories). If all k dummy variables were used as predictors, there would be perfect multicollinearity, because any one of them can be worked out exactly from the others.
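The redundancy can be checked directly. The sketch below uses hypothetical 0/1 codes for five participants and shows that the three dummy variables always sum to 1, so the third can be recovered from the other two.

```python
# Hypothetical dummy codes for five participants (illustrative data only).
christian = [1, 0, 0, 1, 0]
muslim = [0, 1, 0, 0, 0]
atheist = [0, 0, 1, 0, 1]

# Every participant belongs to exactly one category,
# so the three dummies sum to 1 for each participant.
row_sums = [c + m + a for c, m, a in zip(christian, muslim, atheist)]

# The Atheist dummy carries no new information:
# it can be rebuilt from the other two.
atheist_recovered = [1 - c - m for c, m in zip(christian, muslim)]

print(row_sums)
print(atheist_recovered == atheist)
```

Because the sum of all three dummies is the same constant for every participant, a regression model that includes an intercept cannot separate their effects. This is why one dummy variable is left out.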

The choice of which dummy variable to leave out is arbitrary and depends on the researcher's logic. The dummy variable not used becomes the reference category. Then, and this is the tricky part conceptually, each of the other dummy variables predicts the outcome variable relative to the reference category.

For example, if I'm particularly interested in whether atheism is associated with higher rates of depression, then I would treat Atheist as the reference category and use the dummy coded variables for:
 * Christian (0 = Not Christian or 1 = Christian)
 * Muslim (0 = Not Muslim or 1 = Muslim)

If the regression coefficient for the Christian dummy coded variable is:
 * not significant, then whether someone is Christian vs. Atheist isn't related to their depression
 * significant and positive, then Christian people tend to be more depressed than Atheists
 * significant and negative, then Christian people tend to be less depressed than Atheists

If the regression coefficient for the Muslim dummy coded variable is:
 * not significant, then whether someone is Muslim vs. Atheist isn't related to their depression
 * significant and positive, then Muslim people tend to be more depressed than Atheists
 * significant and negative, then Muslim people tend to be less depressed than Atheists
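
To make the "relative to the reference category" interpretation concrete, here is a minimal sketch using NumPy's least-squares solver. The depression scores are invented for illustration, and the example only shows what the coefficients mean; it does not compute significance tests, which would normally come from a statistics package.

```python
import numpy as np

# Hypothetical data: religion and depression score for nine participants.
religion = ["Atheist"] * 3 + ["Christian"] * 3 + ["Muslim"] * 3
depression = np.array([10, 12, 14, 8, 9, 10, 11, 13, 15], dtype=float)

# Two dummy predictors; Atheist is left out, so it is the reference category.
christian = np.array([1.0 if r == "Christian" else 0.0 for r in religion])
muslim = np.array([1.0 if r == "Muslim" else 0.0 for r in religion])

# Design matrix: an intercept column plus the two dummy variables.
X = np.column_stack([np.ones(len(religion)), christian, muslim])

# Ordinary least squares fit of depression on the two dummies.
coef, _, _, _ = np.linalg.lstsq(X, depression, rcond=None)
intercept, b_christian, b_muslim = coef

print("Intercept (Atheist mean):", round(intercept, 2))
print("Christian vs. Atheist:", round(b_christian, 2))
print("Muslim vs. Atheist:", round(b_muslim, 2))
```

In this made-up sample the group means are 12 (Atheist), 9 (Christian) and 13 (Muslim). The intercept equals the Atheist mean, and each dummy coefficient equals that group's mean minus the Atheist mean: -3 for Christian and +1 for Muslim.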

Alternatively, I may simply want to recode the data into a single dichotomous variable to indicate, for example, whether a participant is Atheist (0) or Religious (1), where the Religious category consists of those who are either Christian or Muslim. The coding would be as follows:

A categorical or nominal variable with three categories
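
A recode of this kind could be sketched as follows. The sample data and the name `recode_map` are hypothetical; an explicit mapping is used so that any unexpected category raises an error rather than being silently counted as Religious.

```python
# Hypothetical sample of participants' religion (illustrative data only).
religion = ["Christian", "Muslim", "Atheist", "Christian", "Atheist"]

# Explicit mapping from each original category to the new dichotomous code:
# Atheist = 0, Religious (Christian or Muslim) = 1.
recode_map = {"Atheist": 0, "Christian": 1, "Muslim": 1}

# A category missing from recode_map would raise a KeyError here.
religious = [recode_map[r] for r in religion]

print(religious)
```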