Sometimes it is useful to group similar dichotomous variables together into a single variable. One example is when asking survey respondents to identify their race/ethnicity. Given a large, heterogeneous sample, there will undoubtedly be respondents who select two, three, or more racial/ethnic categories. When using checkbox input fields, most survey data collection platforms will store each checkbox value as a distinct variable (False=0, True=1) and provide a data column for each. This can become problematic when you want to report the summary output as mutually exclusive data (each respondent counted only once).

For the race/ethnicity example, a respondent might click checkboxes for both ‘Black’ and ‘White’ if they identify as biracial. If you tabulate these two variables separately, this person will be counted twice. If you want mutually exclusive respondent counts, you may want to combine all of the dichotomous variables into a single composite variable that you can then summarize. What follows is a fairly easy way to do that without losing your mind or the nuance of the original variables.

This problem could be solved a few different ways. One option would be to examine the combinations of two-dimensional contingency tables and then manually solve for data overlaps. This is a good way to drive yourself crazy thinking about multi-dimensional space, waste paper, and possibly miscount respondents who endorsed more than two categories. Another solution would be to write “if/then” contingency statements for all of the possible combinations that might be in the data. This would result in many more variables than you originally had, and would drive you just as crazy as the first option, if not crazier.

Here’s my approach: Use an exponential transformation on the dichotomous values and then create a new variable by summing them together. Using powers of two, this is what the racial/ethnic recalculation might look like (False=0 for all variables):

- Hispanic: True=1 (2^0)
- White: True=2 (2^1)
- Black: True=4 (2^2)
- Asian: True=8 (2^3)
- NH/PI: True=16 (2^4)
- Other: True=32 (2^5)

If you then sum the transformed variables together into a new variable, you will have mutually exclusive counts that preserve all of the racial/ethnic diversity from the original data. For this example, the minimum value would be zero (nothing endorsed) and the maximum would be 63 (everything endorsed). All of the in-between values correspond to distinct combinations of the original dichotomous variables, because every sum of distinct powers of two is unique. If you included an additional category, its value would be 2^6=64: one greater than the previous max value (the new max would then be 2^7-1=127). While this literally results in an exponential number of potential categories (2^6=64 for six checkboxes), a summary report of this variable will quickly let you know which combinations are present or absent in the data, so that you don’t need to use trial-and-error to tease out the complexity.
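To make the recoding concrete, here is a minimal Python sketch of the idea. The respondent records are made up for illustration, and the names `encode` and `decode` are just placeholders I'm assuming here, not part of any survey package. It uses only the standard library; the same logic should carry over to whatever stats software you prefer.

```python
from collections import Counter

# Category order follows the list above; each weight is 2**position.
CATEGORIES = ["Hispanic", "White", "Black", "Asian", "NH/PI", "Other"]
WEIGHTS = {name: 2 ** i for i, name in enumerate(CATEGORIES)}


def encode(row):
    """Sum the power-of-two weights of every checked box (0/1 values)."""
    return sum(WEIGHTS[name] for name in CATEGORIES if row.get(name, 0))


def decode(code):
    """Recover the list of checked categories from a composite code."""
    return [name for name in CATEGORIES if code & WEIGHTS[name]]


# Hypothetical respondents, invented for this example only.
# A missing key means the checkbox was left unchecked (False=0).
respondents = [
    {"White": 1},
    {"White": 1, "Black": 1},
    {"Hispanic": 1, "White": 1},
    {"White": 1, "Black": 1},
    {},
]

codes = [encode(r) for r in respondents]

# Each respondent is counted exactly once, under one composite code.
counts = Counter(codes)

for code, n in sorted(counts.items()):
    label = " + ".join(decode(code)) or "(nothing endorsed)"
    print(code, label, n)
```

With these five made-up respondents, the summary shows one row per combination that actually occurs (codes 0, 2, 3, and 6), and `decode` turns any code back into its original checkboxes.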

This approach works well for quickly screening dichotomous categorical data before making decisions about how to re-categorize into composite variables for analysis. This is just an intermediary step in a larger decision-making process, but it might save you some time in the long run.
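As one possible screening step (my own assumption about how you might use the composite codes, not a required part of the method), you can count how many categories each code represents. With powers of two, that is simply the number of 1-bits in the code. The codes below are hypothetical.

```python
from collections import Counter


def n_endorsed(code):
    """Number of boxes checked, i.e. the number of 1-bits in the code."""
    return bin(code).count("1")


# Hypothetical composite codes (sums of the 2^n weights), for illustration.
codes = [2, 6, 3, 6, 0, 14]

# Tally respondents by how many categories they endorsed.
by_count = Counter(n_endorsed(c) for c in codes)

# List the distinct codes that combine more than one category.
multi = sorted({c for c in codes if n_endorsed(c) > 1})

print(dict(by_count))
print(multi)
```

How you then collapse the multi-category codes (if at all) is an analytic decision that depends on your data and research question.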

colditzjb@gmail.com (post author): A quick follow-up on this, thanks to some Reddit feedback:

You could also use 10^n instead of 2^n when recalculating the dichotomous values. When you sum them, you would have a result that looks more like “100110” with each place value corresponding to one of the original dichotomous variables. Using the example categories from my post, this particular value translates to True for ‘Other’, ‘Black’, and ‘White’ and False for everything else. This is a more human-readable approach if you have just a few categories and you’re not worried about the overall size of the resulting variable.
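Here is a rough Python sketch of the 10^n variant, using the same category order as the post. The function names are hypothetical. One detail worth noting: leading zeros disappear when the sum is stored as a number, so the sketch pads the value back out to six digits before reading it.

```python
# Same category order as the post; each weight is 10**position.
CATEGORIES = ["Hispanic", "White", "Black", "Asian", "NH/PI", "Other"]


def encode_base10(row):
    """Sum the power-of-ten weights of every checked box (0/1 values)."""
    return sum(10 ** i for i, name in enumerate(CATEGORIES) if row.get(name, 0))


def as_flags(code):
    """Zero-padded digit string; the leftmost digit is the last category."""
    return str(code).zfill(len(CATEGORIES))


def decode_base10(code):
    """Read the digits right to left so that index 0 lines up with Hispanic."""
    flags = as_flags(code)[::-1]
    return [name for name, flag in zip(CATEGORIES, flags) if flag == "1"]


# Hypothetical respondent who checked White, Black, and Other.
code = encode_base10({"White": 1, "Black": 1, "Other": 1})
print(code, as_flags(code), decode_base10(code))
```

Because each checkbox contributes at most a single 1 to its own place value, the digits never carry into each other, which is what keeps the result readable.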