Adapting the 'Oaxaca' Package Regression Model to Make Results Independent from Indicator Variables' Reference Categories

The Oaxaca-Blinder decomposition is a widely used technique in economics for decomposing the difference in the mean of an outcome between two groups. In its twofold form, it splits the gap into two components: a part explained by differences in the groups’ observed characteristics (the “explained” or endowments component) and a part attributable to differences in the estimated regression coefficients (the “unexplained” or coefficients component). The technique is most often used to study gaps in outcomes such as wages, for example between men and women.
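To make the idea concrete, here is a minimal base R sketch of a twofold decomposition computed by hand on simulated data. The variable names (x, y, group) and the choice of Group B's coefficients as the reference are assumptions made for illustration; this is not the code used by the ‘oaxaca’ package.

```r
# Simulated data; every name below is hypothetical.
set.seed(1)
n <- 500
group <- rep(c(0, 1), each = n)                       # 0 = Group A, 1 = Group B
x <- rnorm(2 * n, mean = ifelse(group == 1, 9, 10))   # groups differ in characteristics
y <- 1 + ifelse(group == 1, 0.4, 0.5) * x + rnorm(2 * n, sd = 0.3)
d <- data.frame(y, x, group)

# Separate regressions for each group
fit_a <- lm(y ~ x, data = d, subset = group == 0)
fit_b <- lm(y ~ x, data = d, subset = group == 1)

# Mean regressor vectors, with a leading 1 for the intercept
xbar_a <- c(1, mean(d$x[d$group == 0]))
xbar_b <- c(1, mean(d$x[d$group == 1]))

# Raw gap in mean outcomes
gap <- mean(d$y[d$group == 0]) - mean(d$y[d$group == 1])

# Twofold decomposition, using Group B's coefficients as the reference
explained   <- sum((xbar_a - xbar_b) * coef(fit_b))
unexplained <- sum(xbar_a * (coef(fit_a) - coef(fit_b)))

print(c(gap = gap, explained = explained, unexplained = unexplained))
```

Because each regression includes an intercept, the two components add up to the raw gap exactly.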

The ‘oaxaca’ package in R provides an implementation of this decomposition. However, when the regression includes sets of indicator (dummy) variables, the detailed, variable-by-variable results can depend on which category of each set is omitted as the reference category, which makes them difficult to interpret.

Understanding the Problem

In this blog post, we will explore how to set up the ‘oaxaca’ package regression model so that the results do not depend on the indicator variables’ reference categories. This involves understanding what a reference category is and how the formula passed to the ‘oaxaca’ package identifies the indicator variables to adjust.

Reference Categories

In statistics, a reference category is the level of a categorical predictor that is omitted when the variable enters a regression as a set of indicator variables. It serves as the baseline: the coefficient on each included indicator measures the difference in the outcome relative to the omitted level. Which level serves as the reference is the analyst’s choice. Common choices are the first, the last, or the most frequent level, and R uses the first factor level by default.

For example, if we have a variable representing age group (0-19, 20-39, 40-59, and 60+), we could choose “60+” as the reference category. The coefficients on the other three age groups would then be interpreted relative to this baseline.
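The following base R sketch shows how a reference category arises in practice. The age factor is made up for illustration; the point is only that, under R's default treatment contrasts, the first factor level gets no indicator column of its own.

```r
# A hypothetical age-group factor with explicitly ordered levels
age <- factor(c("0-19", "20-39", "40-59", "60+", "20-39", "60+"),
              levels = c("0-19", "20-39", "40-59", "60+"))

# The first level is the baseline under R's default treatment contrasts
print(levels(age)[1])

# The design matrix has an intercept plus one indicator per non-baseline level
X <- model.matrix(~ age)
print(colnames(X))
```

The baseline level “0-19” has no column in the design matrix; its effect is absorbed into the intercept.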

Indicator Variables and Reference Categories

When a model contains several sets of indicator variables representing different categorical variables, each set has its own reference category: one indicator per set must be left out to avoid perfect collinearity with the intercept. The coefficients in each set are then measured relative to that set’s omitted category.

For instance, if education level is coded as indicators for high school, bachelor’s degree, and master’s degree, we might choose “master’s degree” as the reference category. The coefficients on the other education levels would then be interpreted relative to this baseline.

Adapting the ‘Oaxaca’ Package Regression Model

To make the ‘oaxaca’ results independent of the indicator variables’ reference categories, we need to tell the package which indicator variables belong to a set whose omitted category should not affect the results. This is done through the formula argument of the oaxaca() function.

Specifying Reference Categories Using the formula Argument

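The oaxaca() function uses the following syntax for the formula:

y ~ x1 + x2 + ... | z | d1 + d2 + ...

In this syntax, y is the dependent variable, x1, x2, etc. are the predictor variables, and z is an indicator variable that identifies which of the two groups each observation belongs to. The | symbol separates the three parts of the formula. The optional third part, d1, d2, etc., lists the indicator variables for which the decomposition results are adjusted so that they do not depend on the omitted reference category.

The indicators listed in the third part should also appear among the predictors in the first part, with the reference category left out of both. As we understand the package documentation, the third part is described in terms of one set of indicator variables, so a model with several sets calls for extra care and is worth checking against the documentation.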
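As a sketch of how a three-part formula of the form y ~ predictors | group indicator | indicators to adjust is passed to the package, the example below uses the chicago data set that, to our knowledge, ships with ‘oaxaca’, with variable names as we recall them from the package vignette. The small number of bootstrap replicates (R = 30) is an arbitrary choice to keep the run short, and the call is guarded so the snippet still runs where the package is not installed.

```r
# Three-part formula: outcome ~ regressors | group indicator | one set of dummies.
# The education dummies appear in both the first and the third part;
# the omitted education category is left out of both.
f <- ln.real.wage ~ age + female + LTHS + some.college + college + advanced.degree |
  foreign.born |
  LTHS + some.college + college + advanced.degree

if (requireNamespace("oaxaca", quietly = TRUE)) {
  data("chicago", package = "oaxaca")
  # R = 30 bootstrap replicates is an assumption chosen only for speed
  results <- oaxaca::oaxaca(formula = f, data = chicago, R = 30)
  print(results$threefold$overall)
}
```

If the variable names or the structure of the returned object differ in your installed version, the package's help page for oaxaca() is the authoritative reference.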

Example Formula

Suppose we want to decompose the gap in log hourly wage (log.wage) between two groups, identified by the indicator female, into a part explained by differences in characteristics and a part due to differences in coefficients. Besides several other predictors, we have three sets of indicator variables representing different categorical variables: age, education level, and industry.

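Using a formula of the form y ~ predictors | group indicator | indicators to adjust, we can list the education indicators as the set to adjust:

log.wage ~ hours + pubcon + temp + size1to50 + ... + edu1 + ... + nace1 + ... | female | edu1 + ...

In this example, female is the group indicator, and edu1 together with the remaining education indicators (elided here as ...) forms the set whose reference category should not affect the results. The age and industry indicators, nace1 being one of the latter, enter the first part as ordinary predictors, each set with one category omitted. For these additional sets, the dependence on the omitted category has to be handled separately, for example by normalizing each set’s coefficients so that they are expressed as deviations from the set’s average rather than from one particular category.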
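One way to see what independence from the reference category means for several sets at once is a deviation-from-mean normalization: within each set, the omitted category is given a coefficient of zero and all coefficients are re-expressed as deviations from the set's average. The base R sketch below applies this to simulated data with two hypothetical factors (edu and ind). It illustrates the general idea only and is not a description of what the ‘oaxaca’ package does internally; normalize_set() is a helper written for this example.

```r
# Simulated data with two categorical regressors; all names are hypothetical.
set.seed(42)
n <- 400
d <- data.frame(
  edu   = factor(sample(c("high school", "bachelor", "master"), n, replace = TRUE)),
  ind   = factor(sample(c("manufacturing", "services", "trade"), n, replace = TRUE)),
  hours = rnorm(n, 38, 4)
)
edu_effect <- c("high school" = 0, bachelor = 0.3, master = 0.5)
ind_effect <- c(manufacturing = 0.1, services = 0, trade = -0.1)
d$log.wage <- 2 + 0.01 * d$hours +
  unname(edu_effect[as.character(d$edu)]) +
  unname(ind_effect[as.character(d$ind)]) +
  rnorm(n, sd = 0.2)

# Re-express one factor's coefficients as deviations from their mean,
# counting the omitted reference level as a coefficient of zero.
normalize_set <- function(fit, fac) {
  lv <- levels(model.frame(fit)[[fac]])
  b <- setNames(numeric(length(lv)), lv)        # reference level stays at 0
  b[lv[-1]] <- coef(fit)[paste0(fac, lv[-1])]   # estimated contrasts
  (b - mean(b))[sort(lv)]                       # sorted for easy comparison
}

# Same model, two different choices of reference categories
fit1 <- lm(log.wage ~ hours + edu + ind, data = d)
d2 <- transform(d,
                edu = relevel(edu, ref = "master"),
                ind = relevel(ind, ref = "trade"))
fit2 <- lm(log.wage ~ hours + edu + ind, data = d2)

# Raw coefficients depend on the reference category ...
print(c(ref_default = unname(coef(fit1)["edubachelor"]),
        ref_master  = unname(coef(fit2)["edubachelor"])))

# ... while the normalized coefficients do not
norm1 <- c(normalize_set(fit1, "edu"), normalize_set(fit1, "ind"))
norm2 <- c(normalize_set(fit2, "edu"), normalize_set(fit2, "ind"))
print(round(rbind(reference_1 = norm1, reference_2 = norm2), 3))
```

The two rows of normalized coefficients coincide even though the raw coefficients differ. To keep fitted values unchanged, the average that is subtracted from each set would be added to the intercept.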

Using the ref Argument

The oaxaca() function has no ref argument of its own. The ref argument belongs to base R’s relevel() function, which sets the reference level of a factor explicitly. This allows us to be more explicit about which category serves as the baseline when the indicator variables are generated from a factor.

For example, assuming education is stored as a factor named Education:

Education <- relevel(Education, ref = "master's degree")

In this case, the ref argument specifies that “master’s degree” should be used as the reference category for the Education variable. Note that relevel() only changes which category is omitted. It does not by itself make the decomposition results independent of that choice.
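For reference, base R's relevel() takes a ref argument that sets a factor's baseline level. The sketch below, on simulated data with hypothetical names (edu, wage), shows that changing the reference level changes the reported coefficients but leaves the fitted values untouched.

```r
# Simulated data; names and effect sizes are made up for illustration.
set.seed(7)
edu <- factor(sample(c("high school", "bachelor's degree", "master's degree"),
                     300, replace = TRUE))
wage <- 10 + 2 * (edu == "bachelor's degree") +
  4 * (edu == "master's degree") + rnorm(300)

# Default baseline: the first level in alphabetical order
fit_default <- lm(wage ~ edu)

# Explicit baseline chosen with relevel()'s ref argument
edu_m <- relevel(edu, ref = "master's degree")
fit_master <- lm(wage ~ edu_m)

print(coef(fit_default))
print(coef(fit_master))
```

Both fits describe the same model. Only the category against which the other coefficients are measured has changed, which is why results that are reported coefficient by coefficient can shift with this choice.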

Conclusion

Making the ‘oaxaca’ results independent of the indicator variables’ reference categories requires an understanding of what reference categories are and how they enter the regression. By specifying the indicator variables to adjust in the formula argument, and by being explicit about which category each set omits, we can keep the detailed decomposition results from hinging on an arbitrary choice of baseline.

While working with multiple sets of indicator variables can add complexity to our analysis, following these steps will help us adapt the ‘Oaxaca’ package regression model to meet our specific research needs.


Last modified on 2024-03-18