The Economic Policy Institute (EPI) is an advocacy group posing as a think tank. Before we get into the glaring sloppiness of this paper extolling the virtues of a unionized labor force, please watch this video explaining why so many "studies" like this are junk. Complete and total junk.

Poor data analysis techniques are alarmingly common even in "hard" sciences like chemistry and physics. But they are truly epidemic in the social sciences, where most of the "research" is done by advocacy groups setting out to prove a pre-existing conclusion. Recall that phrase about using science "as a drunk uses a lamppost: for support rather than enlightenment." (An example of how junk research is done is found in this post about guns and crime, where simply shifting the observed time period by a couple of years produced the opposite of what the study claimed to prove.)

So we know there is a lot of junk science out there. We know the EPI is an advocacy group publishing studies in social science, the discipline by far the most likely to contain junk science. How rigorous is this study?

## The Study

The executive summary of the study lists many claims, all of which support the idea that unions raise wages for unionized and non-unionized workers alike. The core findings of the study, however, concern the impact of unionization rates upon *non-union wages*.

On its face, this is a plausible claim. Indeed, higher wages and benefits are the raison d'être of organized labor. It is important to understand the mechanism by which this works: the union introduces artificial scarcity, and with supply reduced, existing demand drives union wages higher. The union gets to be the monopoly provider of labor to the unionized business. It's pretty clear how union membership increases the wages of union members.

It is also plausible that higher union wages might simultaneously raise *non-union wages* because the opportunity cost of union labor is higher, so the benefit of non-union labor is also higher. Putting numbers to it, if the free market labor rate is $20/hr and the union rate is $30/hr, a non-union worker could offer to work for $25 and still be cheaper than the union, while earning a premium relative to the free market rate. Remember that last phrase, as that is critical.
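
To put that arithmetic in one place, here is a minimal Python sketch. The three rates are the hypothetical figures from the paragraph above, not data from the study, and the variable names are my own.

```python
# Hypothetical hourly rates taken from the example in the text.
free_market_rate = 20.0   # $/hr with no union presence
union_rate = 30.0         # $/hr negotiated by the union
non_union_offer = 25.0    # $/hr a non-union worker might ask for

# Premium the non-union worker earns over the free-market rate.
premium_over_market = non_union_offer - free_market_rate

# Savings the employer still gets relative to hiring union labor.
discount_vs_union = union_rate - non_union_offer

print(f"Premium over free-market rate: ${premium_over_market:.2f}/hr "
      f"({premium_over_market / free_market_rate:.0%})")
print(f"Savings versus union rate: ${discount_vs_union:.2f}/hr "
      f"({discount_vs_union / union_rate:.0%})")
```

The non-union worker's gain only exists relative to the free-market rate, which is the point to keep in mind.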

You can see that it's reasonable to construct a situation where unionization raises the wages of both organized and non-union labor. But being able to construct a scenario where this happens is not the same as saying that it actually has happened on a broad scale across a national labor market. And it is still another thing to claim that if it did happen on a broad scale at one period in time, all of those factors would again combine the same way to produce the same effects in a time period nearly 40 years removed.

While the authors do present some data for their claims, they also make some bald assertions that are completely unsubstantiated:

> In the ongoing debates over wage stagnation, these indirect effects of unions have not received nearly the attention as the oft-cited accounts mentioned above. Partly this is due to the difficulty in disentangling the independent effect of unions on nonunion workers’ pay. Globalization, technological advances, and institutional shifts—most notably the dramatic decline of the U.S. labor movement, along with the falling real value of the minimum wage—have all affected average workers’ wages. These developments are intertwined in numerous ways.
>
> For example, union decline reduced resistance to offshoring, and offshoring, or the threat thereof, emboldened employers in union negotiations

There is no evidence to support the idea that unions were able to keep companies from offshoring. And it's far from given that a union can both create the benefit of offshoring (via the "union wage premium") and also disincentivize that same offshoring. How is it that a union can prevent offshoring? It cannot. Clearly other factors may explain the marginal utility of offshoring: increased productivity in the developing world, enhanced capital mobility, improvements in the rule of law and safety of intellectual property, etc. It seems far more plausible to me that union strength, not decline, drove much of the appeal of offshoring. The value of offshoring derives substantially from the opportunity cost of onshore labor.

## Methodology

To really find out why the study is bogus, we have to dive into the methodology. Hang in there, this gets a little nerdy. The first thing to observe is what they include in their sample:

> All of our analyses are limited to nonunion private-sector workers who report positive wages and report working 30 or more hours per week and are between the ages of 16 and 64. We exclude top-level managers along with the self-employed.

Samples are often a primary contributor to invalid results. In this case, the sample seems plausible. It is very large and encompasses a substantial portion of the nonunion labor force.

> For our model-predicted wage series, referred to in the text as our “estimated” weekly wages, we regress weekly wages for private-sector workers who are not union members on the following set of covariates: industry-region unionization (described above), a one-year lagged version of the industry-region employment rate (described above), four mutually exclusive race/ethnicity measures (non-Hispanic white, non-Hispanic African-American, non-Hispanic other, and Hispanic), four mutually exclusive education measures (less than high school, high school diploma or equivalent, some college, four or more years of college), potential experience and potential experience squared, a set of four occupational measures (professional/managerial, production, service, farm/forestry/fisheries), hours worked per week, a measure of whether the respondent lives in a metropolitan area, year dummies, and a dummy indicating whether the respondent works in the manufacturing sector. Wages are measured in constant 2013 dollars, and models are weighted to be representative of the active workforce. For our analyses of non–college degree and high school or less workers, we replicate the model described above but limit the sample first to those workers with less than a bachelor’s, and then to those workers with a high school diploma or less. We cluster our standard errors by industry-region in all models.

> Our counterfactual series replicates the model described above except we set the industry-region unionization rates at their 1979 levels. That is, we solve the regression equation by plugging in the observed values for every other covariate except for industry-region unionization. For example, for each individual in the dataset, we treat that individual as if their industry-union rate was equal to the 1979 rate (regardless of what the true rate is in that particular year), leave all other covariate values equal to their observed values, and compute their predicted weekly wage using the estimated model equation. Instead of modeling log weekly wages, we estimate generalized linear models (GLM) specifying a log link and gamma distribution family. While estimating a log-linear model and then applying an appropriate smearing factor to retransform the dependent variable will improve the mean predicted wages relative to exponentiating log wages, this approach does not ensure that predictions for individual cases are particularly accurate, and there is little consensus regarding what smearing factor is preferable. Our approach does not require transformation of the dependent variable in the first place or retransformation post-estimation in order to predict values of the dependent variable in its raw-scale. We experimented with other common approaches, such as retransforming our predictions to the original scale following the estimation of log-linear ordinary least square (OLS) models using naive and Duan smearing estimators. Results are available upon request. In general, the GLM approach produces slightly larger counterfactual wage estimates than retransforming and using the Duan smearing estimators.

OK, what is that in English?

A regression is simply a comparison of how changes in one factor relate to changes in another. A good model with statistical power can predict the change in the dependent variable based on changes in the independent variable(s). It's like collecting a bunch of data relating how hard you throw a ball to how far it goes. With good analysis, you can predict (with high precision) the distance the ball will travel if you know how hard it was thrown. That is a simple two-factor regression. You can add several more factors and often get some fascinating results. For example, I once generated a multiple regression model that could predict within 7 degrees Fahrenheit the temperature of a section of a diesel engine piston based on fuel rate, injection timing, and a couple of other factors.
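
For the ball-throwing case, a two-factor regression can be sketched in a few lines of plain Python. The measurements below are invented for illustration, and `fit_line` is my own helper rather than anything from the study; it fits a straight line by ordinary least squares.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the co-variation of x and y divided by the variation in x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up measurements: throwing effort (arbitrary units) vs. distance (meters).
effort = [1.0, 2.0, 3.0, 4.0, 5.0]
distance = [5.2, 9.8, 15.1, 19.9, 25.0]

intercept, slope = fit_line(effort, distance)

# Use the fitted line to predict the distance for a throw we have not observed.
new_effort = 6.0
predicted_distance = intercept + slope * new_effort
print(f"distance ~ {intercept:.2f} + {slope:.2f} * effort")
print(f"predicted distance at effort {new_effort}: {predicted_distance:.2f} m")
```

A multiple regression does the same thing with several independent variables at once.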

But this study "regresses" the dependent variable against all those factors listed above relating to race, region, education, ethnicity, hours worked, etc. The thing with regression is that it must be done with **continuous** variables as independent variables. That means each factor must be capable of assuming the infinite range of values within its limits. It is "continuous" because there are no gaps in the range of values the variable can take. This is critical because regression is dealing with rates of change between variables. The difference between 3 and 4 is one. But what is the numerical difference between White and Hispanic? Regression must be able to quantify a change in an independent variable. It cannot do that for "occupational measures."

Which means that each of these variables must be made into "dummy" variables that are completely and utterly arbitrary. So perhaps you assign a value of -2 to not having finished high school, -1 to high school diploma, 0 to Bachelor's degree, 1 to Master's, and 2 to PhD. The problem with that approach should be pretty obvious: if you are attempting to examine the impact on wages, then the gap between adjacent values of the variable must mean the same thing everywhere. For example, if you have a variable like temperature to the nearest degree, it would be OK because the difference between 30 and 31 means the same as the difference between 50 and 51: namely, a single degree Fahrenheit. THIS IS NOT THE CASE WITH EDUCATION. You cannot treat the five levels of education I just outlined as if the differences between each increment were comparable with respect to wages. Moving from "1" to "2" is not twice as much education.
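
Here is a small sketch of that equal-spacing problem, using the -2 to 2 coding from the paragraph above. The weekly wage figures are invented purely for illustration. A straight line fitted to integer codes gives the same predicted wage step between every pair of adjacent levels, whatever the real gaps are.

```python
# Integer codes from the text; the wage figures are hypothetical.
levels = ["no diploma", "high school", "bachelor's", "master's", "PhD"]
codes = [-2, -1, 0, 1, 2]
weekly_wage = [500.0, 650.0, 1000.0, 1200.0, 1600.0]

# Least-squares line: wage = intercept + slope * code.
n = len(codes)
mean_c = sum(codes) / n
mean_w = sum(weekly_wage) / n
slope = (sum((c - mean_c) * (w - mean_w) for c, w in zip(codes, weekly_wage))
         / sum((c - mean_c) ** 2 for c in codes))
intercept = mean_w - slope * mean_c

# Gaps between adjacent education levels: observed vs. implied by the line.
fitted = [intercept + slope * c for c in codes]
actual_gaps = [b - a for a, b in zip(weekly_wage, weekly_wage[1:])]
fitted_gaps = [b - a for a, b in zip(fitted, fitted[1:])]

for lo, hi, actual, implied in zip(levels, levels[1:], actual_gaps, fitted_gaps):
    print(f"{lo} -> {hi}: actual gap {actual:.0f}, line implies {implied:.0f}")
```

With these made-up numbers the observed gaps differ from step to step, while the fitted line forces one identical step on all of them.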

The only proper way to use a dummy variable in regression is as a binary yes/no indicator, and even then it introduces a substantial source of error. For example, a researcher studying life expectancy in Kansas might make a dummy variable of "is there a war happening" with yes/no options (usually represented by one or zero, like in computer programming). But that is irrelevant because you can't really define "war" in a context like this, and it would be far less of an indicator than the number of Kansans present in the military at any given time.

Returning to the education example, one could create a dummy variable for "master's degree" that is yes or no (1, 0). But those with Master's degrees had to get Bachelor's degrees, too, which means that every "1" for Master's should also have a "1" for Bachelor's. Meaning you have massive collinearity in the variables if you do it this way.
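
A quick sketch of that overlap, assuming the dummies are coded cumulatively as just described (a Master's holder gets a 1 for Bachelor's as well). The ten-person sample and the names `has_bachelors` and `has_masters` are invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical sample: highest degree held by each of ten workers.
highest_degree = ["none"] * 4 + ["bachelor"] * 4 + ["master"] * 2

# Cumulative coding: anyone with a master's also counts as having a bachelor's.
has_bachelors = [1 if d in ("bachelor", "master") else 0 for d in highest_degree]
has_masters = [1 if d == "master" else 0 for d in highest_degree]

r = pearson(has_bachelors, has_masters)
print(f"correlation between the two dummies: {r:.2f}")
```

The two columns are positively correlated by construction; how strong the correlation is depends on the mix of degrees in the sample.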

Because of these dummy variables, the numerical precision of the regression is certainly wrong.

Moreover, the authors make the absurd assumption that everything (except unionization rates) was the same in 1979. This is where taking techniques from the hard sciences and employing them in social science leads to junk.
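
To make the counterfactual step from the methodology quote concrete, here is a rough sketch. It assumes a model with a log link, so the predicted wage is the exponential of a linear combination of covariates. The coefficients, the two workers, and the 1979 unionization rate below are all hypothetical placeholders, not the study's estimates, and the model is cut down to two covariates.

```python
import math

# Hypothetical log-link coefficients; NOT the study's estimates.
COEFS = {"intercept": 6.0, "union_rate": 0.5, "experience": 0.02}

# Hypothetical 1979 industry-region unionization rate.
UNION_RATE_1979 = 0.30

def predict_weekly_wage(union_rate, experience):
    """Predicted wage under a log link: exp of the linear predictor."""
    eta = (COEFS["intercept"]
           + COEFS["union_rate"] * union_rate
           + COEFS["experience"] * experience)
    return math.exp(eta)

# Two made-up workers with their observed covariates.
workers = [
    {"union_rate": 0.10, "experience": 5},
    {"union_rate": 0.08, "experience": 20},
]

for w in workers:
    # "Estimated" wage: every covariate at its observed value.
    estimated = predict_weekly_wage(w["union_rate"], w["experience"])
    # Counterfactual wage: swap in the 1979 rate, leave everything else alone.
    counterfactual = predict_weekly_wage(UNION_RATE_1979, w["experience"])
    print(f"estimated {estimated:.0f}, counterfactual {counterfactual:.0f}")
```

Notice that the entire gap between the two numbers comes from the one coefficient on unionization, so the result is only as good as that estimate.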

Note also how we get certain factors to be lagged, squared, and such. There's no discussion of why lagging one year, instead of three or not at all, is the right amount. Nor is there any argument as to why certain factors should be squared.

In hard sciences, such lagging and squaring is very common, but it is completely justifiable. For example, science empirically established long ago that aerodynamic drag is proportional to the square of speed. So it would be appropriate to use "speed squared" in a regression of factors that determine aerodynamic drag.

Not so with using "potential experience squared." What is "potential experience"? Why is it appropriate to square it?

To me, this suggests model "tuning" or "p-hacking" as the video above describes. The authors massaged their model until it supported (albeit very weakly) the headline case they wanted to argue: that unions help EVERYONE, even the non-union worker.