Introduction and Motivation

Overdose deaths in the US have increased dramatically since 1999. Opioid use deaths in particular have reached epidemic levels, with a 200% increase in overdose deaths between 2000 and 2014 (Ghertner and Groves 2018). The Centers for Disease Control and Prevention estimates that over 60,000 drug overdose deaths occurred in 2016, three times the rate of drug deaths as 1999 (H. Hedegaard and Warner 2018). Reports in the literature from the CDC and other health journals have established a clear individual-level relationship of risk of drug overdose with rates of Medicaid and private insurance coverage (H. Hedegaard and Warner 2018), codependence on other drugs (Puja Seth 2018), and is notably prevalent among working class non-Hispanic Whites in America (Nolan 2016). However, there is not much in the literature attesting to state-level effects as they relate to overdose deaths in general (not only limited to opioid-caused deaths), especially with regard to macroeconomic indicators like unemployment level, import/export levels, and other proxies for economic activity at a state-level.

As such, our analysis has two goals. First, we will determine the significance of unemployment when predicting overdose death rates, and validate this significance in the presence of other predictors like import/export activity, state population, and other economic indicators. Second, we will cross-validate several well-tuned models to determine which one has the most predictive ability. Our ultimate purpose is to see how economic indicators across time may affect the opioid epidemic.

Data Summary

Data was collected from a total of 3 sources: the Centers for Disease Control and Prevention (CDC), the Federal Bureau of Labor Statistics (BLS), and the Federal Reserve Bank of St. Louis’ Federal Reserve Economic Data (FRED).

CDC

We collected health data from the CDC in the form of the Vital Statistics Rapid Release Dataset (“VSRR Provisional Drug Overdose Death Counts” 2019). The VSRR data contains provisional counts of drug overdose deaths in the US as reported by agencies from all 50 states and the District of Columbia. The data is collected on a monthly basis.

The data of import to this project is the number of deaths in each state as a result of non-specific drug overdose. Drug overdoses are counted by state agencies in accordance to World Health Organization standards, which lay out the basic guides for reporting agencies to code and classify causes of death. Drug categories that are represented in this dataset include the major drivers of the opioid epidemic like heroin (coded by T40.1), natural opioid analgesics (morphine and codeine), synthetic opioids (oxycodone, hydrocodone, oxymorphone; T40.2), methadone (T40.3), other synthetics (fentanyl, tramadol; T40.4) and other drugs like cocaine, methamphetamine, etc. (“VSRR Provisional Drug Overdose Death Counts” 2019)

There were over 26052 data points from the VSRR dataset. Of those data points, many are individual observations of individually-coded deaths from different drugs. After reshaping and data cleaning, there are now 2652 individual observations. The data ranges across 4 years (2015 - 2019) from a start date of January 1, 2015 to December 1, 2018, with each state reporting 48 observations (once per month). Overdose deaths range from a low 55 deaths in the month of May 2015 in South Dakota to a high of 5697 in Pennsylvania in September of 2017.

BLS

Unemployment data was sourced from the Bureau of Labor Statistics. Unemployment data is published in monthly increments from the Bureau of Labor Statistics by state (“Overview of BLS Statistics on Unemployment” 2019). Data is published beginning in 1976 and is published on the first of each month describing the previous month’s unemployment rate.

There is a very specific definition of who in the labor force is considered unemployed. According to the BLS, those who are currently unemployed are those who are “jobless, looking for a job, and available for work.” People who are incarcerated, in a nursing home, or in a mental health care facility are not considered unemployed as they are not fit for work.

Using this definition, data was scraped from the BLS website and aggregated by each state (including the District of Columbia). The lowest unemployment rate in a given state and month is Vermont in 2019 with a 2.1% unemployment rate. The highest rate is DC in 2015 with a 7.4% unemployment rate. The data itself is roughly Normally distributed with a mean of 4.2% and a median of 4.31%.

We hypothesize that unemployment will have a significant positive association with overdose death rate. A working mechanism may be that unemployment exacerbates existing community economic strife and potentially leads to increase abuse of opioids.

FRED

Here, we consider a series of variables from the Federal Reserve Bank of St. Louis’ Federal Reserve Economic Data database. All such datasets are originally from Bureau of Labor Statistics or the U.S. Census Bureau; however, FRED publishes their data in a more accessible form.

First, we consider the number of new private housing units authorized by building permits. This timeseries is reported monthly. In our analysis, this variable serves as a proxy to housing development as well as the health of the housing market; more permits implies a healthy housing market and ample housing options. We thus expect a negative association between overdose deaths and new housing permits.

Next, we consider the money spent on imports and exports of manufactored and non-manufactored commodities in millions of dollars. This time-series are reported monthly. This feature will offer our analysis an insight into the health of the state-local manufactoring sector. In the past few years, the states with some of the worst opioid overdose numbers are those in in the Rust Belt (“Opioid Summaries by State” 2019). One potential narrative is that the increased use of overseas labor stips entire communities of their jobs, and many of those effected turn to opioids. We can reasonably quantify this effect by state with the number of dollars spent on imports. We expect an increase in spending on imports to imply a decrease in availability of manufactoring jobs and the health of the local manufactoring sector, and thus we expect a positive association between import spending and overdose deaths. By way of analygy, we expect the value of exports to have a negative association with overdose deaths.

Finally, we consider per capita personal income in dollars. This timeseries is reported annually. It is reasonable to assume that the average personal income does not change dramatically intrayear. Per capita income affords us a glimpse into the personal financial freedom of state residents. We chose to include per capita income instead of median household income in order to capture the influence of the wealthly. We expect a negative association between income and overdose deaths.

Below, we include annual population estimates from the Census Bureau. We use annual data for two reasons: the availability of state population estimates is limited and we can reasonably assume that state population does not change dramatically intrayear. Furthermore, we use the population estimate from the previous year as a given year’s population variable. We do so becuase population estimates are generated late into the year, and thus for any given year, the previous year’s Census Bureau estimate is likely more accurate. We will use population to normalize the other predictors.

R inherently has several statistics about states. Most of these are static estimates from the early 70’s (population estimates, income per capita, illiteracy, etc.). We can, however, use the datasets which give information about the region a state is in, as well as area (which has remained unchanged since the time of data collection). Region is a categorical variable with four categories: north, northcentral, south, and west Though the nomenclature is dated, these categories reasonably divide states by potentially important factors. For example, we expect the Rust Belt (which R includes in the region northcentral) to have a positive association with overdose deaths (as noted above). Here we note that we have to encode Washington, D.C., ourselves as the data in R does not contain the District of Columbia (in accordance with R’s encoding, Washington, D.C., is in the South).

Before we begin any modeling or analyis, we must normalize overdose deaths, permits, imports, and exports. By doing so, we convert each to a rate which can be directly compared across states. In order to avoid working with very small numbers, we convert each rate to: overdose deaths per 100000 people, permits per 100000 people, spending on imports in millions of dollars per 100000 people, and value of exports in millions of dollars per 100000 people.

Below we check to see if any cells of the dataframe are NA, NULL, NaN, or infinite. No such values exist. We have a relatively clean dataset.

We now consider the possibility of multicollinearity. As shown below, some of the quantitative features of our dataset are highly correlated (population and imports, imports and exports,…). In order to handle this multicolinearity, we will consider ensemble models such as random forests.

Correlation plot of predictors and response.

Correlation plot of predictors and response.

Plotting predictors and response by state and region

Plotting predictors and response by state and region

Histograms and boxplots of overdose deaths and log-transformed overdose deaths.

Histograms and boxplots of overdose deaths and log-transformed overdose deaths.

Results

Baseline Linear and Quadratic Models

##   Estimate Std. Error    t value   Pr(>|t|) 
##      0.091      0.009     10.072      0.000

Below, we see that the data is much closer to a Normal distribution if we apply a log transformation. We shall predict the log-rate of overdose deaths in the remainder of our models as a result. The simple regression model has a positive coefficient for unemployment (\(0.0913\)). With a \(t\)-statistic of \(10.072\) (\(p\)-value \(\approx 0\)), this coefficient is very significant. The model has a positive association between unemployment and overdose deaths, which agrees with our hypothesis.

Linear regression of log-overdose deaths with unemployment rate.

Linear regression of log-overdose deaths with unemployment rate.

In order to better evaluate the importance of unemployment, we apply an ESS F-test. First we fit two linear models. The first model include all main effects except unemployment, while the second includes all main effects. Then, we fit two quadratic models. The first model contains all main effects and their respective quadratic variants except unemployment. The second model contains the same predictors, except that it includes unemployment and its quadratic effect. We again verify the assumptions of a linear model via a plot of residuals vs. fitten values.

## Analysis of Variance Table
## 
## Model 1: log(overdoseDeaths) ~ state + month + year + permits + imports + 
##     exports + income + population + region + area
## Model 2: log(overdoseDeaths) ~ state + month + year + permits + imports + 
##     exports + income + population + region + area + unemployment
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   1868 22.807                                  
## 2   1867 22.634  1     0.174 14.314 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## unemployment       income      imports      exports      permits 
##     -0.02950      0.00004     -0.00089      0.00073     -0.00018

With an F-statistic of \(14.314\) (with \(1867\), \(1\) degrees of freedom) and a corresponding p-value \(\approx 0\), unemployment provides significant predictive ability. We now consider the same ESS F-test with quadratic models.

Unemployment now has a negative coefficient. This is likely due to multicollinearity following from the results of our simple linear model. Regardless, this result contradicts our hypothesis concerning unemployment. We do, however, see positive coefficients for income and exports, and a negative coefficient for imports (as we hypothesized in data collection and EDA). The coefficient for permits is negative, which contradicts our hypothesis.

## Analysis of Variance Table
## 
## Model 1: log(overdoseDeaths) ~ state + month + year + poly(permits, 2, 
##     raw = T) + poly(imports, 2, raw = T) + poly(exports, 2, raw = T) + 
##     poly(income, 2, raw = T) + poly(population, 2, raw = T) + 
##     region + area
## Model 2: log(overdoseDeaths) ~ state + month + year + poly(permits, 2, 
##     raw = T) + poly(imports, 2, raw = T) + poly(exports, 2, raw = T) + 
##     poly(income, 2, raw = T) + poly(population, 2, raw = T) + 
##     region + area + poly(unemployment, 2, raw = T)
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   1863 22.145                                  
## 2   1861 21.816  2    0.3294 14.049 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Standardized residuals of the quadratic full-effects model appears to satisfy Normality assumptions.

Standardized residuals of the quadratic full-effects model appears to satisfy Normality assumptions.

With an F-statistic of \(14.049\) (with \(1861\), \(2\) degrees of freedom) and a corresponding p-value \(\approx 0\), unemployment provides significant predictive ability. Additionally, there seems to be no underlying structure to the residuals, the residuals appear to have reasonably constant variance, and are reasonably normally distributed: the assumptions of our models (linearity, homoscedasticity of residuals, and Normality) appear to be satisfied.

We examine the importance of unemployment in two additional ways for sake of thoroughness: LASSO regression, and random forest regression.

LASSO

Due to L1-norm penalty of LASSO, less important variables quickly converge to zero as the tuning parameter increases. Below, we fit a LASSO model (with the same design matrix as that of the full quadratic model polynomial2) with a series of possible tuning parameters \(\lambda\). We then consider the order in which the variable coefficients shrink to zero.

LASSO feature selection plotted against values of log-lambda.

LASSO feature selection plotted against values of log-lambda.

LASSO feature selection agrees with the results of the ESS F-test. Unemployment’s quadratic effect is one of the last coefficients to shrink to zero along with population.

Random Forest

We further test this conclusion with a random forest model. Below, we fit a random forest model with all main effects and then examine the model’s relative feature importance.

Random forest feature importance.

Random forest feature importance.

Unemployment is not important in the random forest model compared to other predictors. It appears as though state and area (which is correlated with state) are the most important predictors. In order to handle the grouped nature of dates, state, area, etc., we turn to fit a mixed effects model.

Mixed Effects

Below, we fit a three-layered mixed-effects model. The first two layers are state and year respectively. Theoretically, we would prefer to fit a four-layered model which includes month below year. This is impossible with our dataset (\(n = 2448\)). With 51 states, 5 years, and 12 months, we would need a minimum of \(51 \times 5 \times 12 = 3060\) samples in order to fit such am model. We include a random intercept for states and years. The only random effect we include in the model is unemployment. These are purposeful design choices considering the limited size of our dataset. More complex models would not be reasonably fit.

## Analysis of Variance Table
##              Df   Sum Sq   Mean Sq F value
## month        11 0.305188 0.0277444 20.0454
## permits       1 0.003387 0.0033866  2.4469
## imports       1 0.002240 0.0022399  1.6183
## exports       1 0.000181 0.0001809  0.1307
## income        1 0.012506 0.0125057  9.0354
## population    1 0.002036 0.0020364  1.4713
## region        3 0.009944 0.0033146  2.3948
## area          1 0.003675 0.0036749  2.6551
## unemployment  1 0.000444 0.0004445  0.3211
## unemployment 
##  -0.01021003

We are not able to calculate p-values for the ANOVA table (the statistics from a mixed model are not F-distributed), however, we can reasonably say that month is the most significant predictor followed by per capita income and region. Unemployment is relatively significant, however, we can show its significance using a p-value. In this model, unemployment has a negative coefficent. This contradicts our simple model and our hypothesis.

Conclusions and Decisions

Significance of Unemployment

Unemployment is a significant preditor of overdose death rates. Our simple linear model agrees with this, however, our goal is to show that this relationship holds when other predictors are considered. We thus fit two kitchen sink linear models: one which includes unemployment, and one which does not. An ESS F-test showed that unemplyment provides significant predictive power. We corroborated this result with the same analysis on two quadratic kitchen sink models (again, one included unemployment while the other did not). The resulting ESS F-test agreed with that of the linear models.

We further explored this result with two additional models: a LASSO regression model and a random forest regression model. The LASSO model showed that unemployment is one of the later predictors to shrink to zero. This coroborates its significance. The random forest model, however, gave much more importance to state, and other non-monthly variables. We can control for these varaibles with a mixed effects model.

Our mixed effects model controled for state and year grouping (with random intercepts) as well as included a random slope for unemployment. The ANOVA table of the resulting model (though limited due to data contraints) showed that unemployment is likely significant. The coefficient for unemployment in this model is negative. We are therefore uncertain as to the association between unemployment and overdose deaths.

Prediction

RMSE of all models tested. The mixed effects model had lowest training and testing error.

RMSE of all models tested. The mixed effects model had lowest training and testing error.

The least predictive model is the simple linear model. The rest of the linear models (including the quadratic and LASSO-regularized models) performed very similarlly on the test and train sets. This indicates that overfitting is not a serious concern for any of our models. The random forest model and the mixed effects model performed the best reducing RMSE by more than half that of the multiples linear models.

Direction of Future Research

Sample Size

Month is the most important fixed effect in the mixed model. The predictive ability of the model may increase if month is included as a grouping variable. In our analysis, this was not possible due to limited sample size. In the future, it will be possible to fit such a model.

Choice of Predictors

The insignificance of many of our predictors in the mixed model indicates that we are likely missing important variables for predicting overdose death rates. Some potential predictors are: average temperature, crime rate, high school graduation rate, and literacy rate among others. As of now, these predictors are either not readily available online, not current, or not available in a reasonably frequent timeserie. As more data comes out, variables such as these may add significant predictive power.

References

Ghertner, R., and L. Groves. 2018. “The Opioid Crisis and Economic Opportunity: Geographic and Economic Trends.” Office of the Assistant Secretary for Planning and Execution 24 (September).

H. Hedegaard, A. M. Miniño, and M. Warner. 2018. “Drug Overdose Deaths in the United States, 1999 - 2017.” Centers for Disease Control and Prevention 329 (November).

Nolan, Dan. 2016. “How Bad is the Opioid Epidemic?” Edited by Chris Amico, February. PBS.

“Opioid Summaries by State.” 2019. National Institute of Health; National Institute on Drug Abuse.

“Overview of BLS Statistics on Unemployment.” 2019. USDOL; Bureau of Labor Statistics.

Puja Seth, Rita Noonan, Rose Rudd. 2018. “Quantifying the Epidemic of Prescription Opioid Overdose Deaths.” American Journal of Public Health 108 (4).

“VSRR Provisional Drug Overdose Death Counts.” 2019. Centers for Disease Control; Prevention; National Center for Health Statistics.