Chapter 8 Limited Dependent Variables | Applied Microeconometrics with R

8.1 Introduction

Limited response variables: Mixed discrete and continuous features of the dependent variable.

Typical sources: Censoring and truncation.

Economic explanations:

Corner solutions: Utility-maximizing choice of individuals is at the corner of the budget set, typically zero.
Sample selection: Data deficiency causes response to be unknown (or different) for subsample.
Treatment effects: Response for each individual is only observable for one level of a “treatment” variable.

Examples: Typical economic examples for limited responses.

Labor supply.
\(y_i\): Number of hours worked by person \(i\).
Potential covariates: Education, non-work income, …
Expenditures for health services.
\(y_i\): Expenditures of person \(i\) for health services last month.
Potential covariates: Health status, income, gender, …
Wage equations.
\(y_i\): Earnings of person \(i\) derived from tax returns.
Potential covariates: Education, previous experience, …
Unemployment programs.
\(y_i\): Wage of person \(i\) after unemployment.
Treatment: Training program (potentially non-random assignment).
Potential covariates: Occupation, duration of unemployment, …

8.1.1 Example: PSID 1976 (Mroz data)

Cross-section data from the 1976 Panel Study of Income Dynamics (PSID), based on data for the previous year, 1975.

A data frame containing 753 observations on 21 variables, including:

Table 8.1: Variables in the PSID (Mroz) data set
Variable	Description
`participation`	Did the individual participate in the labor force in 1975? (equivalent to `wage` \(> 0\) or `hours` \(> 0\))
`hours`	Wife’s hours of work in 1975.
`youngkids`	Number of children less than 6 years old.
`oldkids`	Number of children between ages 6 and 18.
`age`	Wife’s age in years.
`education`	Wife’s education in years.
`wage`	Wife’s average hourly wage in 1975 (USD).
`fincome`	Family income in 1975 (USD).

Example: Corner solution

Figure 8.1: Nonwife income against hours worked - jittered

Example: Sample selection

Figure 8.2: Education against wage - jittered

Figure 8.3: Education against log(wage+0.5) - jittered

8.2 Tobin’s Corner Solution Model

Problem: As motivated above, many microeconomic variables have

non-negative values,
a cluster of observations at zero.

Note: OLS cannot be used sensibly due to

possibly negative predictions,
imposed constant marginal effects (which cannot be remedied by logs as \(\log(0)\) is not defined).

Quantities of interest:

\(\text{P}(y = 0 ~|~ x) = 1 - \text{P}(y > 0 ~|~ x)\), probability of zero.
\(\text{E}(y ~|~ y > 0,~ x)\) expectation conditional on positive \(y\).
\(\text{E}(y ~|~ x) = P(y > 0 ~|~ x) \cdot \text{E}(y ~|~ y > 0,~ x)\) expectation.

Ideas:

Two-part model with binary selection and truncated outcome.
Latent variable driving both selection and outcome.

Tobin’s solution: Employ the latter approach with latent Gaussian variable \(y^*\) and observed discrete-continuous response \(y\):

\[\begin{eqnarray*} y^* & = & x^\top \beta ~+~ \varepsilon, \qquad \varepsilon ~|~ x \sim \mathcal{N}(0, \sigma^2),\\ y\phantom{^*} & = & \max(0, y^*), \end{eqnarray*}\]

i.e., \(y\) is a censored version of \(y^*\).

Likelihood:

\[\begin{equation*} L(\beta, \sigma; y, x) ~=~ \prod_{i = 1}^n f(y_i ~|~ x_i, \beta, \sigma)^{I(y_i > 0)} \cdot \text{P}(y_i = 0 ~|~ x_i)^{I(y_i = 0)} \end{equation*}\]

where

\[\begin{eqnarray*} \text{P}(y_i = 0 ~|~ x_i) & = & \text{P}(y_i^* \le 0 ~|~ x_i) \\ & = & \Phi \left(\frac{0 - x_i^\top \beta}{\sigma} \right) ~=~ 1 - \Phi \left(\frac{x_i^\top \beta}{\sigma} \right) \\ f(y_i ~|~ x_i, \beta, \sigma) & = & \frac{1}{\sigma} \cdot \phi \left( \frac{y_i - x_i^\top \beta}{\sigma} \right) \end{eqnarray*}\]

Remarks:

Model is known as tobit model.
Despite the name it does not belong to the same model class as the binary logit and probit models.
Log-likelihood can be shown to be globally concave.
MLE is well-behaved (asymptotically normal etc.).

In R: tobit() from package AER. Convenience interface to survreg() from package survival which provides much more general regression tools for censored responses.

Alternatively: crch(..., left = 0) from package crch for censored regression with conditional heteroscedasticity. Dedicated interface for tobit-type models with either censoring or truncation, different response distributions (beyond normal).

Example: Regression for annual hours of work in PSID 1976 data, using reduced-form specification. Wage is missing as a regressor (as unobserved for non-working subsample).

Model equation of interest:

hours_f <- hours ~ nwincome + education +
  experience + I(experience^2) + age + youngkids + oldkids

Fitting tobit model and naive OLS models:

hours_tobit <- tobit(hours_f, data = PSID1976)
hours_ols1 <- lm(hours_f, data = PSID1976)
hours_ols2 <- lm(hours_f, data = PSID1976,
  subset = participation == "yes")

Compare coefficients:

modelsummary(list("Tobit" = hours_tobit, 
                  "OLS (all)" = hours_ols1,
                  "OLS (positive)" = hours_ols2), 
             fmt=3, estimate="{estimate}{stars}")

tinytable_wmnx867g6gj7a6cvqzuh

	Tobit	OLS (all)	OLS (positive)
(Intercept)	965.305*	1330.482***	2056.643***
	(446.436)	(270.785)	(346.484)
nwincome	-8.814*	-3.447	0.444
	(4.459)	(2.544)	(3.613)
education	80.646***	28.761*	-22.788
	(21.583)	(12.955)	(16.434)
experience	131.564***	65.673***	47.005**
	(17.279)	(9.963)	(14.556)
I(experience^2)	-1.864***	-0.700*	-0.514
	(0.538)	(0.325)	(0.437)
age	-54.405***	-30.512***	-19.664***
	(7.419)	(4.364)	(5.894)
youngkids	-894.022***	-442.090***	-305.721**
	(111.878)	(58.847)	(96.450)
oldkids	-16.218	-32.779	-72.367*
	(38.641)	(23.176)	(30.361)
Num.Obs.	753	753	428
R2		0.266	0.140
R2 Adj.		0.259	0.126
AIC	7656.2	12117.1	6863.2
BIC	7697.8	12158.7	6899.7
Log.Lik.		-6049.534	-3422.581
F		38.495	9.792
RMSE	953.68	746.18	718.92

8.2.1 Truncated normal distribution

Question: How can these models be interpreted?

Needed: Better understanding of truncated (normal) distributions.

Probability density function: Random variable \(y\) with density function \(f(y)\), truncated from below at \(c\). General form and \(\mathcal{N}(\mu, \sigma^2)\) distribution.

\[\begin{eqnarray*} f(y ~|~ y > c) & = & \frac{f(y)}{\text{P}(y > c)} ~=~ \frac{f(y)}{1 - F(c)} \\ & = & \frac{1}{\sigma} \cdot \phi\left( \frac{y - \mu}{\sigma} \right) \left/ \left\{ 1 - \Phi \left(\frac{c - \mu}{\sigma} \right) \right\} \right. \end{eqnarray*}\]

Example: Standard normal distribution truncated at \(c = -1\) and \(c = 0\).

Figure 8.4: Standard normal distribution truncated at two different points

Expectation: For \(\varepsilon \sim \mathcal{N}(0, 1)\).

\[\begin{eqnarray*} \text{E}(\varepsilon ~|~ \varepsilon > c) & = & \int_c^\infty \varepsilon \cdot f(\varepsilon ~|~ \varepsilon > c) ~d \varepsilon \\ & = & \frac{1}{1 - \Phi(c)} ~ \int_c^\infty \varepsilon \cdot \phi(\varepsilon) ~d \varepsilon \\ & = & \frac{1}{1 - \Phi(c)} ~ \left\{ \left. - \phi(\varepsilon) \phantom{\frac{.}{.}} \! \right|_c^\infty \right\} \\ & = & \frac{\phi(c)}{1 - \Phi(c)} \end{eqnarray*}\]

Because \(\phi'(x) = -x \cdot \phi(x)\).

Note: This simple solution does not hold in general.

Inverse Mills ratio: Defined as

\[\begin{equation*} \lambda(x) ~=~ \frac{\phi(x)}{\Phi(x)}. \end{equation*}\]

Hence, due to symmetry of the normal distribution, the following holds for \(\varepsilon \sim \mathcal{N}(0, 1)\) and \(y \sim \mathcal{N}(\mu, \sigma^2)\), respectively:

\[\begin{eqnarray*} \text{E}(\varepsilon ~|~ \varepsilon > c) & = & \lambda(-c), \\ \text{E}(y ~|~ y > c) & = & \mu ~+~ \sigma \cdot \lambda\left(\frac{\mu - c}{\sigma}\right). \end{eqnarray*}\]

Example: \(y \sim \mathcal{N}(\mu, 1)\) with truncation at \(c = 0\).

Figure 8.5: Mean and Inverse Mills ratio

8.2.2 Interpretation of the tobit model

Uninteresting: Expectation of latent variable is straightforward but lacks substantive interpretation.

\[\begin{equation*} \text{E}(y^* ~|~ x) ~=~ x^\top \beta \end{equation*}\]

Of interest: Depending on substantive question, consider

\[\begin{eqnarray*} \text{P}(y > 0 ~|~ x) & = & \Phi(x^\top \beta / \sigma),\\ \text{E}(y ~|~ y > 0,~ x) & = & x^\top \beta ~+~ \sigma \cdot \lambda(x^\top \beta / \sigma),\\ \text{E}(y ~|~ x) & = & \text{P}(y > 0) \cdot \text{E}(y ~|~ y > 0,~ x) \\ & = & \Phi(x^\top \beta / \sigma) \cdot \left\{ x^\top \beta ~+~ \sigma \cdot \lambda(x^\top \beta / \sigma) \right\}. \end{eqnarray*}\]

Example: For \(\sigma = 1\).

Figure 8.6: Expectation of the latent and the observed truncated variable

Marginal effects:

\[\begin{eqnarray*} \frac{\partial \text{P}(y > 0 ~|~ x)}{\partial x_l} & = & \phi(x^\top \beta / \sigma) \cdot \beta_l / \sigma\\[0.5cm] \frac{\partial \text{E}(y ~|~ y > 0,~ x)}{\partial x_l} & = & \beta_l \cdot \left[ 1 - \lambda(x^\top \beta / \sigma) ~ \{ x^\top \beta / \sigma ~+~ \lambda(x^\top \beta / \sigma) \}\right] \\[0.5cm] \frac{\partial \text{E}(y ~|~ x)}{\partial x_l} & = & \frac{\partial \text{P}(y > 0 ~|~ x)}{\partial x_l} \cdot \text{E}(y ~|~ y > 0,~ x) ~+~ \\ & & \text{P}(y > 0 ~|~ x) \cdot \frac{\partial \text{E}(y ~|~ y > 0,~ x)}{\partial x_l} \\ & = & \beta_l \cdot \Phi(x^\top \beta / \sigma) \\ \end{eqnarray*}\]

Remarks:

Overall effect on \(\text{E}(y ~|~ x)\) is sum of an effect at the
- extensive margin (e.g., how much more likely a person is to join the labor force as education increases times the expected hours) and an effect at the
- intensive margin (e.g., how much the expected hours of work increase for workers as education increases times the probability of participation).
OLS estimation (both for full and positive sample) is biased towards zero due to inability to capture non-constant marginal effects.
Effects (e.g., at mean regressors) can again be considered instead of marginal effects.

Example: Effects in tobit model for PSID 1976 data (hand-crafted).

Figure 8.7: Effects in tobit model for PSID 1976 data (hand-crafted)

Figure 8.8: Effects of young- and old kids in tobit model for PSID 1976 data

Specification issues:

The classical (uncensored) OLS regression model is rather robust to misspecifications of distributional form and even non-constant variances. Estimates remain consistent.
In the tobit model, this is not the case. Misspecification of the likelihood (including constant variances) will lead to misspecification of the score. Hence, \(\text{E}(y ~ |~ y > 0,~ x)\) and \(\text{E}(y ~|~ x)\) are misspecified and estimates inconsistent.
Crucial assumption: The same latent process drives the probability of a corner solution and the expectation of positive outcomes.
Example: A single factor, such as number of old kids, can not increase the probability to work but decrease the expected number of hours conditional on participation in the labor force.
Check for the latter: Cragg two-part model.

8.3 Cragg Two-Part Model

Idea: Similar to hurdle model for count data, employ two parts.

Is \(y\) equal to zero or positive?
If \(y > 0\), how large is \(y\)?

Formally: Likelihood has two separate parts.

\(\text{P}(y > 0 ~|~ x)\): Binomial model (typically probit) for \(I(y > 0)\).
\(\text{E}(y ~|~ y > 0,~ x)\): Truncated normal model for \(y\) given \(y > 0\).

Remarks:

The tobit model is nested in the two-part model with probit link. Thus, LR tests etc. can be easily performed.
If the tobit model is correctly specified, estimates from the probit model \(\hat \gamma\) would be consistent (but less efficient) for the scaled coefficients from the tobit model. Thus, \(\hat \beta / \hat \sigma\) should be similar to \(\hat \gamma\) in empirical samples.

In R: Employ separate modeling function. glm() with "probit" link can be used for selection part.

part_f <- update(hours_f, participation ~ .)
part_probit <- glm(part_f, data = PSID1976,
  family = binomial(link = "probit"))
coeftest(part_probit)

## 
## z test of coefficients:
## 
##                 Estimate Std. Error z value Pr(>|z|)
## (Intercept)      0.27007    0.50808    0.53   0.5950
## nwincome        -0.01202    0.00494   -2.43   0.0149
## education        0.13090    0.02540    5.15  2.6e-07
## experience       0.12335    0.01876    6.58  4.8e-11
## I(experience^2) -0.00189    0.00060   -3.15   0.0017
## age             -0.05285    0.00846   -6.25  4.2e-10
## youngkids       -0.86833    0.11838   -7.34  2.2e-13
## oldkids          0.03601    0.04403    0.82   0.4135

truncreg() from package truncreg (or trch() from package crch) can be employed for estimating the truncated normal regression model.

library("truncreg")
hours_trunc <- truncreg(hours_f, data = PSID1976,
  subset = participation == "yes")
coeftest(hours_trunc)

## 
## z test of coefficients:
## 
##                 Estimate Std. Error z value Pr(>|z|)
## (Intercept)     2055.713    461.265    4.46  8.3e-06
## nwincome          -0.501      4.935   -0.10  0.91912
## education        -31.270     21.796   -1.43  0.15139
## experience        73.007     20.229    3.61  0.00031
## I(experience^2)   -0.970      0.581   -1.67  0.09521
## age              -25.336      7.885   -3.21  0.00131
## youngkids       -318.852    138.073   -2.31  0.02093
## oldkids          -91.620     41.202   -2.22  0.02617
## sigma            822.479     39.320   20.92  < 2e-16

Comparison of (scaled) parameter estimates from tobit, probit, and truncated model.

cbind("Censored"  = coef(hours_tobit) / hours_tobit$scale,
      "Threshold" = coef(part_probit),
      "Truncated" = coef(hours_trunc)[1:8] / coef(hours_trunc)[9])

##                  Censored Threshold  Truncated
## (Intercept)      0.860327  0.270074  2.4994098
## nwincome        -0.007856 -0.012024 -0.0006093
## education        0.071875  0.130904 -0.0380188
## experience       0.117256  0.123347  0.0887641
## I(experience^2) -0.001661 -0.001887 -0.0011788
## age             -0.048488 -0.052852 -0.0308044
## youngkids       -0.796795 -0.868325 -0.3876719
## oldkids         -0.014454  0.036006 -0.1113943

LR test and information criteria:

loglik <- c("Tobit" = logLik(hours_tobit),
  "Two-Part" = logLik(hours_trunc) + logLik(part_probit))
df <- c(9, 9 + 8)
-2 * loglik + 2 * df

##    Tobit Two-Part 
##     7656     7620

-2 * loglik + log(nrow(PSID1976)) * df

##    Tobit Two-Part 
##     7698     7698

pchisq(2 * diff(loglik), diff(df), lower.tail = FALSE)

##  Two-Part 
## 1.273e-08

Figure 8.9: Education and oldkids effects in Tobit and Cragg model

8.4 Sample Selection Models

Idea: Employ censored regression model for observations \(y ~|~ y > c\) but model censoring threshold \(c\) as random variable.

Name: Incidental censoring or self selection.

Example: Distribution of wages.

Wages are only observed for workers.
Thus, the wages that non-workers would receive (if they decided to work) are unknown/latent.
Individuals decide to work if their (possibly latent) offered wage \(w_0\) exceeds their individual reservation wage \(w_r\).
The distribution of wages for workers are thus \(f(w_o ~|~ w_o > w_r)\) with random \(w_r\).

Formally: With \(y\) and \(c\) stochastic rewrite

\[\begin{equation*} f(y ~|~ y > c) ~=~ f(y ~|~ y - c > 0) ~=~ f(y_1 ~|~ y_2 > 0). \end{equation*}\]

Special cases:

\(y_1 = y_2\): Standard censoring regression (i.e., tobit).
\(y_1\) and \(y_2\) independent: \(f(y_1 ~|~ y_2 > 0) = f(y_1)\) (i.e., no selection).

Simple approach: Bivariate normal distribution for \((y_1, y_2)\) with correlation \(\varrho\), including special cases \(|\varrho| = 1\) (identity) and \(\varrho = 0\) (independence).

Alternatively: Employ different bivariate distribution or semiparametric approach.

Regression model: Employ (potentially overlapping) regressors for mean of both components.

\[\begin{eqnarray*} \mbox{Outcome equation:} & y_1 ~= & x^\top \beta ~+~ \varepsilon_1, \\ \mbox{Selection equation:} & y_2 ~= & z^\top \gamma ~+~ \varepsilon_2, \end{eqnarray*}\]

where

\[\begin{equation*} \left( \begin{array}{cc} \varepsilon_1 \\ \varepsilon_2 \end{array} \right) ~\sim~ \mathcal{N}\left( \left( \begin{array}{cc} 0 \\ 0 \end{array} \right), \left( \begin{array}{cc} \sigma^2 & \varrho \\ \varrho & 1 \end{array} \right) \right). \end{equation*}\]

Remarks:

Observation rule: \(y_1\) is non-censored, observed for \(y_2 > 0\).
In selection equation, only \(y_2 > 0\) vs. \(y_2 \le 0\) is observed. Hence, standardize \(\text{Var}(y_2) = 1\) for identifiability (as in probit model).
Terminology: Heckman model, heckit model, sample selection model, self-selection model, tobit-2 model.

Conditional expectation:

\[\begin{equation*} \text{E}(y_1 ~|~ y_2 > 0,~ x) ~=~ x^\top \beta ~+~ \sigma \cdot \varrho \cdot \lambda(z^\top \gamma). \end{equation*}\]

Selection effect: Sample is

positively selected for \(\varrho > 0\): \(\text{E}(y_1 ~|~ y_2 > 0,~ x) > \text{E}(y_1 ~|~ x)\),
negatively selected for \(\varrho < 0\),
and no selection occurs for \(\varrho = 0\).

Estimation:

Efficient: Full maximum likelihood.
Biased: OLS for non-censored observations.
Consistent (and employed only for historical reasons): Two-step estimator. Probit model yielding \(\hat \gamma\) followed by OLS regressing \(y\) on \(x\) and \(\lambda(z^\top \hat \gamma)\).

Classical example: Estimation of mean wage offers (or worker’s productivity) from a sample of workers and wages.

In this context: Sample selection model called Gronau-Heckman-Roy model (by Winkelmann & Boes 2009) due to the authors that first recognized this problem and established its inference, respectively.

Identification: Economic considerations guide selection of variables \(x\) that affect the outcome and variables \(z\) that affect the selection.

Typically, all variables \(x\) for the outcome also affect selection, especially in the case of “self selection”.
Additional variables might affect the selection but not the outcome. Example: Presence of small children will affect selection but probably not the reservation wage.
Such variables play a similar role as “instruments”.

Example: Female wages in PSID 1976 data. Employ semi-logarithmic wage equation with education and quadratic polynomial in experience.

wage_f <- log(wage) ~ education + experience + I(experience^2)

OLS estimation: For selected subsample.

wage_ols <- lm(wage_f, data = PSID1976, subset = participation == "yes")

Sample selection model: Additionally employ age, number of old and young children, and other income as regressors in selection equation (corresponding to previously used part_f).

In R available in selection() from package sampleSelection:

library("sampleSelection")
wage_ghr <- selection(part_f, wage_f, data = PSID1976)

Sample selection models

coeftest(wage_ghr)

## 
## z test of coefficients:
## 
##                  Estimate Std. Error z value Pr(>|z|)
## (Intercept)      0.266449   0.508958    0.52   0.6006
## nwincome        -0.012132   0.004877   -2.49   0.0129
## education        0.131341   0.025382    5.17  2.3e-07
## experience       0.123282   0.018724    6.58  4.6e-11
## I(experience^2) -0.001886   0.000600   -3.14   0.0017
## age             -0.052829   0.008479   -6.23  4.7e-10
## youngkids       -0.867399   0.118651   -7.31  2.7e-13
## oldkids          0.035872   0.043475    0.83   0.4093
## (Intercept)     -0.552696   0.260379   -2.12   0.0338
## education        0.108350   0.014861    7.29  3.1e-13
## experience       0.042837   0.014879    2.88   0.0040
## I(experience^2) -0.000837   0.000417   -2.01   0.0449
## sigma            0.663398   0.022707   29.21  < 2e-16
## rho              0.026607   0.147078    0.18   0.8564

Interpretation: \(\hat \varrho\) is essentially zero signaling that there are no significant selection effects. Outcome part of Gronau-Heckman-Roy model and OLS for selected subsample are thus virtually identical.

cbind("GHR (outcome)" = coef(wage_ghr, part = "outcome"),
  "OLS (positive)" = coef(wage_ols))

##                 GHR (outcome) OLS (positive)
## (Intercept)        -0.5526963     -0.5220406
## education           0.1083502      0.1074896
## experience          0.0428368      0.0415665
## I(experience^2)    -0.0008374     -0.0008112

Similarly: Selection part of Gronau-Heckman-Roy model and probit for selection are thus virtually identical.

cbind("GHR (selection)" = coef(wage_ghr)[1:8],
  "Probit (0 vs. positive)" = coef(part_probit))

##                 GHR (selection) Probit (0 vs. positive)
## (Intercept)            0.266449                0.270074
## nwincome              -0.012132               -0.012024
## education              0.131341                0.130904
## experience             0.123282                0.123347
## I(experience^2)       -0.001886               -0.001887
## age                   -0.052829               -0.052852
## youngkids             -0.867399               -0.868325
## oldkids                0.035872                0.036006

Effects: Compare education effects graphically.

Set up auxiliary data with average regressor and varying education.

X <- model.matrix(hours_tobit)[, -c(1, 5)]
psid <- lapply(colnames(X), function(i)
  if(i != "education") mean(X[,i]) else
    seq(from = min(X[,i]), to = max(X[,i]), length = 100))
names(psid) <- colnames(X)
psid <- do.call("data.frame", psid)

Predictions for OLS:

wage_ols_out <- predict(wage_ols, newdata = psid)

Compute predictions for sample selection model by hand:

xb <- model.matrix(delete.response(terms(wage_f)),
  data = psid) %*% coef(wage_ghr, part = "outcome")
zg <- model.matrix(delete.response(terms(part_f)), data = psid) %*%
    head(coef(wage_ghr), 8)    
wage_ghr_out <- xb + coef(wage_ghr)["sigma"] *
  coef(wage_ghr)["rho"] * dnorm(zg)/pnorm(zg)

Figure 8.10: Sample selection model: education against log(wage) - jittered