Chapter 4 Binary Response Models

4.1 Introduction

In this chapter we examine binary response models, in which the dependent variable can only take the values zero and one. Typical economic examples of binary variables include:

  • Labor market participation of women.
    \(y_i\): Woman \(i\) works/does not work.
    Potential covariates: Age, education, children, …
  • Who is a member of the workers’ union?
    \(y_i\): Person \(i\) is/is not a member of the workers’ union.
    Potential covariates: Age, education, occupation, …
  • Default of consumer credits, credit scoring.
    \(y_i\): Person \(i\) pays/does not pay credit back.
    Potential covariates: Income, credit history, maturity, occupation, age, marital status, …
  • Choice of mode of transportation.
    \(y_i\): Person \(i\) uses/does not use public transport to travel to work.
    Potential covariates: Distance, costs, …

Microeconomic theory often assumes continuous choice within convex budget sets. However, in practice many choices are necessarily discrete, e.g., the purchase of a car. Furthermore, even continuous, ordered, or multinomial variables are sometimes transformed to binary variables. For example, a binary variable happy/unhappy could be derived from happiness on a scale 0–10, where a score up to 5 is assigned the value “unhappy” and a score above 5 counts as “happy”, as in the sketch below. In such situations we lose information; however, this simplification often provides a good first answer due to its simple specification and its ease of interpretation.
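
As a minimal illustration (assuming a hypothetical numeric vector score with happiness ratings on a 0–10 scale, not data used in this chapter), such a binary recoding could be done in R as follows:

# Hypothetical happiness ratings on a 0-10 scale
score <- c(2, 5, 6, 8, 10, 3)
# Scores up to 5 count as "unhappy", scores above 5 as "happy"
happy <- cut(score, breaks = c(-Inf, 5, Inf), labels = c("unhappy", "happy"))
table(happy)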

4.2 GLMs for Binary Response Variables

As \(y_i \in \{0, 1\}\), the only conceivable model is the Bernoulli probability function with

\[\begin{equation*} f(y_i ~|~ x_i) ~=~ \pi_i^{y_i} ~ (1 - \pi_i)^{1 - y_i}, \end{equation*}\]

where \(\pi_i = \text{Prob}(y_i = 1 ~|~ x_i)\), the conditional probability of observing an outcome of one. Hence, conditional mean and conditional variance are:

\[\begin{eqnarray*} \text{E} (y_i ~|~ x_i) & = & \pi_i,\\ \text{Var}(y_i ~|~ x_i) & = & \pi_i ~ (1 - \pi_i). \end{eqnarray*}\]

For the parameterization, typically the GLM approach is employed, i.e., \(g(\pi_i) ~=~ x_i^\top \beta\), where \(g: [0, 1] \rightarrow \mathbb{R}\) is the link function; equivalently, \(\pi_i = H(x_i^\top \beta)\) with the inverse link function \(H = g^{-1}\).
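
As a small sketch of the two most common choices (standard base R functions, not specific to this text): the logit link uses the logistic quantile and distribution functions, the probit link the standard normal ones.

# Logit link: g = qlogis (quantile function), inverse link H = plogis (CDF)
qlogis(0.75)            # log-odds corresponding to a probability of 0.75
plogis(qlogis(0.75))    # back-transformation recovers the probability
# Probit link: g = qnorm, inverse link H = pnorm
pnorm(qnorm(0.75))
# Both links are also available as family objects for glm()
binomial(link = "logit")$linkfun(0.75)
binomial(link = "probit")$linkinv(qnorm(0.75))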

4.2.2 Interpretation of parameters

In the previous section, we have seen that the exponentiated coefficient \(\exp(\beta_l)\) for \(l>1\) in a logit model can be interpreted as the relative change in the odds of “success” for a 1-unit increase in the corresponding regressor \(x_l\), ceteris paribus. There are two problems with this interpretation: Firstly, it only works for models with a logit link; for probit models, no such ceteris paribus interpretation is available. Secondly, the interpretation is in terms of odds rather than probabilities, although most practitioners find probabilities easier to interpret.

The intercept \(\beta_1\) is interpreted in logit models by applying the inverse link transformation \(\Lambda(\beta_1)\) and interpreting the result as the probability of “success” in the case where all \(x_j = 0\) for \(j > 1\). This interpretation of the intercept works analogously in probit models, with \(\Phi(\cdot)\) as the inverse link transformation. Notice that this pertains to only one specific combination of regressors, which may or may not be realistic. For example, the regressor age is unlikely to take the value zero in data sets concerning adults.

The following graphics illustrate the different effects the intercept \(\beta_1\) and a single slope coefficient \(\beta_2\) have on probability, depending on \(x\).

Figure 4.2: Effects of Coefficients on Probability

We know that in the linear regression model, the parameters \(\beta_l\) capture the expected change in the response variable for a one-unit change in the corresponding regressor:

\[\begin{equation*} \frac{\partial \text{E} (y_i ~|~ x_i)}{\partial x_{il}} ~=~ \frac{\partial (x_i^\top \beta)}{\partial x_{il}} ~=~ \beta_l. \end{equation*}\]

In the binomial GLM, however, the partial derivatives, and thus the marginal effects, are not constant:

\[\begin{equation*} \frac{\partial \text{E}(y_i ~|~ x_i)}{\partial x_{il}} ~=~ \frac{\partial H(x_i^\top \beta)}{\partial x_{il}} ~=~ h(x_i^\top \beta) \cdot \beta_l. \end{equation*}\]

Some important properties of these marginal probability effects (MPEs) are:

  • The sign of the marginal effect is equal to the sign of \(\beta_l\).
  • The effect is largest for \(x_i^\top \beta = 0\) for symmetric unimodal \(h(\cdot)\).
  • The effect varies among individuals.

In practice, we usually report “typical” effects by taking expectations. This can be done in two ways: we either estimate the expected marginal probability effect (MPE) \(\text{E}_x(h(x_i^\top \beta)) \beta_l\) by the average MPE (AMPE) \(1/n \sum_{i = 1}^n h(x_i^\top \beta) \beta_l\), or we estimate the MPE at the expected regressor \(h(\text{E}(x)^\top \beta) \beta_l\) by the MPE at the mean regressor \(h(\bar x^\top \beta) \beta_l\). Remember that taking means can be plausible for continuous regressors but is hard to interpret for categorical regressors, and thus also for the dummy variables pertaining to them.

A better solution is to evaluate the MPE in different groups (e.g., foreign and Swiss, men and women), or, alternatively, to assess the influence graphically by plotting \(\text{E}(y ~|~ x)\) against \(x_l\), while keeping all regressors except the \(l\)-th at a typical value such as \(\bar x_{-l}\). In other words, we examine the marginal effect of one chosen regressor while fixing all other regressors at their means.

Keep in mind that all previous computations assume a one-to-one relationship between variables and regressors \(x_{il}\). This is violated when a variable occurs in multiple regressors, e.g., in interactions or polynomials. For example, in the case of a quadratic polynomial \(\text{E} (y ~|~ x) = H(\beta_1 + \beta_2 x + \beta_3 x^2)\):

\[\begin{equation*} \frac{\partial \text{E} (y ~|~ x)}{\partial x} ~=~ h(\beta_1 + \beta_2 x + \beta_3 x^2) \cdot (\beta_2 + 2 \beta_3 x). \end{equation*}\]

This is typically ignored by software packages that offer automatic computation of marginal effects. In R, we can employ the margins package, which correctly deals with polynomials.

4.2.3 Example: Swiss labor participation

In this section, we are going to analyze cross-section data on Swiss labor participation, originating from the health survey SOMIPOPS for Switzerland in 1981 (Gerfin 1996). The data frame contains 872 observations on the following 7 variables:

Table 4.1: Variables in the Swiss Labor Dataset

Variable       Description
participation  Factor. Did the individual participate in the labor force?
income         Logarithm of nonlabor income.
age            Age in decades (years divided by 10).
education      Years of formal education.
youngkids      Number of young children (under 7 years).
oldkids        Number of older children (over 7 years).
foreign        Factor. Is the individual a foreigner?

Figure 4.3: Labor Participation Depending on Age and Education

  1. Estimation: First, we fit a logit model with all explanatory variables (~ . in R), and a quadratic polynomial in age.
data("SwissLabor", package = "AER")
swiss_logit <- glm(participation ~ . + I(age^2),
  data = SwissLabor, family = binomial)
summary(swiss_logit)
## 
## Call:
## glm(formula = participation ~ . + I(age^2), family = binomial, 
##     data = SwissLabor)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   6.1964     2.3831    2.60   0.0093
## income       -1.1041     0.2257   -4.89  1.0e-06
## age           3.4366     0.6879    5.00  5.9e-07
## education     0.0327     0.0300    1.09   0.2761
## youngkids    -1.1857     0.1720   -6.89  5.5e-12
## oldkids      -0.2409     0.0845   -2.85   0.0043
## foreignyes    1.1683     0.2038    5.73  9.9e-09
## I(age^2)     -0.4876     0.0852   -5.72  1.0e-08
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1203.2  on 871  degrees of freedom
## Residual deviance: 1017.6  on 864  degrees of freedom
## AIC: 1034
## 
## Number of Fisher Scoring iterations: 4
  2. Prediction: Then, we compute predictions of the response variable \(\hat \pi_i\). (The default is prediction of the linear predictor \(\hat \eta_i\).)
predict(swiss_logit, data.frame(income = 11, age = 3.3,
  education = 12, youngkids = 1, oldkids = 1, foreign = "no"),
  type = "response")
##      1 
## 0.2783
predict(swiss_logit, data.frame(income = 11, age = 3.3,
  education = 12, youngkids = 1, oldkids = 1, foreign = "yes"),
  type = "response")
##      1 
## 0.5536
  3. Odds ratios: We compute the ratio of the odds of working between foreign and Swiss women, first without adjusting for other regressors. One way to do this is to compute an empirical contingency table and the associated proportions, odds, and odds ratio. A second way is to fit a binomial GLM with logit link and take the exponential of the estimated coefficient of the foreign indicator variable, which leads to the same result:
(tab <- xtabs(~ foreign + participation, data = SwissLabor))
##        participation
## foreign  no yes
##     no  402 254
##     yes  69 147
(tab <- prop.table(tab, 1))
##        participation
## foreign     no    yes
##     no  0.6128 0.3872
##     yes 0.3194 0.6806
(tab <- tab[,2]/tab[,1])
##     no    yes 
## 0.6318 2.1304
tab[2]/tab[1]
##   yes 
## 3.372
aux <- glm(participation ~ foreign, data = SwissLabor,
  family = binomial)
aux
## 
## Call:  glm(formula = participation ~ foreign, family = binomial, data = SwissLabor)
## 
## Coefficients:
## (Intercept)   foreignyes  
##      -0.459        1.215  
## 
## Degrees of Freedom: 871 Total (i.e. Null);  870 Residual
## Null Deviance:       1200 
## Residual Deviance: 1150  AIC: 1150
exp(coef(aux)[2])
## foreignyes 
##      3.372

We can also compute odds ratios for a one-unit change in each of the remaining regressors (ceteris paribus).

swiss_cf <- coef(swiss_logit)
exp(swiss_cf[- c(1, 3, 8)])
##     income  education  youngkids    oldkids foreignyes 
##     0.3315     1.0332     0.3055     0.7859     3.2167

As mentioned earlier, we need to be careful with polynomials because the assumption of a one-to-one relationship between the variable and the regressors is violated when the variable occurs in multiple regressors. In this model, for example, due to the age polynomial, the odds ratio depends on age itself. Consider a one-year change in age for a 30-year-old and a 50-year-old woman (recall that age is measured in decades).

exp(0.1 * swiss_cf[3] + (3.1^2 - 3.0^2) * swiss_cf[8])
##   age 
## 1.047
exp(0.1 * swiss_cf[3] + (5.1^2 - 5.0^2) * swiss_cf[8])
##    age 
## 0.8617

We can use the margins package to compute the average marginal probability effects (AMPEs):

library("margins")
swiss_mfx <- margins(swiss_logit)
summary(swiss_mfx)
##      factor     AME     SE       z      p   lower   upper
##         age -0.0820 0.0167 -4.9099 0.0000 -0.1148 -0.0493
##   education  0.0065 0.0060  1.0917 0.2750 -0.0052  0.0182
##  foreignyes  0.2431 0.0407  5.9774 0.0000  0.1634  0.3228
##      income -0.2204 0.0428 -5.1435 0.0000 -0.3044 -0.1364
##     oldkids -0.0481 0.0166 -2.9017 0.0037 -0.0806 -0.0156
##   youngkids -0.2367 0.0309 -7.6719 0.0000 -0.2972 -0.1762
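
As a plausibility check (a minimal sketch, not part of the original analysis), the AMPE formula from above can be evaluated by hand and compared with the margins output. For income, which enters linearly, the AMPE is \(1/n \sum_i h(x_i^\top \hat\beta) \hat\beta_{\mathsf{income}}\); for age, the chain rule for the quadratic polynomial has to be applied. The results should be close to the AMEs reported above.

# AMPE for income: average logistic density at the linear predictor times the coefficient
eta <- predict(swiss_logit, type = "link")
mean(dlogis(eta)) * swiss_cf["income"]
# AMPE for age: apply the chain rule to the quadratic polynomial in age
mean(dlogis(eta) * (swiss_cf["age"] + 2 * swiss_cf["I(age^2)"] * SwissLabor$age))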
  4. Effects: To assess our findings graphically, we visualize the change in probability for a partial (one-unit) change in one regressor, given typical values for all other regressors. To make R understand that age and I(age^2) should be treated as a single term, we use poly(age, 2, raw = TRUE) instead of adding the latter as an additional regressor.
swiss_logit2 <- glm(participation ~ income + education + 
  poly(age, 2, raw = TRUE) + youngkids + oldkids + foreign,
  data = SwissLabor, family = binomial)

Visualization using the effects package:

library("effects")
swiss_eff2 <- allEffects(swiss_logit2)
plot(swiss_eff2, rescale.axis = FALSE)

Figure 4.4: Effects of Explanatory Variables on Labor Force Participation - Logit Model

  5. Inference: Next we can test whether a quadratic polynomial in education significantly improves the model:
  • Wald test:
library("lmtest")
swiss_logit3 <- glm(participation ~ . + I(age^2) + I(education^2),
  data = SwissLabor, family = binomial)
coeftest(swiss_logit3)[9,]
##   Estimate Std. Error    z value   Pr(>|z|) 
##   0.009812   0.006002   1.634796   0.102092
waldtest(swiss_logit, swiss_logit3, test = "Chisq")
## Wald test
## 
## Model 1: participation ~ income + age + education + youngkids + oldkids + 
##     foreign + I(age^2)
## Model 2: participation ~ income + age + education + youngkids + oldkids + 
##     foreign + I(age^2) + I(education^2)
##   Res.Df Df Chisq Pr(>Chisq)
## 1    864                    
## 2    863  1  2.67        0.1
  • Likelihood ratio test (via analysis of deviance):
anova(swiss_logit, swiss_logit3, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: participation ~ income + age + education + youngkids + oldkids + 
##     foreign + I(age^2)
## Model 2: participation ~ income + age + education + youngkids + oldkids + 
##     foreign + I(age^2) + I(education^2)
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       864       1018                     
## 2       863       1015  1     2.65      0.1
  • Likelihood ratio test (via generic lmtest approach):
lrtest(swiss_logit, swiss_logit3)
## Likelihood ratio test
## 
## Model 1: participation ~ income + age + education + youngkids + oldkids + 
##     foreign + I(age^2)
## Model 2: participation ~ income + age + education + youngkids + oldkids + 
##     foreign + I(age^2) + I(education^2)
##   #Df LogLik Df Chisq Pr(>Chisq)
## 1   8   -509                    
## 2   9   -507  1  2.65        0.1
  6. The probit model: Probit estimation leads to virtually identical results.
library("modelsummary")
swiss_probit <- glm(participation ~ income + education + 
  poly(age, 2, raw = TRUE) + youngkids + oldkids + foreign,
  data = SwissLabor, family = binomial(link = "probit"))
modelsummary(list("Logit model" = swiss_logit2, "Probit model" = swiss_probit), 
             fmt = 3, estimate = "{estimate}{stars}")
                           Logit model   Probit model
(Intercept)                6.196**       3.749**
                           (2.383)       (1.407)
income                     −1.104***     −0.667***
                           (0.226)       (0.132)
education                  0.033         0.019
                           (0.030)       (0.018)
poly(age, 2, raw = TRUE)1  3.437***      2.075***
                           (0.688)       (0.405)
poly(age, 2, raw = TRUE)2  −0.488***     −0.294***
                           (0.085)       (0.050)
youngkids                  −1.186***     −0.714***
                           (0.172)       (0.100)
oldkids                    −0.241**      −0.147**
                           (0.084)       (0.051)
foreignyes                 1.168***      0.714***
                           (0.204)       (0.121)
Num.Obs.                   872           872
AIC                        1033.6        1033.2
BIC                        1071.7        1071.3
Log.Lik.                   −508.785      −508.577
F                          20.055        22.819
RMSE                       0.45          0.45
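
The estimated logit coefficients are roughly 1.6 times as large as the probit coefficients. This reflects the different scaling of the logistic and the standard normal distribution and is close to the ratio of the two densities at zero: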
coef(swiss_logit2)/coef(swiss_probit)
##               (Intercept)                    income                 education 
##                     1.653                     1.655                     1.702 
## poly(age, 2, raw = TRUE)1 poly(age, 2, raw = TRUE)2                 youngkids 
##                     1.656                     1.657                     1.660 
##                   oldkids                foreignyes 
##                     1.639                     1.635
dnorm(0)/dlogis(0)
## [1] 1.596
swiss_eff <- allEffects(swiss_probit)
plot(swiss_eff, rescale.axis = FALSE)

Figure 4.5: Effects of Explanatory Variables on Labor Force Participation - Probit Model

4.2.4 The linear probability model

Instead of using binary GLMs, many practitioners prefer running linear regressions using OLS on a 0/1-coded response. This is also known as the linear probability model. The main advantage is that coefficients can be directly interpreted as marginal changes in the success probabilities without having to consider any relative changes in odds or other nonlinear transformations.

However, this advantage does not come for free. It is clear that OLS cannot be an efficient estimator because binary observations (i.e., from a Bernoulli distribution) are necessarily heteroscedastic and the variance changes along with the mean: \(Var(y_i ~|~ x_i) = \pi_i \cdot (1 - \pi_i)\). But practitioners argue that even under heteroscedasticity, the OLS estimator is consistent, and that inference can easily be adjusted by using “robust” sandwich standard errors.

This consistency is only obtained, though, under the assumption that the mean equation \(\text{E}(y_i ~|~ x_i) = \pi_i = x_i^\top \beta\) is correctly specified. As the success probability is bounded, \(\pi_i \in [0, 1]\), while the linear predictor \(x_i^\top \beta\) is not, this assumption cannot hold in general. Moreover, as the variance is a direct transformation of the expectation in binary regressions for cross-section data, it is not possible to specify the expectation correctly while misspecifying the variance. Thus, if the variance is misspecified, so is the expectation, and OLS is inconsistent. Conversely, if the expectation is correctly specified, the variance is known as well, and it is not necessary to employ “robust” sandwich standard errors.

Thus, from a theoretical perspective it is hard to justify the linear probability model. However, a more pragmatic take is that the linearity assumption for the expectation may actually work well enough in practice, at least if the fitted probabilities avoid the extremes of \(0\) and \(1\) sufficiently clearly. In such cases, the marginal effects from a binary GLM are typically approximately constant and roughly equal to the coefficients from the linear probability model.

As an example, consider the following linear model fitted to the Swiss labor participation data. Notice that both the conditional mean and the conditional variance are misspecified in this model.

library("sandwich")
SL2 <- SwissLabor
SL2$participation <- as.numeric(SL2$participation) - 1
swiss_lpm <- lm(participation ~ . + I(age^2), data = SL2)
coeftest(swiss_lpm, vcov = sandwich)
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.66368    0.39725    4.19  3.1e-05
## income      -0.21284    0.03554   -5.99  3.1e-09
## age          0.68254    0.11997    5.69  1.7e-08
## education    0.00666    0.00579    1.15   0.2504
## youngkids   -0.24063    0.03007   -8.00  3.9e-15
## oldkids     -0.04931    0.01743   -2.83   0.0048
## foreignyes   0.24960    0.04017    6.21  8.1e-10
## I(age^2)    -0.09703    0.01452   -6.68  4.2e-11
summary(fitted(swiss_probit))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0104  0.2806  0.4703  0.4607  0.6183  0.9416
summary(fitted(swiss_lpm))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -0.255   0.308   0.475   0.460   0.601   1.027
plot(fitted(swiss_lpm), fitted(swiss_probit))
abline(0, 1, col = 2, lwd = 2)

Figure 4.6: Linear vs. Probit model fitted values
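
As a further check (a minimal sketch, not part of the original comparison), the LPM coefficients of the regressors that enter linearly can be compared with the corresponding average marginal effects (AMEs) from the probit model; they should be of similar magnitude as long as the fitted probabilities stay away from 0 and 1.

# Compare LPM coefficients with probit AMEs for the regressors entering linearly
probit_ame <- summary(margins(swiss_probit))
vars <- c("income", "education", "youngkids", "oldkids", "foreignyes")
cbind(LPM = coef(swiss_lpm)[vars],
  probit_AME = probit_ame$AME[match(vars, probit_ame$factor)])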

modelsummary(list("Linear model"=swiss_lpm, "Probit model"=swiss_probit), 
             vcov=sandwich, fmt=3, estimate="{estimate}{stars}")
                           Linear model   Probit model
(Intercept)                1.664***       3.749**
                           (0.397)        (1.327)
income                     −0.213***      −0.667***
                           (0.036)        (0.127)
age                        0.683***
                           (0.120)
education                  0.007          0.019
                           (0.006)        (0.018)
youngkids                  −0.241***      −0.714***
                           (0.030)        (0.106)
oldkids                    −0.049**       −0.147**
                           (0.017)        (0.052)
foreignyes                 0.250***       0.714***
                           (0.040)        (0.122)
I(age^2)                   −0.097***
                           (0.015)
poly(age, 2, raw = TRUE)1                 2.075***
                                          (0.399)
poly(age, 2, raw = TRUE)2                 −0.294***
                                          (0.050)
Num.Obs.                   872            872
R2                         0.193
R2 Adj.                    0.186
AIC                        1091.3         1033.2
BIC                        1134.2         1071.3
Log.Lik.                   −536.647       −508.577
RMSE                       0.45           0.45
Std.Errors                 Custom         Custom

4.3 Discrete Choice Models

Discrete choice models based on random utility maximization provide a further (economic) motivation for using binary models. They connect binary response models with microeconomic models of choice based on utility maximization subject to constraints. Consider the choice between two alternatives \(A\) and \(B\) (e.g., products/brands) with alternative-specific attributes \(z_{i, A}\) and \(z_{i, B}\) (e.g., price) and individual-specific characteristics \(x_i\) (e.g., age). The linear utility function with random error terms (also called random utility) is:

\[\begin{eqnarray*} u_{i, A} & = & x_i^\top \beta_A ~+~ z_{i, A}^\top \gamma ~+~ \varepsilon_{i, A}, \\ u_{i, B} & = & x_i^\top \beta_B ~+~ z_{i, B}^\top \gamma ~+~ \varepsilon_{i, B}. \end{eqnarray*}\]

Person \(i\) chooses alternative \(A\) (“success”: \(y_i = 1\)) if the corresponding utility is larger, \(u_{i, A} > u_{i, B}\). Thus, the probability of choosing \(A\) is

\[\begin{eqnarray*} \text{Prob}(y_i = 1 ~|~ x_i, z_{i, A}, z_{i, B}) & = & \text{Prob} (u_{i, A} > u_{i, B} ~|~ x_i, z_{i, A}, z_{i, B}) \\ & = & \text{Prob} (x_i^\top (\beta_A - \beta_B) + (z_{i, A} - z_{i, B})^\top \gamma + (\varepsilon_{i, A} - \varepsilon_{i, B}) ~\ge~ 0 ~|~ x_i, z_{i, A}, z_{i, B}) \\ & = & \text{Prob} (- \varepsilon_i ~\le~ x_i^\top \beta ~+~ z_i^\top \gamma ~|~ x_i, z_i) \\ & = & \text{Prob} (- \varepsilon_i ~\le~ {x^*_i}^\top \beta^* ~|~ x^*_i) \\ & = & H({x^*_i}^\top \beta^*), \end{eqnarray*}\]

where we write \(\beta = \beta_A - \beta_B\), \(z_i = z_{i, A} - z_{i, B}\), \(\varepsilon_i = \varepsilon_{i, A} - \varepsilon_{i, B}\), and \(\beta^* = (\beta_A - \beta_B, \gamma)\), \(x^*_i = (x_i, z_{i, A} - z_{i, B})\). Notice that only the difference between the coefficients of the two alternatives, \(\beta = \beta_A - \beta_B\), is identified. Moreover, the alternative-specific parameters \(\gamma\) can only be estimated if the attribute differences \(z_{i, A} - z_{i, B}\) vary across individuals \(i\).

The intercept in this case reflects whether, ceteris paribus, option \(A\) or \(B\) is more attractive. Moreover, if utility is normally distributed, i.e., if \(\varepsilon_A\) and \(\varepsilon_B\) are normally distributed, the difference in errors \(\varepsilon\) is also normal, which leads to the probit model. If, instead, the random component of utility is assumed to follow a type-I extreme value (Gumbel) distribution, the difference \(\varepsilon\) has a logistic distribution, leading to the logit model (see the small simulation below). Due to this motivation, these models are also called random utility models (RUMs).
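
A minimal simulation sketch (not from the original text) illustrating the last point: differences of two independent standard Gumbel errors follow a logistic distribution.

# Draw standard Gumbel (type-I extreme value) errors via the inverse CDF
set.seed(1)
rgumbel <- function(n) -log(-log(runif(n)))
eps <- rgumbel(100000) - rgumbel(100000)
# The difference should be indistinguishable from a standard logistic distribution
ks.test(eps, plogis)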

4.4 Goodness of Fit

Measuring goodness of fit is more controversial for binary data than for numeric data, and there are two main ideas for its assessment. One is to use a measure similar to \(R^2\) in linear regression, i.e., a measure of goodness of fit on the unit interval. One problem with measuring goodness of fit in terms of \(R^2\), even for numeric data, is that it is always difficult to assess what counts as a satisfactory \(R^2\) for a given model. Macroeconomic models, where most input quantities are strongly aggregated, typically achieve much higher \(R^2\) values than are possible for individual choice models, for example. Nevertheless, in a linear regression an \(R^2\) has a clear meaning and interpretation as a variance decomposition. This is not the case for binary models, which is why such measures are called pseudo \(R^2\). A pseudo \(R^2\) measures the goodness of fit of a binary model on the unit interval, where 0 indicates a complete lack of fit and 1 corresponds to a perfect fit. However, intermediate values typically have no simple interpretation, which makes these measures difficult to compare. We first examine some of the most commonly used pseudo \(R^2\) measures.

The McFadden \(R^2\) is constructed as follows: as \(\ell(\beta) \le 0\), one can exploit that \(|\ell(\hat \beta)| \le |\ell(\bar \beta)|\), where \(\bar \beta\) is the estimate from the intercept-only model.

\[\begin{equation*} 0 ~\le~ 1 - \frac{\ell(\hat \beta)}{\ell(\bar \beta)} ~=:~ R_{\mathsf{McFadden}}^2 ~\le~ 1. \end{equation*}\]

This only becomes \(1\) in the limit of perfect prediction. Another pseudo \(R^2\) measure, the McKelvey and Zavoina \(R^2\), employs the explained sum of squares (\(\mathit{SSE}\)) and the residual sum of squares (\(\mathit{SSR}\)) on the scale of the linear predictor.

\[\begin{eqnarray*} R_{\mathsf{MZ}}^2 & = & \frac{\mathit{SSE}}{\mathit{SSR} + \mathit{SSE}}, \\ \mathit{SSE} & = & \sum_{i = 1}^n (\hat \eta_i - \bar{\hat{\eta}})^2, \\ \mathit{SSR} & = & n \sigma^2. \end{eqnarray*}\]

Here the variance \(\sigma^2\) is determined by the link function, i.e., \(\sigma^2 = 1\) for the probit and \(\sigma^2 = \pi^2/3\) for the logit. The Cox and Snell \(R^2\) uses the likelihood instead of the log-likelihood:

\[\begin{equation*} R_{\mathsf{CS}}^2 ~=~ 1 - \left( \frac{L(\bar \beta)}{L(\hat \beta)} \right)^{2/n}. \end{equation*}\]

Lastly, the Nagelkerke (or Cragg and Uhler) \(R^2\) is a normalized version of \(R_{\mathsf{CS}}^2\), whose maximum \(R_{\mathsf{CS,max}}^2 = 1 - L(\bar \beta)^{2/n}\) is smaller than 1:

\[\begin{equation*} R_{\mathsf{N/CU}}^2 ~=~ R_{\mathsf{CS}}^2 / R_{\mathsf{CS,max}}^2. \end{equation*}\]

Many more pseudo \(R^2\) measures exist for binary data, and there is no universally accepted, let alone “best”, measure.
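
As a minimal sketch (using the swiss_probit model fitted above and the formulas just given), the McKelvey and Zavoina, Cox and Snell, and Nagelkerke measures can be computed by hand:

# McKelvey and Zavoina R^2 for the probit model (sigma^2 = 1 on the latent scale)
eta <- predict(swiss_probit, type = "link")
sse <- sum((eta - mean(eta))^2)
sse / (sse + length(eta) * 1)
# Cox and Snell R^2 and its Nagelkerke normalization from the log-likelihoods
swiss_probit0 <- update(swiss_probit, formula = . ~ 1)
ll1 <- as.numeric(logLik(swiss_probit))
ll0 <- as.numeric(logLik(swiss_probit0))
n <- nobs(swiss_probit)
r2_cs <- 1 - exp(2 * (ll0 - ll1) / n)
c(CoxSnell = r2_cs, Nagelkerke = r2_cs / (1 - exp(2 * ll0 / n)))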

The second possibility for measuring the goodness of fit of binary models is to assess the model’s predictive performance, i.e., in-sample or out-of-sample predictions. To illustrate this, let’s compare the logit and probit specifications for the Swiss labor data:

deviance(swiss_logit)
## [1] 1018
deviance(swiss_probit)
## [1] 1017
logLik(swiss_logit)
## 'log Lik.' -508.8 (df=8)
logLik(swiss_probit)
## 'log Lik.' -508.6 (df=8)
swiss_logit0 <- update(swiss_logit, formula = . ~ 1)
1 - as.vector(logLik(swiss_logit)/logLik(swiss_logit0))
## [1] 0.1543
swiss_probit0 <- update(swiss_probit, formula = . ~ 1)
1 - as.vector(logLik(swiss_probit)/logLik(swiss_probit0))
## [1] 0.1546

We can see that both models have virtually identical fits. We therefore create a confusion matrix, i.e., a contingency table of observed vs. predicted values, by choosing some cutoff \(c\). A cutoff of \(c = 0.5\) means that a woman with a fitted probability below 0.5 is predicted not to participate in the labor force, while a woman with a fitted probability above 0.5 is predicted to participate. This yields correct predictions (main diagonal: true negatives and true positives) and incorrect predictions (off-diagonal: false negatives and false positives).

table(true = SwissLabor$participation,
  pred = fitted(swiss_logit) > 0.5)
##      pred
## true  FALSE TRUE
##   no    337  134
##   yes   146  255
table(true = SwissLabor$participation,
  pred = fitted(swiss_probit) > 0.5)
##      pred
## true  FALSE TRUE
##   no    337  134
##   yes   146  255
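
For instance, the accuracy at the cutoff \(c = 0.5\) can be read off the confusion matrix directly (a small sketch): the share of correct predictions is \((337 + 255)/872 \approx 0.68\).

# Accuracy at cutoff 0.5: proportion of correct predictions (diagonal of the confusion matrix)
conf <- table(true = SwissLabor$participation,
  pred = fitted(swiss_logit) > 0.5)
sum(diag(conf)) / sum(conf)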

It is then possible to consider all conceivable cutoffs \(c \in [0, 1]\) and compute various performance measures.

  • Accuracy: \(\mathsf{ACC}(c)\) is the proportion of correct predictions (among all observations).
  • True positive rate: \(\mathsf{TPR}(c)\) is the proportion of correct predictions among observations with \(y_i = 1\).
  • False positive rate: \(\mathsf{FPR}(c)\) is the proportion of incorrect predictions among observations with \(y_i = 0\).
  • Receiver operating characteristic: \(\mathsf{ROC} = \{ (\mathsf{FPR}(c), \mathsf{TPR}(c)) ~|~ c \in [0, 1] \}\). Ideally, this curve should be as close as possible to the upper left corner \((0, 1)\).
  • Area under the curve: An aggregate measure that summarizes the whole ROC curve is the associated area under the curve (AUC).

In R, all these measures can easily be computed and graphed using the ROCR package. First, set up an object with predicted and observed values:

library("ROCR")
pred <- prediction(fitted(swiss_probit),
  SwissLabor$participation)

Then compute and plot accuracy:

plot(performance(pred, "acc"))

Compute and plot False positive rate vs. True positive rate, i.e., the ROC curve:

plot(performance(pred, "tpr", "fpr"))
abline(0, 1, lty = 2)

Figure 4.7: Accuracy at Different Cutoffs and ROC Curve
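
The AUC mentioned above can also be extracted via ROCR (a small sketch, not part of the original code):

# Area under the ROC curve (AUC)
performance(pred, "auc")@y.values[[1]]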

4.5 Perfect Prediction and (Quasi-)Complete Separation

The (log-)likelihood of the binary response model is given by:

\[\begin{eqnarray*} L(\beta) & = & \prod_{i = 1}^n \pi_i^{y_i} ~ (1 - \pi_i)^{1 - y_i} ~=~ \prod_{i = 1}^n H(x_i^\top \beta)^{y_i} ~ (1 - H(x_i^\top \beta))^{1 - y_i} ~\le~ 1, \\ \ell(\beta) & = & \sum_{i = 1}^n \left\{ y_i ~ \log H(x_i^\top \beta) ~+~ (1 - y_i) ~ \log \{ 1 - H(x_i^\top \beta) \} \right\} ~\le~ 0. \end{eqnarray*}\]

The likelihood function is “well-behaved”: it is bounded and can be shown to be globally concave for the logit/probit models. However, the maximum need not be attained at a finite \(\beta\). For example, assume that there is a \((k+1)\)-st regressor that is perfectly correlated with the response, e.g., \(x_{i, k+1} = 0\) if \(y_i = 0\) and \(x_{i, k+1} = 1\) if \(y_i = 1\), i.e., the outcome \(y_i\) is \(0\) whenever the regressor is \(0\), and \(1\) whenever the regressor is \(1\). Using \(\tilde x_i = (x_i, x_{i, k+1})\) and \(\tilde \beta = (\beta, \beta_{k + 1})^\top\):

\[\begin{eqnarray*} \ell(\beta) & = & \sum_{i = 1}^n \left\{ y_i ~ \log H(\tilde x_i^\top \tilde \beta) ~+~ (1 - y_i) ~ \log \{ 1 - H(\tilde x_i^\top \tilde \beta) \} \right\} \\ & = & \sum_{x_{i, k+1} = 1} \log H(x_i^\top \beta ~+~ \beta_{k + 1}) \\ & & ~+~ \sum_{x_{i, k+1} = 0} \left\{ y_i ~ \log H(x_i^\top \beta) ~+~ (1 - y_i) ~ \log \{ 1 - H(x_i^\top \beta) \} \right\}. \end{eqnarray*}\]

While the second term could be maximized as usual, the first term is monotonically increasing in \(\beta_{k+1}\). Thus, for logit/probit, the maximum (\(= 0\)) is only attained in the limit for \(\beta_{k+1} \rightarrow \infty\). More generally, we say the data are quasi-completely separated if there is a \(\beta_0\) such that

\[\begin{eqnarray*} y_i ~=~ 0 & \mbox{ if } & x_i^\top \beta_0 ~\le~ 0, \\ y_i ~=~ 1 & \mbox{ if } & x_i^\top \beta_0 ~\ge~ 0. \end{eqnarray*}\]

The data are completely separated if the inequalities hold strictly. In words, complete separation occurs when (a linear combination of) the regressors perfectly predicts the outcome, while quasi-complete separation occurs when at least one of the two classes is predicted perfectly. A small simulated example of complete separation is sketched below.
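
A minimal simulated sketch (not from the original text): here the single regressor separates the outcome perfectly, glm() warns about fitted probabilities of 0 or 1, and the reported slope is essentially arbitrary, since it diverges as the convergence tolerance is tightened.

# Simulate completely separated data: y is 1 exactly when x > 0
set.seed(42)
x <- c(rnorm(25, mean = -2), rnorm(25, mean = 2))
y <- as.numeric(x > 0)
# glm() warns that fitted probabilities of 0 or 1 occurred; the slope estimate diverges
coef(glm(y ~ x, family = binomial))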

There are two possible remedies for the problem of (quasi-)complete separation. The first option is to analyze the separated groups separately, if the separation is believed to be structural rather than due to chance. The second is to employ a regularized estimator (i.e., with a penalty on \(|\beta|\)). For an example of quasi-complete separation, let’s examine a cross-section data set on executions in different US states in 1950 (Maddala 2001, Table 8.4, p. 330), containing 44 observations on 8 variables.

Table 4.2: Variables in Maddala’s Executions Data Set

Variable     Description
rate         Murder rate per 100,000 (FBI estimate, 1950).
convictions  Number of convictions divided by murders in 1950.
executions   Average number of executions during 1946–1950 divided by convictions in 1950.
time         Median time served (in months) of convicted murderers released in 1951.
income       Median family income in 1949 (in 1,000 USD).
lfp          Labor force participation rate in 1950 (in percent).
noncauc      Proportion of non-Caucasian population in 1950.
southern     Factor indicating region.
data("MurderRates", package = "AER")
murder_logit <- glm(I(executions > 0) ~ time + income + noncauc + lfp + southern, 
                    data = MurderRates, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
coeftest(murder_logit)
## 
## z test of coefficients:
## 
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)   10.9933    20.7734    0.53    0.597
## time           0.0194     0.0104    1.87    0.062
## income        10.6101     5.6541    1.88    0.061
## noncauc       70.9879    36.4118    1.95    0.051
## lfp           -0.6676     0.4767   -1.40    0.161
## southernyes   17.3313  2872.1707    0.01    0.995

Notice that the southern variable has a large coefficient and an extremely large standard error. Even though the variable is clearly relevant, its estimated coefficient and p-value are not meaningful here (for the latter, non-standard inference would be needed).

murder_logit2 <- glm(I(executions > 0) ~ time + income + noncauc + lfp + southern, 
                     data = MurderRates, family = binomial, 
                     control = list(epsilon = 1e-15, maxit = 50, trace = FALSE))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
modelsummary(
   list("Default" = murder_logit, "More iterations" = murder_logit2),
   fmt = 3, estimate = "{estimate}{stars}")
            Default      More iterations
(Intercept) 10.993       10.993
            (20.773)     (2.077e+01)
time        0.019+       0.019+
            (0.010)      (1.000e−02)
income      10.610+      10.610+
            (5.654)      (5.654e+00)
noncauc     70.988+      70.988+
            (36.412)     (3.641e+01)
lfp         −0.668       −0.668
            (0.477)      (4.770e−01)
southernyes 17.331       31.331
            (2872.171)   (1.733e+07)
Num.Obs.    44           44
AIC         981.1        1282.3
BIC         991.8        1293.0
Log.Lik.    −484.536     −635.157
F           1.520
RMSE        0.25         0.25

Comparing the two models, we see that all coefficients and standard errors remain essentially the same, except for the coefficient and standard error of the southernyes variable, which both diverge toward infinity as the estimation is iterated further.

The problem becomes visible once we create a contingency table of executions > 0 vs. southern. In southern states, executions were always nonzero, which means the data are quasi-completely separated with respect to the southern variable: whenever southern takes the value “yes”, the number of executions in that state is nonzero in the sample.

table(MurderRates$executions > 0, MurderRates$southern)
##        
##         no yes
##   FALSE  9   0
##   TRUE  20  15

In practice, when standard errors are unusually large, one should locate the problem, e.g., by changing the number of iterations or by dropping some variables. Be aware that some software packages (including the software Maddala used) do not report any error in this situation, while others give verbose error messages. In R, explicit detection of separation is provided by the safeBinaryRegression package. Regularized estimators (instead of plain ML) are available in the brglm package (bias-reduced GLMs) or via bayesglm() in the arm package (Bayesian estimation). A sketch of a bias-reduced fit is shown below.
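
As a closing sketch (assuming the brglm package is installed; output not shown or verified here), a bias-reduced fit of the same model should yield finite coefficient estimates for all regressors, including southernyes:

# Bias-reduced logit fit (Firth-type penalization) keeps all coefficients finite
library("brglm")
murder_brglm <- brglm(I(executions > 0) ~ time + income + noncauc + lfp + southern,
  data = MurderRates, family = binomial)
coef(murder_brglm)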