Chapter 2 Generalized Linear Models

2.1 From Regression to Probability Models

To analyze microdata, we shift our focus from the conditional expectation (mean) function to the conditional probability function. As discussed in the previous chapter, there are two reasons for this: firstly, for qualitative variables the expected value is often not defined, and secondly, even when the conditional expectation function is defined, specifying the full conditional probability distribution provides additional information. In this chapter, we explain this shift in attention in further detail.

The classical linear regression model can be written as: \[\begin{eqnarray*} y_i & = & x_i^\top \beta ~+~ \varepsilon_i \quad (i = 1, \dots, n), \mbox{or equivalently} \\ y & = & X \beta ~+~ \varepsilon. \end{eqnarray*}\]

The basic assumption for cross-section data is exogeneity: \[\begin{equation*} \text{E}(\varepsilon ~|~ X) ~=~ 0. \end{equation*}\]

This means that the error term is uncorrelated with the explanatory variables. The conditional expectation function is then correctly specified:

\[\begin{equation*} \text{E}(y ~|~ X) ~=~ X \beta \end{equation*}\]

and the ordinary least squares (OLS) estimator for the coefficients \(\beta\) \[\begin{equation*} \hat \beta ~=~ (X^\top X)^{-1} X^\top y \end{equation*}\] is consistent and asymptotically normal (given certain weak regularity assumptions and full column rank of the regressor matrix \(X\)).
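
As a quick illustration in R, the closed-form OLS estimator can be computed directly and compared with lm(); a minimal sketch on the built-in cars data (the object names are illustrative):

X <- model.matrix(~ speed, data = cars)            # regressor matrix with intercept
y <- cars$dist
beta_hat <- solve(crossprod(X), crossprod(X, y))   # (X'X)^{-1} X'y
cbind(by_hand = drop(beta_hat), lm = coef(lm(dist ~ speed, data = cars)))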

The Gauss-Markov assumptions additionally require homoscedasticity and lack of correlation: \[\begin{equation*} \text{Var}(y ~|~ X) ~=~ \sigma^2 I. \end{equation*}\]

If this assumption also holds, the OLS estimator is the best linear unbiased estimator (BLUE). If the homoscedasticity assumption is violated, the OLS estimator is no longer efficient (among the linear estimators) but remains consistent. In this case, valid inference (such as \(t\) and \(F\) tests) requires a suitable variance-covariance matrix estimator, e.g., some kind of sandwich estimator. If, in addition to the first and second moments (the conditional mean and variance) of the distribution of \(y\), the whole distribution is specified, even stronger results can be obtained. If \(y ~|~ X\) is normally distributed

\[\begin{eqnarray*} y ~|~ X & \sim & \mathcal{N}(X \beta, \sigma^2 I) \quad \mbox{or equivalently}\\ y_i ~|~ x_i & \sim & \mathcal{N}(x_i^\top \beta, \sigma^2) \quad \mbox{independently}. \end{eqnarray*}\]

then \(\hat \beta\) is also the maximum likelihood (ML) estimator and is efficient among all estimators. Inference based on \(t\) and \(F\) tests is then exact. In other words, under the complete set of assumptions, the likelihood of \(y\) is fully specified. The linear regression model can then also be viewed as a likelihood model or a probability model. The conditional probability function is then given by

\[\begin{equation*} P(y_i \le y ~|~ x_i) ~=~ \Phi \left( \frac{y - x_i^\top \beta}{\sigma} \right), \end{equation*}\] where \(\Phi(\cdot)\) is the standard normal cumulative distribution function.
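
As an illustration, a fitted Gaussian linear model implies such conditional probabilities directly; a minimal sketch using the built-in cars data (the cutoff of 60 and the regressor value of 15 are arbitrary illustrative choices, as are the object names):

fm_lin <- lm(dist ~ speed, data = cars)
sigma_hat <- summary(fm_lin)$sigma
mu_hat <- predict(fm_lin, newdata = data.frame(speed = 15))
pnorm((60 - mu_hat) / sigma_hat)   # estimated P(dist <= 60 | speed = 15)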

To summarize, linear regression via OLS is already sensible under weak assumptions. In practice, it is often argued that not all Gauss-Markov assumptions need to hold for OLS to be useful: stronger assumptions yield stronger conclusions, and violations of some assumptions (e.g., heteroscedasticity) can be remedied by adjusted inference (e.g., robust standard errors). The crucial assumption for OLS estimation, however, is the correct specification of the conditional expectation (mean); when it is violated, estimates are inconsistent, inference is invalid, and predictions are biased. This crucial assumption fails frequently in practice. Reasons include omitting relevant regressors, endogenous regressors (correlated with the error term), subsamples between which the regression coefficients differ, and a nonlinear relationship between the expectation \(\mu := \text{E} (y ~|~ X)\) and the linear predictor \(\eta := X \beta\). The latter typically occurs when the support of \(\text{E}(y ~|~ X)\) is not the whole real line and/or \(y\) is very “non-normal”.

Luckily, virtually the same logic – except for exact inference in finite samples – can be applied to a broader class of distributions beyond the normal case. In statistics, this framework is known as the generalized linear model (GLM). In econometrics, various special cases of the GLM are popular, but the general framework is less well known.

2.2 Model Frame

Generalized linear model:

The conditional distribution of \(y_i ~|~ x_i\) is given by a density of the type

\[\begin{equation*} f(y; \theta, \phi) ~=~ \exp \left( \frac{y \theta - b(\theta)}{\phi} ~+~ c(y, \phi) \right), \end{equation*}\]

where \(b(\cdot)\) and \(c(\cdot)\) are known functions and \(\theta\) and \(\phi\) are the so-called canonical and dispersion parameter, respectively. For fixed \(\phi\) this is a (linear) exponential family. The conditional expectation function for \(\mu_i = \text{E} (y_i ~|~ x_i)\) is given by a linear predictor \(\eta_i = x_i^\top \beta\) and a monotone transformation \(g(\cdot)\):

\[\begin{equation*} g(\mu_i) ~=~ \eta_i, \end{equation*}\]

where \(g(\cdot)\) is called the link function; it links the mean to a linear combination of the explanatory variables. This implies that the conditional expectation and variance functions are specified as

\[\begin{eqnarray*} \text{E}(y_i ~|~ x_i) & = & \mu_i ~=~ b'(\theta_i), \\ \text{Var}(y_i ~|~ x_i) & = & \phi ~ b''(\theta_i). \end{eqnarray*}\]

Hence, \(V(\mu) = b''(\theta(\mu))\) is called the variance function.

Conversely:

\[\begin{equation*} \theta_i ~=~ (b')^{-1} ( g^{-1}(x_i^\top \beta) ) \end{equation*}\]

The canonical parameter \(\theta_i\) and the linear predictor \(\eta_i\) coincide if

\[\begin{equation*} g ~=~ (b')^{-1} \end{equation*}\]

which is therefore called the canonical link function.

Three important special cases, shown below, are the normal (Gaussian), Bernoulli/binomial, and Poisson distributions.

Normal/Gaussian: \(y \in \mathbb{R}\), \(\mu \in \mathbb{R}\).

\[\begin{eqnarray*} f(y; \mu, \sigma^2) & = & \frac{1}{\sqrt{2 \pi \sigma^2}} ~ \exp \left\{ -\frac{1}{2} ~ \frac{(y ~-~ \mu)^2}{\sigma^2} \right\} \\ & = & \exp \left\{ \left(y \mu ~-~ \frac{\mu^2}{2} \right) \frac{1}{\sigma^2} ~-~ \frac{1}{2} ~ \left(\frac{y^2}{\sigma^2} ~+~ \log(2 \pi \sigma^2) \right) \right\} \\ \theta & = & \mu \\ b(\theta) & = & \frac{\theta^2}{2} \\ V(\mu) & = & 1 \\ \phi & = & \sigma^2 \end{eqnarray*}\]

Bernoulli: \(y \in \{0, 1\}\), \(\mu \in [0, 1]\).

\[\begin{eqnarray*} f(y; \mu) & = & \mu^y ~ (1 - \mu)^{1 - y} \\ & = & \exp \left\{ y \log\left(\frac{\mu}{1 - \mu}\right) ~+~ \log(1 - \mu) \right\} \\ \theta & = & \log \left( \frac{\mu}{1 - \mu} \right) \\ b(\theta) & = & \log(1 + \exp(\theta)) \\ V(\mu) & = & \mu ~ (1 - \mu) \\ \phi & = & 1 \end{eqnarray*}\]

Poisson: \(y \in \mathbb{N}_0\), \(\mu \in \mathbb{R}_+\).

\[\begin{eqnarray*} f(y; \mu) & = & \frac{\mu^y ~ \exp(-\mu)}{y!} \\ & = & \exp \left\{ y ~ \log(\mu) ~-~ \mu ~-~ \log(y!) \right\} \\ \theta & = & \log(\mu) \\ b(\theta) & = & \exp(\theta) \\ V(\mu) & = & \mu \\ \phi & = & 1 \end{eqnarray*}\]
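
In R, these building blocks are encoded in family objects; for example, the canonical links and variance functions above can be evaluated directly:

poisson()$linkfun(2)      # canonical link: log(2) = 0.6931
poisson()$variance(2)     # V(mu) = mu = 2
binomial()$linkfun(0.3)   # log(0.3/0.7) = -0.8473
binomial()$variance(0.3)  # mu (1 - mu) = 0.21
gaussian()$variance(5)    # V(mu) = 1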

Some remarks on generalized linear models:

  1. GLMs with a non-identity link, \(g(x) \neq x\), are linear models for the transformation \(g(\mu_i)\) of \(\mu_i\); the model \(g(\mu_i) = x_i^\top \beta\) is thus a nonlinear model for \(\mu_i\) itself.

  2. In different parts of the literature, models with \(\text{E}(y_i ~|~ x_i) = h(x_i^\top \beta)\) are called single-index models. Typically, the inverse link function \(h(\cdot)\) is then unknown so that the model is semiparametric. Using this terminology, (generalized) linear models are parametric single-index models (due to the known link function).

  3. If the dispersion parameter \(\phi\) is unknown, the distribution may (but does not have to) be a two-parameter exponential family.

2.3 Maximum Likelihood Estimation

Due to the fully specified likelihood, we can estimate \(\beta\) by maximum likelihood. The likelihood depends on \(\beta\) only through \(\theta\), so the dispersion parameter \(\phi\) can be treated as a nuisance parameter (or is known anyway). In general, there is no analytic solution for the ML estimator in GLMs, except for special cases such as the normal linear model, so a numerical solution is required. The algorithm typically used is iterative weighted least squares (IWLS), a version of Fisher scoring adapted to GLMs. The likelihood could be maximized directly, but working with it is inconvenient: the product of many small density values quickly becomes numerically intractable, and sums are easier to differentiate than products. For these reasons, taking logs is sensible, and the likelihood is therefore usually stated as the log-likelihood.
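
Before writing down the likelihood formally, here is a bare-bones sketch of the IWLS iteration for a Poisson GLM with log link on simulated data (all object names are illustrative; glm() itself uses a more careful implementation with convergence checks and a QR-based weighted least squares step):

set.seed(1)
x <- runif(200)
y <- rpois(200, exp(0.5 + 1.5 * x))
X <- cbind("(Intercept)" = 1, x = x)

mu  <- y + 0.1                        # crude starting values, similar to glm()'s initialization
eta <- log(mu)
for (iter in 1:25) {
  z <- eta + (y - mu) / mu            # working response
  w <- mu                             # working weights: 1 / (V(mu) * g'(mu)^2)
  beta <- solve(crossprod(X, w * X), crossprod(X, w * z))
  eta <- drop(X %*% beta)             # update linear predictor and mean
  mu  <- exp(eta)
}
cbind(IWLS = drop(beta), glm = coef(glm(y ~ x, family = poisson)))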

For \(n\) independent observations \(y_i\):

\[ \begin{array}{rclcl} L(\theta) & = & L(\theta; y) & = & \displaystyle \prod_{i = 1}^n f(y_i; \theta) \\[0.6cm] \ell(\theta) & = & \log L(\theta) & = & \displaystyle \sum_{i = 1}^n \log f(y_i; \theta). \end{array} \]

The maximum likelihood estimator is defined by \[\begin{eqnarray*} \hat \theta_\mathsf{ML} & = & \underset{\theta}{argmax} L(\theta) \\ & = & \underset{\theta}{argmax} \ell (\theta). \end{eqnarray*}\]

In words, the parameter that maximizes the likelihood also maximizes the log-likelihood. Inference for models estimated by ML works as usual: via Wald tests, which are typically reported in coefficient tables, score tests (also known as Lagrange multiplier, LM, tests), or likelihood ratio (LR) tests, the latter typically being used for nested model comparisons.

Likelihood ratio statistic: \[\begin{equation*} -2 ~ \{ \ell(\tilde \theta) ~-~ \ell(\hat \theta) \}, \end{equation*}\]

which can also be expressed as a difference of so-called deviances. Deviance is the term used in the GLM literature for a (transformation of the) likelihood ratio statistic. It employs \(\ell(y; y)\), the log-likelihood of the saturated model, in which the number of parameters typically equals the number of observations.

\[\begin{eqnarray*} \mbox{Scaled deviance:} & & D^*(\theta) ~=~ -2 ~ \{ \ell(\theta) - \ell(y) \}. \\ \mbox{Deviance:} & & D(\theta) ~=~ \phi ~ D^*(\theta). \end{eqnarray*}\]

The saturated model gives the best possible fit within the given model class, and the deviance captures the deviation of a fitted model from this ideal model.

Normal case: Deviance is residual sum of squares (RSS). \[\begin{eqnarray*} D^*(\mu, \sigma^2) & = & -2 ~ \{ \ell(\mu, \sigma^2) - \ell(y, \sigma^2) \} \\ & = & -2 \sum_{i = 1}^n -\frac{1}{2 \sigma^2} (y_i - \mu_i)^2 \\ & = & \frac{1}{\sigma^2} \sum_{i = 1}^n (y_i - \mu_i)^2 \\ D(\mu) & = & \phi ~ D^*(\mu, \sigma^2) \\ & = & \sum_{i = 1}^n (y_i - \mu_i)^2 \end{eqnarray*}\]

As we can see, estimation in GLMs is by “minimum deviance”, a generalization of “least squares” in linear regression models. Furthermore, deviance is additive for hierarchical models, and hence the generalization of ANOVA is called “analysis of deviance”.
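
As a quick check in R (using the built-in cars data and glm() with its default gaussian family):

fm_g <- glm(dist ~ speed, data = cars)                # gaussian family, identity link
all.equal(deviance(fm_g), sum(residuals(fm_g)^2))     # deviance equals the RSS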

Table 2.1: Special Cases of Deviance
Family Deviance
Normal \(\sum (y - \hat \mu)^2\)
Binomial \(2 \sum \{ y \log (y/\hat \mu) + (1 - y) \log ((1-y)/(1-\hat \mu)) \}\)
Poisson \(2 \sum \{ y \log (y/\hat \mu) - (y - \hat \mu) \}\)
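
For instance, the Poisson formula from Table 2.1 can be verified against deviance() on a small simulated example (the data and object names are purely illustrative; the convention \(0 \log 0 = 0\) is used for zero counts):

set.seed(2)
d_count <- data.frame(x = rnorm(100))
d_count$y <- rpois(100, exp(1 + 0.3 * d_count$x))
fit_pois <- glm(y ~ x, family = poisson, data = d_count)
mu_hat <- fitted(fit_pois)
dev_by_hand <- 2 * sum(ifelse(d_count$y > 0, d_count$y * log(d_count$y / mu_hat), 0) -
  (d_count$y - mu_hat))
all.equal(dev_by_hand, deviance(fit_pois))   # TRUE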

If a GLM is correctly specified, it inherits all properties of correctly specified ML estimation: consistency, asymptotic normality, and efficiency. Only exact normality of the coefficient estimates does not hold in general; it holds only in the special case of a linear model with Gaussian response. If the full likelihood is misspecified but the conditional expectation and variance functions are correct (and observations are independent), efficiency is in general lost, but inference remains valid. If only the conditional expectation function is correctly specified, the estimator is still consistent, but inference needs to be adjusted. Note, however, that a misspecification of the likelihood beyond the mean and variance is not possible for all response variables: for binary responses, for example, a misspecification of the variance function always implies a misspecification of the mean function as well.

In other words, there are three levels of specification: (1) the conditional mean function, which if correctly specified yields consistent (though not necessarily efficient) estimation; (2) the conditional variance function, which if additionally correct makes standard inference approximately (asymptotically) valid; and (3) the full probability distribution, which if additionally correct makes maximum likelihood estimation efficient.
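
For example, when count data are overdispersed, a Poisson GLM has a correctly specified mean but a misspecified variance, and inference can be adjusted with a sandwich covariance. A minimal sketch, assuming the sandwich and lmtest packages are installed (the simulated data and object names are illustrative):

library("sandwich")
library("lmtest")
set.seed(3)
d_od <- data.frame(x = rnorm(200))
d_od$y <- rnbinom(200, mu = exp(1 + 0.5 * d_od$x), size = 2)   # overdispersed counts
fit_od <- glm(y ~ x, family = poisson, data = d_od)            # mean correct, variance too small
coeftest(fit_od, vcov = sandwich)                              # robust (sandwich) standard errors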

2.4 Residuals

In the linear regression model, raw residuals are used:

\[\begin{equation*} r_i ~=~ y_i - \hat \mu_i \end{equation*}\]

which is sensible in the case of homoscedasticity. However, in GLMs heteroscedasticity is typically present, so a weighted version is used instead:

\[\begin{equation*} r_{P, i} ~=~ \frac{y_i - \hat \mu_i}{\sqrt{V(\hat \mu_i)}}, \end{equation*}\] These weighted residuals are called Pearson residuals in the GLM literature; in econometrics they are also referred to as standardized residuals. Alternatively, differences can be computed on a square-root likelihood scale:

\[\begin{equation*} r_{D, i} ~=~ \mathsf{sign}(y_i - \hat \mu_i) ~ \sqrt{\phi ~ 2 ~ \{ \ell(y_i; y_i) - \ell(\hat \mu_i; y_i) \}}, \end{equation*}\]

which are called deviance residuals. Deviance residuals are less well-known in econometrics. In nonlinear models the “correct” definition of residuals is not straightforward, hence there are many different definitions. Many have the property that their sum of squares yields an intuitive goodness-of-fit measure, e.g., the Pearson \(X^2\) statistic or the deviance.

\[\begin{eqnarray*} \sum_{i = 1}^n r_{P, i}^2 & = & X^2 ~:=~ \sum_{i = 1}^n \frac{(y_i - \hat \mu_i)^2}{V(\hat \mu_i)} \\ \sum_{i = 1}^n r_{D, i}^2 & = & D(\hat \mu) \end{eqnarray*}\]
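
Both identities are easy to verify in R; a small sketch using the built-in warpbreaks data (the choice of data and object names is purely illustrative):

fit_wb <- glm(breaks ~ wool + tension, family = poisson, data = warpbreaks)
c(pearson  = sum(residuals(fit_wb, type = "pearson")^2),    # Pearson X^2 statistic
  deviance = sum(residuals(fit_wb, type = "deviance")^2),   # equals deviance(fit_wb)
  deviance_extractor = deviance(fit_wb))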

If the dispersion parameter \(\phi\) is unknown, it is typically estimated from these generalized residual sums of squares. For example, the Pearson-based estimator

\[\begin{equation*} \hat \phi ~=~ \frac{1}{n - k} ~ \sum_{i = 1}^n r_{P, i}^2 \end{equation*}\] is often used, where \(k\) is the number of estimated regression coefficients. Sometimes deviance residuals are employed instead of Pearson residuals, or \(\hat \phi\) is estimated by maximum likelihood.
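
This Pearson-based estimate is, for example, what summary() reports as the dispersion of a quasi-Poisson fit (again a sketch with warpbreaks, purely for illustration):

fit_qp <- glm(breaks ~ wool + tension, family = quasipoisson, data = warpbreaks)
c(by_hand = sum(residuals(fit_qp, type = "pearson")^2) / df.residual(fit_qp),
  summary = summary(fit_qp)$dispersion)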

2.5 Implementation in R

The parallels between the linear regression model and GLMs are reflected in R in the similarity of the lm() and glm() interfaces:

glm(formula, data, 
  family = gaussian(link = "identity"),
  weights, subset, na.action, ...)

The arguments of the glm() function are:

  • formula: Symbolic description of the model.
  • data: Data set containing the variables from the formula.
  • family: Specification of error distribution and link function.
  • weights, subset, na.action: Fine control for selection of observations and setup of weights.
  • ...: Further arguments, e.g., control parameters for the fitting algorithm, starting values, etc.
Table 2.2: Possible Values of the ‘formula’ Argument
Formula Description
y ~ a + x Model without interaction: identical slopes with respect to x but different intercepts with respect to a.
y ~ a * x (equivalent to y ~ a + x + a:x) Model with interaction: the term a:x gives the difference in slopes compared with the reference category.
y ~ a / x (equivalent to y ~ a + x %in% a) Model with interaction: produces the same fitted values as the model above but uses a nested coefficient coding in which an explicit slope estimate is computed for each category in a.
y ~ (a + b + c)^2 (equivalent to y ~ a*b*c - a:b:c) Model with all two-way interactions (excluding the three-way interaction).
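
The equivalences in Table 2.2 are easy to verify on toy data; a brief sketch (using lm() for simplicity, since the formula handling is identical for glm(); the data and object names are illustrative):

set.seed(4)
d_toy <- data.frame(a = factor(rep(c("g1", "g2"), each = 25)), x = rnorm(50))
d_toy$y <- rnorm(50, mean = ifelse(d_toy$a == "g1", 1 + 2 * d_toy$x, 3 - d_toy$x))
m_int  <- lm(y ~ a * x, data = d_toy)      # slope for reference level plus difference a:x
m_nest <- lm(y ~ a / x, data = d_toy)      # one explicit slope for each level of a
all.equal(fitted(m_int), fitted(m_nest))   # TRUE: same fitted values
coef(m_int)
coef(m_nest)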

Suitable family functions are available for various distributions, including:

Table 2.3: Family Functions for Different Distributions
Family Canonical link Name
gaussian() \(\mu\) \(\texttt{"identity"}\)
binomial() \(\log(\mu/(1-\mu))\) \(\texttt{"logit"}\)
poisson() \(\log(\mu)\) \(\texttt{"log"}\)
Gamma() \(- 1 / \mu\) \(\texttt{"inverse"}\)
inverse.gaussian() \(-2 / \mu^2\) \(\texttt{"1/mu^2"}\)

Furthermore, quasi() allows specification of various link functions \(g(\mu)\) and variance functions \(V(\mu)\). The return value of the glm() function is an object of class "glm", inheriting from "lm". Thus, all the usual extractor functions are available:

Table 2.4: Extractor Functions for “glm” Objects
Function Description
print() simple printed display
summary() standard regression output
coef() (or coefficients()) extract regression coefficients \(\hat \beta\)
residuals() (or resid()) extract various types of residuals
fitted() (or fitted.values()) extract fitted values \(\hat \mu\)
anova() analysis of deviance comparison of nested models
predict() various types of predictions for new data
plot() diagnostic plots
confint() confidence intervals for the regression coefficients
deviance() deviance \(D(\hat \mu)\)
vcov() (estimated) variance-covariance matrix
logLik() log-likelihood \(\ell(\hat \mu)\)
AIC() information criteria including AIC, BIC/SBC
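
Before turning to a full example, here is a brief sketch of the quasi() family mentioned above: with link "log" and variance "mu" it reproduces a Poisson-type mean and variance specification, but with a freely estimated dispersion parameter (warpbreaks is used purely for illustration):

fit_q <- glm(breaks ~ wool + tension, data = warpbreaks,
  family = quasi(link = "log", variance = "mu"))
summary(fit_q)$dispersion   # estimated dispersion instead of being fixed at 1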

As an example, we fit a logit model (i.e., a binomial GLM with logit link) for the labor force participation (yes/no) of \(n = 872\) women in Switzerland in 1981 (we will examine this example in more detail in Chapter 4). Available covariates are age, nonlabor income, education, number of younger/older kids, and nationality (a binary variable, foreign, indicating Swiss/non-Swiss).

Load the data and fit the GLM (~ . regresses on all available covariates; squared age is added via I(age^2), where I() insulates the expression so that ^ is interpreted arithmetically rather than as a formula operator):

data("SwissLabor", package = "AER")
fm <- glm(participation ~ . + I(age^2), family = binomial,
  data = SwissLabor)
summary(fm)
## 
## Call:
## glm(formula = participation ~ . + I(age^2), family = binomial, 
##     data = SwissLabor)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   6.1964     2.3831    2.60   0.0093
## income       -1.1041     0.2257   -4.89  1.0e-06
## age           3.4366     0.6879    5.00  5.9e-07
## education     0.0327     0.0300    1.09   0.2761
## youngkids    -1.1857     0.1720   -6.89  5.5e-12
## oldkids      -0.2409     0.0845   -2.85   0.0043
## foreignyes    1.1683     0.2038    5.73  9.9e-09
## I(age^2)     -0.4876     0.0852   -5.72  1.0e-08
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1203.2  on 871  degrees of freedom
## Residual deviance: 1017.6  on 864  degrees of freedom
## AIC: 1034
## 
## Number of Fisher Scoring iterations: 4

Test the joint significance of all regressors by comparing with the intercept-only model:

fm0 <- glm(participation ~ 1, family = binomial,
  data = SwissLabor)
anova(fm, fm0, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: participation ~ income + age + education + youngkids + oldkids + 
##     foreign + I(age^2)
## Model 2: participation ~ 1
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       864       1018                     
## 2       871       1203 -7     -186   <2e-16
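
The likelihood ratio statistic reported above can equivalently be computed by hand as the difference in deviances of the two fitted models:

lr_stat <- deviance(fm0) - deviance(fm)
df_diff <- df.residual(fm0) - df.residual(fm)
c(statistic = lr_stat, p.value = pchisq(lr_stat, df = df_diff, lower.tail = FALSE))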

Finally, compare the AICs of the two models:

AIC(fm, fm0)
##     df  AIC
## fm   8 1034
## fm0  1 1205