# Chapter 9 Duration Models

## 9.1 Introduction

Examples: Typical economic examples for duration responses.

• Macroeconomics.
$$t_i$$: Duration of strike $$i$$.
Potential covariates: Macroeconomic indicators, …
• Labor economics
$$t_i$$: Duration of unemployment of person $$i$$.
Potential covariates: Education, age, gender, …
• Marketing.
$$t_i$$: Duration between two purchases of person $$i$$.
Potential covariates: Price, price of alternative products, …
• Demography/population economics.
$$t_i$$: Age of woman $$i$$ at birth of first child.
Potential covariates: Education, income, …

Remarks:

• Duration: Non-negative, continuous random variable.
• In practice: Discrete durations in weeks/months are often treated like continuous variables.
• A lot of terminology stems from epidemiology/biostatistics (survival analysis), and technical statistics (analysis of failure time data).
• Related to count data.
• Count data: Often number of events in a time interval.
• Duration data: Often time between subsequent events.
• Censoring: Durations are often only observed up to a certain maximal time interval. Thus, observations for which the event of interest has not occurred in the time interval are right-censored.
• In social sciences: Event history analysis typically encompasses analysis of count data and duration data.

## 9.2 Basic Concepts

Duration: The random variable $$T$$ measures duration until event, also known as survival time or spell length.

• $$T > 0$$,
• $$T$$ has CDF $$F(t) =\text{P}(T \le t)$$,
• $$T$$ has PDF $$f(t) = F'(t)$$.

Survivor function: $$S(t)$$ is the proportion of events that has not yet occurred at time $$t$$.

$\begin{equation*} S(t) ~=~ \text{P}(T \ge t) ~=~ 1 ~-~ F(t). \end{equation*}$

Hazard function: $$h(t)$$ measures the risk of an event at time $$t$$ given survival/duration up to time $$t$$.

$\begin{equation*} h(t) ~=~ \lim_{\delta \rightarrow 0} \frac{P(t \le T < t + \delta ~|~ T \ge t)}{\delta}. \end{equation*}$

Also known as hazard rate, failure rate, or exit rate.

Cumulative hazard function:

$\begin{equation*} H(t) ~=~ \int_0^t h(u) ~ du. \end{equation*}$

Remark: $$f(t)$$, $$S(t)$$, $$h(t)$$, and $$H(t)$$ are equivalent ways of defining or characterizing a specific duration pattern uniquely.

Relationships:

$\begin{eqnarray*} h(t) & = & \frac{f(t)}{S(t)}, \\[0.3cm] f(t) & = & F'(t) ~=~ -S'(t), \\[0.3cm] h(t) & = & - \frac{S'(t)}{S(t)} ~=~ - \frac{\partial \log S(t)}{\partial t}, \\[0.3cm] H(t) & = & - \log S(t), \\[0.3cm] S(t) & = & \exp (-H(t)). \end{eqnarray*}$

Kaplan-Meier estimator: Nonparametric estimator of survivor function. Also called product limit estimator.

Idea: Assume $$n$$ observations of survival times (durations) without censorings occurring at $$p$$ distinct times

$\begin{equation*} t_{(1)} ~<~ \dots ~ < t_{(p)} \end{equation*}$

and let $$d_i$$ be the number of deaths (events/exits/…) at $$t_{(i)}$$. Then

$\begin{eqnarray*} \hat S(t) & = & 1 - \hat F(t) ~=~ \frac{n - \sum_{j = 1}^s d_j}{n} \qquad (t_{(s)} \le t < t_{(s + 1)}) \\ & = & \frac{n - d_1}{n} \cdot \frac{n - d_1 - d_2}{n - d_1} \ldots \frac{n - d_1 - \ldots - d_s}{n - d_1 - \ldots - d_{s - 1}} \\ & = & \left( 1 - \frac{d_1}{r_1} \right) \ldots \left( 1 - \frac{d_s}{r_s} \right) ~=~ \prod_{j = 1}^s \left( 1 - \frac{d_j}{r_j} \right). \end{eqnarray*}$

Number at risk: $$r_i$$, i.e., alive, just before $$t_{(i)}$$.

Without censoring: $$r_{i+1} = r_i - d_i$$. Number alive at $$t_{(i+1)}$$ are those from $$t_{(i)}$$ without the deaths from $$t_{(i)}$$.

With censoring: $$r_{i+1} = r_i - d_i - c_{i + 1}$$. Additionally, remove censorings in the interval ($$t_{(i-1)}, t_{(i)}$$).

Alternatively: Employ $$r_{i+1} = r_i - d_i - c_{i + 1}/2$$ as a continuity-corrected version. Also known as lifetable method.

Remark: If there are censorings after the last event $$\hat S(t) > 0$$ for all $$t$$.

In R: Methods for survival analysis in package survival.

Basic utility:

• Declare survival times as such using Surv(), encompassing information about censoring (if any).
• Surv(time, time2, event, type, origin = 0) can declare left-, right-, and interval-censored data.
• Basic usage for right-censored data has an indicator event where 0 corresponds to “alive”, and 1 to “dead” (or “with event”).

Example: Age at first birth (AFB) in US General Social Survey 2002. AFB is only known for women that have children. For women without children (but within child-bearing age) AFB is censored at their current age.

data("GSS7402", package = "AER")
gss2002 <- subset(GSS7402,
year == 2002 & (agefirstbirth < 40 | age < 40))

Compute censored AFB and declare censoring:

gss2002$afb <- with(gss2002, ifelse(kids > 0, agefirstbirth, age)) gss2002$afb <- with(gss2002, Surv(afb, kids > 0))

Observations now contain censoring information:

head(gss2002$afb) ## [1] 25+ 19 27 22 29 19+ Compute Kaplan-Meier estimator via survfit(): afb_km <- survfit(afb ~ 1, data = gss2002) afb_skm <- summary(afb_km) print(afb_skm) ## Call: survfit(formula = afb ~ 1, data = gss2002) ## ## time n.risk n.event survival std.err lower 95% CI upper 95% CI ## 11 1371 1 0.9993 0.000729 0.9978 1.0000 ## 13 1370 3 0.9971 0.001457 0.9942 0.9999 ## 14 1367 2 0.9956 0.001783 0.9921 0.9991 ## 15 1365 14 0.9854 0.003238 0.9791 0.9918 ## 16 1351 40 0.9562 0.005525 0.9455 0.9671 ## 17 1311 66 0.9081 0.007802 0.8929 0.9235 ## 18 1245 97 0.8373 0.009967 0.8180 0.8571 ## 19 1148 106 0.7600 0.011534 0.7378 0.7830 ## 20 1030 122 0.6700 0.012726 0.6455 0.6954 ## 21 898 123 0.5782 0.013406 0.5525 0.6051 ## 22 767 97 0.5051 0.013612 0.4791 0.5325 ## 23 654 72 0.4495 0.013600 0.4236 0.4770 ## 24 563 61 0.4008 0.013480 0.3752 0.4281 ## 25 484 60 0.3511 0.013248 0.3261 0.3781 ## 26 407 45 0.3123 0.012986 0.2878 0.3388 ## 27 351 45 0.2723 0.012618 0.2486 0.2981 ## 28 294 37 0.2380 0.012223 0.2152 0.2632 ## 29 246 32 0.2070 0.011795 0.1852 0.2315 ## 30 209 38 0.1694 0.011119 0.1489 0.1926 ## 31 163 18 0.1507 0.010730 0.1311 0.1733 ## 32 135 19 0.1295 0.010264 0.1108 0.1512 ## 33 108 17 0.1091 0.009766 0.0915 0.1300 ## 34 83 7 0.0999 0.009542 0.0828 0.1205 ## 35 65 13 0.0799 0.009101 0.0639 0.0999 ## 36 45 7 0.0675 0.008815 0.0522 0.0872 ## 37 29 7 0.0512 0.008572 0.0369 0.0711 ## 38 17 3 0.0422 0.008499 0.0284 0.0626 ## 39 12 2 0.0351 0.008411 0.0220 0.0562 plot(afb_km, conf.int = FALSE) with(afb_skm, plot(n.event/n.risk ~ time, type = "s")) Remarks: • $$\hat h(t) = d_s / r_s$$ for $$t_{(s)} \le t < t_{(s + 1)}$$ is a less reliable estimator for the hazard function. Grouping or smoothing (similar to histograms or kernel densities) would be necessary but rare in practice. • Parametric approach: Estimate distribution parameters (and thus the associated survivor and hazard functions). • Moreover: Regression models are required that capture how survivor and hazard functions change with covariates. • Simple groupings can be incorporated nonparametrically in survfit(y ~ group). Example: For data exploration, one could consider: gss2002$edcat <- cut(gss2002$education, c(0, 12, 14, 20)) afb_km2 <- survfit(afb ~ edcat, data = gss2002) plot(afb_km2, col = c(1, 2, 4)) Observations: Survivor functions for different groups look very similar. This conveys that hazards are similar up to a factor. This is often exploited in modeling and inference. Formally: The assumption of proportional hazards for two groups entails $\begin{eqnarray*} h_2(t) & = & c \cdot h_1(t), \qquad \mbox{(or equivalently)}\\ S_2(t) & = & S_1(t)^c. \end{eqnarray*}$ Implication: Survivor functions cannot cross. Remark: Violations in practice are more serious for lower death/exit times (which are based on more observations). ## 9.3 Continuous Time Duration Models Basic model: Constant hazard rate $$h(t) = \lambda$$ (independent of duration $$t$$) is the simplest conceivable model for duration data. Formally: Exponential distribution with hazard rate $$\lambda > 0$$. $\begin{eqnarray*} f(t) & = & \lambda ~ \exp (- \lambda ~ t),\\ S(t) & = & \exp (- \lambda ~ t), \\ h(t) & = & \lambda. \end{eqnarray*}$ Remarks: • Constant hazard rate is also called lack of memory. • In practice nonconstant hazards (depending on $$t$$) are relevant. In economics called duration dependence. In R: d/p/q/rexp with parameter $$\mathtt{rate} = \lambda$$. Generalization: Weibull distribution with additional shape parameter $$\alpha > 0$$. $\begin{eqnarray*} f(t) & = & \lambda ~ \alpha ~ t^{\alpha - 1} ~ \exp (- \lambda ~ t^\alpha),\\ S(t) & = & \exp (- \lambda ~ t^\alpha), \\ h(t) & = & \lambda ~ \alpha ~ t^{\alpha - 1}. \end{eqnarray*}$ Properties: The hazard function is always monotonic, for • $$\alpha > 1$$ increasing, • $$\alpha = 1$$ constant (exponential distribution), • $$\alpha < 1$$ decreasing. In R: d/p/q/rweibull with parameters $$\mathtt{scale} = \lambda^{-1/\alpha}$$ and $$\mathtt{shape} = \alpha$$. (Further parametrizations exist in the literature.) Example: For $$\lambda = 1$$ and $$\alpha = 1/4, 1, 4$$. Furthermore: Distributions with potentially non-monotonic hazard functions. • Lognormal distribution: $$\log(t) \sim \mathcal{N}(\mu, \sigma^2)$$. • Loglogistic distribution. ## 9.4 Proportional Hazard Models Idea: Assume proportional hazards for observations $$i = 1, \ldots, n$$. $\begin{equation*} h_i(t) ~=~ \lambda_i \cdot h_0(t), \end{equation*}$ where $$h_0(t)$$ is called baseline hazard. Regression: The factor $$\lambda_i$$ is allowed to depend on a set of regressors $$x_i$$, typically using a linear predictor and a log link. $\begin{equation*} \log(\lambda_i) ~=~ x_i^\top \beta. \end{equation*}$ Thus: Baseline hazard corresponds to observation with $$x = 0$$. Parameter estimation: Two approaches for estimation of $$\beta$$. 1. parametric: $$h_0(t, \alpha)$$ parametric function, 2. semi-parametric: leave $$h_0(t)$$ unspecified. Parametric regression models: Most common models are • Exponential: $$h_0(t) = 1$$. • Weibull: $$h_0(t) = \alpha ~ t^{\alpha - 1}$$. Inference: Via maximum likelihood, e.g., for exponential case. $\begin{eqnarray*} L(\beta) & = & \prod_{i = 1}^n \left\{\lambda_i \cdot \exp(- \lambda_i t_i)\right\}^{\delta_i} \exp(- \lambda_i t_i)^{1 - \delta_i} \\ & = & \prod_{i = 1}^n \lambda_i^{\delta_i} \exp(- \lambda_i t_i) \\ \log L(\beta) & = & \sum_{i = 1}^n \delta_i \cdot x_i^\top \beta - \sum_{i = 1}^n \exp(x_i^\top \beta) \cdot t_i, \end{eqnarray*}$ where $$\delta_i = 1$$ indicates an event and $$\delta_i = 0$$ censoring. In R: survreg() from package survival. survreg(formula, data, subset, na.action, dist, ...) with • formula: Response must be a "Surv" object from Surv(). • dist: Distribution (implying link and variance functions etc.). Available distributions include "weibull" (default), "exponential", "logistic", "gaussian", "lognormal", "loglogistic". • Similar in spirit to GLMs. However, even without censoring not always special cases of exponential families. • Slightly different parametrization: Reports estimates for coefficients $$\tilde \beta$$ and “scale” parameter $$\sigma$$. • Corresponds to $$\beta = - \tilde \beta/\sigma$$ in previous notation. • For Weibull: $$\alpha = 1/\sigma$$. Semiparametric regression: Cox proportional hazards model. Idea: Maximum likelihood requires specification of the hazard, hence construct the conditional likelihood. Instead of $$\text{P}(T_i = t_i)$$ use $\begin{equation*} \text{P}(T_i = t_i ~|~ \mbox{one event at } t_i). \end{equation*}$ This leads to $\begin{equation*} \frac{h_0(t_i) \exp(x_i^\top \beta)}{\sum_{j: t_j \ge t_i} h_0(t_i) \exp(x_j^\top \beta)} ~=~ \frac{\exp(x_i^\top \beta)}{\sum_{j: t_j \ge t_i} \exp(x_j^\top \beta)} \end{equation*}$ and thus, the conditional likelihood is $\begin{equation*} L(\beta) ~=~ \prod_{i = 1}^n \left(\frac{\exp(x_i^\top \beta)}{\sum_{j: t_j \ge t_i} \exp(x_j^\top \beta)}\right)^{\delta_i}. \end{equation*}$ If there are censorings this is called {partial likelihood}. In R: coxph() from survival. Inference: For both parametric and semi-parametric models asymptotic normality can be shown. Hence, the usual inference (LR/Wald/score test, confidence intervals, information criteria) can be performed. Interpretation: Similar to semi-logarithmic models. For observations $$x_1$$ and $$x_2$$ $\begin{equation*} \frac{h_1(t)}{h_2(t)} ~=~ \exp( (x_1 - x_2)^\top \beta), \end{equation*}$ independent of $$t$$. Example: Determinants of AFB. Employ model formula afb_f <- afb ~ education + siblings + ethnicity + immigrant + lowincome16 + city16 Of increased interest: influence of education. afb_ex <- survreg(afb_f, data = gss2002, dist = "exponential") afb_wb <- survreg(afb_f, data = gss2002, dist = "weibull") modelsummary(list("AFB exponential" = afb_ex, "AFB Weibull" = afb_wb), fmt=3, estimate="{estimate}{stars}") AFB exponential AFB Weibull (Intercept) 2.707*** 2.890*** (0.170) (0.037) education 0.047*** 0.026*** (0.011) (0.002) siblings −0.015 −0.007** (0.010) (0.002) ethnicitycauc 0.046 0.063*** (0.072) (0.016) immigrantyes 0.094 0.034 (0.093) (0.021) lowincome16yes −0.027 0.007 (0.074) (0.017) city16yes 0.037 0.026+ (0.061) (0.014) Log(scale) −1.487*** (0.021) Num.Obs. 1371 1371 AIC 9971.0 7688.7 BIC 10007.6 7730.5 RMSE 7.21 5.68 Comparison: Exponential and Weibull model can be compared using the standard likelihood methods. lrtest(afb_ex, afb_wb) ## Likelihood ratio test ## ## Model 1: afb ~ education + siblings + ethnicity + immigrant + lowincome16 + ## city16 ## Model 2: afb ~ education + siblings + ethnicity + immigrant + lowincome16 + ## city16 ## #Df LogLik Df Chisq Pr(>Chisq) ## 1 7 -4979 ## 2 8 -3836 1 2284 <2e-16 AIC(afb_ex, afb_wb) ## df AIC ## afb_ex 7 9971 ## afb_wb 8 7689 afb_cph <- coxph(afb_f, data = gss2002) coeftest(afb_cph) ## ## z test of coefficients: ## ## Estimate Std. Error z value Pr(>|z|) ## education -0.1138 0.0103 -11.04 < 2e-16 ## siblings 0.0296 0.0103 2.86 0.0042 ## ethnicitycauc -0.2907 0.0730 -3.98 6.8e-05 ## immigrantyes -0.2011 0.0928 -2.17 0.0303 ## lowincome16yes 0.0304 0.0742 0.41 0.6820 ## city16yes -0.0689 0.0607 -1.13 0.2564 Comparison: Parameter estimates. cf <- cbind( "Exponential" = -coef(afb_ex), "Weibull" = -coef(afb_wb)/afb_wb$scale,
"Cox-PH" = c(NA, coef(afb_cph))
)
cf
##                Exponential  Weibull   Cox-PH
## (Intercept)       -2.70684 -12.7815       NA
## education         -0.04679  -0.1133 -0.11380
## siblings           0.01549   0.0324  0.02961
## ethnicitycauc     -0.04560  -0.2800 -0.29069
## immigrantyes      -0.09401  -0.1517 -0.20108
## lowincome16yes     0.02725  -0.0310  0.03039
## city16yes         -0.03722  -0.1139 -0.06887

Hazard proportions:

exp(cf[-1,]) - 1
##                Exponential  Weibull   Cox-PH
## education         -0.04571 -0.10715 -0.10756
## siblings           0.01561  0.03293  0.03006
## ethnicitycauc     -0.04458 -0.24419 -0.25225
## immigrantyes      -0.08973 -0.14072 -0.18215
## lowincome16yes     0.02762 -0.03053  0.03086
## city16yes         -0.03654 -0.10765 -0.06655

Compare survivor functions at $$\bar x$$:

xbar <- colMeans(model.matrix(afb_ex, data = gss2002))

Distribution parameters:

lambda_ex <- exp(c(- xbar %*% coef(afb_ex)))
lambda_wb <- exp(c(- xbar %*% coef(afb_wb)) / afb_wb$scale) alpha_wb <- 1/afb_wb$scale

Visualization (using Kaplan-Meier at $$\bar x$$ for Cox-PH):

plot(afb_km, conf.int = FALSE)
tt <- 0:400/10
lines(tt, exp(-lambda_ex * tt), col = 3)
lines(tt, exp(-lambda_wb * tt^alpha_wb), col = 4)
lines(survfit(afb_cph), col = 2)

Alternatively: Using R’s own exponential and Weibull functions requires reparametrization, i.e.

pexp(tt, rate = lambda_ex, lower.tail = FALSE)
pweibull(tt, scale = lambda_wb^(-1/alpha_wb), shape = alpha, lower.tail = FALSE)

Conclusions:

• Exponential is clearly inappropriate.
• Weibull and Cox-PH lead to similar results.
• Cox-PH has least restrictive assumptions.

## 9.5 Outlook

Alternative models: Instead of proportional hazards (PH) models, employ accelerated failure time (AFT) models.

$\begin{equation*} \log(t_i) ~=~ x_i^\top \beta ~+~ \varepsilon_i, \end{equation*}$

e.g., leading to lognormal model (if $$\varepsilon$$ is normal) or loglogistic model (if $$\varepsilon$$ is logistic).

Failure time acceleration can be seen via reparametrization

$\begin{equation*} \log(\psi_i \cdot t_i) ~=~ \varepsilon_i, \end{equation*}$

with $$\psi_i = \exp(- x_i^\top \beta)$$ leading to shortened/lengthened survival times for $$\psi_i > 1$$ and $$\psi_i < 1$$, respectively.

Weibull/exponential model is the only model that is both of AFT and PH type.

Unobserved heterogeneity: Account for via additional random effects. In duration analysis called frailty models. Example: Exponential duration with gamma frailty

$\begin{equation*} \lambda_i ~=~ \nu \cdot \exp(x_i^\top \beta), \end{equation*}$

where $$\nu \sim \mathit{Gamma}(\theta, \theta)$$ leading to Pareto type~II distribution.

Further extensions:

• Need to incorporate time-varying regressors $$x_{it}$$.
• Competing risks: Deaths/exits can be caused by different types of events.
Example: Unemployment can be terminated by job or withdrawal from labor market.