Chapter 7 Count Data Models
7.1 Introduction
Examples: Typical economic examples for count data.
- Travel safety.
: Number of accidents for airline in last 10 years.
Potential covariates: Economic indicators for airline, … - Demand for recreation/leisure/travel.
: Number of visits to recreational resort by person in last year.
Potential covariates: Income, costs, costs of alternatives. - Health care utilization.
: Number of doctor visits of person in last year.
Potential covariates: Age, gender, health, insurance, …
- Patents.
: Number of patents of firm .
Potential covariates: Expenditures for research & development, …
7.2 Poisson Distribution
Basic distribution: For
Properties:
- Moments:
, . - Thus: Equidispersion.
- If
and independently: . - If
and : . for . .

Figure 7.1: Poisson probability distribution with different means
Background: Law of rare events.
Approximation: If
Formally: For
Historical example: Von Bortkiewicz (1898), “Das Gesetz der kleinen Zahlen”. Data on number of deaths by horse or mule kicks in the years 1875–1894 in 10 (out of 14 reported) corps of the Prussian army.
Estimation: Mean
Visualization: Observed and fitted (relative) frequencies. Preferably on square-root scale, also called rootogram.
## nDeaths
## 0 1 2 3 4
## 109 65 22 3 1
mu <- weighted.mean(0:4, HorseKicks)
hk <- cbind(count = 0:4, freq = HorseKicks,
rfreq = prop.table(HorseKicks), fit = dpois(0:4, mu))
hk
## count freq rfreq fit
## 0 0 109 0.545 0.543351
## 1 1 65 0.325 0.331444
## 2 2 22 0.110 0.101090
## 3 3 3 0.015 0.020555
## 4 4 1 0.005 0.003135

Figure 7.2: Absolute frequency - number of deaths by horse kicks

Figure 7.3: Relative frequency - number of deaths by horse kicks
barplot(sqrt(hk[,3]), ylim = c(-sqrt(0.005), sqrt(0.6)),
xlab = "Count", ylab = "sqrt(Frequency)")
lines(x1, sqrt(hk[,4]), type = "b", lwd = 2, pch = 19, col = 2)
abline(h = 0)

Figure 7.4: Square root of relative frequency - number of deaths by horse kicks

Figure 7.5: Hanging rootogram - number of deaths by horse kicks
7.3 Poisson GLM
Special case: Poisson distribution belongs to exponential family as
Properties:
- Variance function is
and the dispersion parameter is (signalling equidispersion). - Natural parameter is
, thus the canonical link is . - Hence, due to the resulting model equation
. Hence, the Poisson GLM is by construction heteroscedastic.
Estimation: Log-likelihood, score function, and Hessian.
where
Marginal effects:
Discrete changes:
Thus, for a single unit change
7.3.1 Example: Recreation demand
Cross-section data on the number of recreational boating trips to Lake Somerville, Texas, in 1980, based on a survey administered to 2,000 registered leisure boat owners in 23 counties in eastern Texas.
A data frame containing 659 observations on 8 variables:
Variable | Description |
---|---|
trips |
Number of recreational boating trips. |
quality |
Facility’s subjective quality ranking on scale 1–5. |
ski |
Was the individual engaged in water-skiing? |
income |
Annual household income (in 1,000 USD). |
userfee |
Did the owner pay an annual user fee at Lake Somerville? |
costC |
Expenditure when visiting Lake Conroe (in USD). |
costS |
Expenditure when visiting Lake Somerville (in USD). |
costH |
Expenditure when visiting Lake Houston (in USD). |
## trips quality ski income userfee
## Min. : 0.00 Min. :0.00 no :417 Min. :1.00 no :646
## 1st Qu.: 0.00 1st Qu.:0.00 yes:242 1st Qu.:3.00 yes: 13
## Median : 0.00 Median :0.00 Median :3.00
## Mean : 2.24 Mean :1.42 Mean :3.85
## 3rd Qu.: 2.00 3rd Qu.:3.00 3rd Qu.:5.00
## Max. :88.00 Max. :5.00 Max. :9.00
## costC costS costH
## Min. : 4.3 Min. : 4.8 Min. : 5.7
## 1st Qu.: 28.2 1st Qu.: 33.3 1st Qu.: 29.0
## Median : 41.2 Median : 47.0 Median : 42.4
## Mean : 55.4 Mean : 59.9 Mean : 56.0
## 3rd Qu.: 69.7 3rd Qu.: 72.6 3rd Qu.: 68.6
## Max. :493.8 Max. :491.5 Max. :491.0
7.3.2 Recreation demand: Visualization
Univariate visualization: Histogram or bar plot of discrete count response.

Figure 7.6: Absolute frequency - Number of trips
Bivariate visualization: Standard scatter plots do not work well due to many ties; employ discretized version in spine plots/spinograms instead.
trips1 <- cut(RecreationDemand$trips,
breaks = c(0:2, 4, 8, 90) - 0.5,
labels = c("0", "1", "2-3", "4-7", "8+"))
plot(trips1 ~ ski, data = RecreationDemand)
plot(trips1 ~ quality, data = RecreationDemand,
breaks = 0:6 - 0.5)

Figure 7.7: Spine plots: trips vs. quality, ski, income and userfee

Figure 7.8: Spine plots: trips vs. cost variables
Sometimes continuity-corrected log-response log(y + 0.5)
works well, jittered if necessary.
7.3.3 Recreation demand: Poisson GLM
##
## Call:
## glm(formula = trips ~ ., family = poisson, data = RecreationDemand)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26499 0.09372 2.83 0.0047
## quality 0.47173 0.01709 27.60 < 2e-16
## skiyes 0.41821 0.05719 7.31 2.6e-13
## income -0.11132 0.01959 -5.68 1.3e-08
## userfeeyes 0.89817 0.07899 11.37 < 2e-16
## costC -0.00343 0.00312 -1.10 0.2713
## costS -0.04254 0.00167 -25.47 < 2e-16
## costH 0.03613 0.00271 13.34 < 2e-16
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 4849.7 on 658 degrees of freedom
## Residual deviance: 2305.8 on 651 degrees of freedom
## AIC: 3075
##
## Number of Fisher Scoring iterations: 7
Effects:
library("effects")
rd_eff <- allEffects(rd_pois,
xlevels = list(costC = 0:250, costS = 0:250, costH = 0:250))
plot(rd_eff, rescale.axis = FALSE)

Figure 7.9: Recreation demand effect plots
Log-likelihood:
## 'log Lik.' -1529 (df=8)
Goodness of fit: Here, graphical assessment of observed frequencies and expected frequencies
## 0
## 417
## [1] 276.5

Figure 7.10: Recreation demand: Hanging rootogram
Typical problem: Poisson GLM is often no suitable probability model.
Excess zeros: Response has 63.28% zeros, much more than captured by a Poisson regression.
##
## 0 1 2 3 4 5
## 0.63278 0.10319 0.05766 0.05159 0.02580 0.01973
## [1] 0.4196
Overdispersion: Variance is often (much) larger than mean.
## [1] 17.64
(Note, however, that this does not consider the influence of covariables.)
7.4 Overdispersion and Unobserved Heterogeneity
7.4.1 Overdispersion
Problem: In many applications (in economics and the social sciences)
Consequences: If ignored, point estimates from a Poisson model may or may not be valid (depending on correct specification of the mean equation, see below). However, inference is biased due to incorrect estimates of the covariance matrix (i.e., also the standard errors).
Solutions: Consider more general
- Quasi-Poisson: Employ Poisson model for the mean equation but allow overdispersion in variance equation.
- Robust covariances: Employ Poisson model for the mean equation but leave rest of the likelihood unspecified. Estimate covariances under general heteroscedasticity.
- Negative binomial: Employ more general likelihood (both for mean and variance) that allows for overdispersion.
Test: To test for over- (or under-)dispersion, consider the specification
where
Then, equidispersion corresponds to
Note: However, overdispersion can be relevant (for the inference in the Poisson GLM) before it is significant in a formal test. Thus, if in doubt, do not rely on equidispersion.
Example: Test for overdispersion in rd_pois
using
##
## Overdispersion test
##
## data: rd_pois
## z = 2.9, p-value = 0.002
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
## alpha
## 1.316
Example: Test for overdispersion in rd_pois
using
##
## Overdispersion test
##
## data: rd_pois
## z = 2.4, p-value = 0.008
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
## alpha
## 5.566
Example: Test for overdispersion in rd_pois
using
##
## Overdispersion test
##
## data: rd_pois
## z = 2.4, p-value = 0.008
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion
## 6.566
7.4.2 Quasi-Poisson GLM
Formally: Test idea can be turned into a formal model.
- The score function for
(and thus the estimating equations and mean function) are assumed to be correct. - Relax remaining likelihood to allow overdispersion.
where
Properties:
- This is called quasi-Poisson model in the GLM literature.
- The quasi-ML estimator coincides with the ML estimator
because the score function is unchanged. - The model is not associated with a likelihood. Conditional mean but not conditional probability model.
- Estimate
required for inference.
Typically: Estimate
where
In R: A quasipoisson
family to be plugged into glm()
is provided. By default, it uses the Pearson residuals
##
## Call:
## glm(formula = trips ~ ., family = quasipoisson, data = RecreationDemand)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.26499 0.23521 1.13 0.2603
## quality 0.47173 0.04289 11.00 < 2e-16
## skiyes 0.41821 0.14353 2.91 0.0037
## income -0.11132 0.04916 -2.26 0.0239
## userfeeyes 0.89817 0.19822 4.53 7.0e-06
## costC -0.00343 0.00782 -0.44 0.6613
## costS -0.04254 0.00419 -10.15 < 2e-16
## costH 0.03613 0.00680 5.31 1.5e-07
##
## (Dispersion parameter for quasipoisson family taken to be 6.298)
##
## Null deviance: 4849.7 on 658 degrees of freedom
## Residual deviance: 2305.8 on 651 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 7
## 'log Lik.' NA (df=9)
## [1] 6.298
7.4.3 Sandwich covariances
Similar idea: As before, assume that score function for
For correctly specified ML: Information matrix equality holds (under regularity conditions). Observed/expected information and outer product of gradients coincide.
Under misspecification: To account for potential misspecifications of the remaining likelihood – including but not limited to
modelsummary(list("Poisson sandwich vcov" = rd_pois),
vcov=sandwich, fmt=3, estimate="{estimate}{stars}")
Poisson sandwich vcov | |
---|---|
(Intercept) | 0.265 |
(0.432) | |
quality | 0.472*** |
(0.049) | |
skiyes | 0.418* |
(0.194) | |
income | -0.111* |
(0.050) | |
userfeeyes | 0.898*** |
(0.247) | |
costC | -0.003 |
(0.015) | |
costS | -0.043*** |
(0.012) | |
costH | 0.036*** |
(0.009) | |
Num.Obs. | 659 |
AIC | 3074.9 |
BIC | 3110.8 |
Log.Lik. | -1529.431 |
RMSE | 6.09 |
Std.Errors | Custom |
Note that ski
and income
are not significant at 1% level anymore.
Comparison: All methods employ same conditional mean model and lead to same
round(sqrt(cbind(
"Poisson" = diag(vcov(rd_pois)),
"Quasi-Poisson" = diag(vcov(rd_qpois)),
"Sandwich" = diag(sandwich(rd_pois))
)), digits = 3)
## Poisson Quasi-Poisson Sandwich
## (Intercept) 0.094 0.235 0.432
## quality 0.017 0.043 0.049
## skiyes 0.057 0.144 0.194
## income 0.020 0.049 0.050
## userfeeyes 0.079 0.198 0.247
## costC 0.003 0.008 0.015
## costS 0.002 0.004 0.012
## costH 0.003 0.007 0.009
Here, quasi-Poisson and sandwich standard errors are similar, but Poisson standard errors are much too small.
7.4.4 Unobserved heterogeneity
Question: What are the (potential) sources of overdispersion?
Answer: Unobserved heterogeneity. Not all relevant regressors are known, some remaining deviations
where
This implies:
justifying the previously considered variance equations.
Conditional distribution: To obtain the full distribution (rather than the first two moments only),
the full distribution
which corresponds to a continuous mixture distribution.
Question: What are useful assumptions for
Answer: Preferably such that there is a closed form solution for
Properties: If
with
Note:
Here: Employ
Now: Continuous mixture of Poisson distribution
which has
Special cases:
- Poisson:
. - Geometric:
.
7.4.5 Poisson vs. negative binomial distribution
Again, the probability distribution function of the Poisson distribution with different means looks the following:
Figure 7.11: Poisson probability distribution with different means
While the PDF of the negative binomial distribution with the same

Figure 7.12: Negative binomial PDF with different means, theta=10
Regression: To employ the NB distribution in a regression, the natural mean equation is employed:
Variance: For the variance two different specifications are commonly used.
- NB2: Assume constant
and thus . - NB1: Assume varying
such that and thus .
NB1 and NB2 coincide without regressors but lead to differing models with regressors.
Note: For known
In R: Package MASS provides negative.binomial(theta)
family that can be plugged into glm()
. glm.nb()
is the extension that also estimates theta
.
##
## Call:
## glm.nb(formula = trips ~ ., data = RecreationDemand, init.theta = 0.7292568331,
## link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.12194 0.21430 -5.24 1.6e-07
## quality 0.72200 0.04012 18.00 < 2e-16
## skiyes 0.61214 0.15030 4.07 4.6e-05
## income -0.02606 0.04245 -0.61 0.539
## userfeeyes 0.66917 0.35302 1.90 0.058
## costC 0.04801 0.00918 5.23 1.7e-07
## costS -0.09269 0.00665 -13.93 < 2e-16
## costH 0.03884 0.00775 5.01 5.4e-07
##
## (Dispersion parameter for Negative Binomial(0.7293) family taken to be 1)
##
## Null deviance: 1244.61 on 658 degrees of freedom
## Residual deviance: 425.42 on 651 degrees of freedom
## AIC: 1669
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.7293
## Std. Err.: 0.0747
##
## 2 x log-likelihood: -1651.1150

Figure 7.13: Quality effect plot - negative binomial model

Figure 7.14: Income effect plot - negative binomial model

Figure 7.15: Hanging rootogram - negative binomial model
Likelihood ratio test: Test for overdispersion by comparing log-likelihoods of Poisson and NB2 model.
## Likelihood ratio test
##
## Model 1: trips ~ quality + ski + income + userfee + costC + costS + costH
## Model 2: trips ~ quality + ski + income + userfee + costC + costS + costH
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 8 -1529
## 2 9 -826 1 1408 <2e-16
Note: The
7.5 Censoring and Truncation
Common problem: In microeconometrics, data are often incomplete or only partially observable.
Previously: Unobserved heterogeneity can be interpreted as lack of observability of explanatory variables.
Now: Limited data observability for dependent variables, e.g.,
- Truncation: Some observations omitted entirely from sample.
- Censoring: Some observations only known to be in limited range.
Truncation:
- Example: Asking unemployed workers about the number of unemployment spells, yields at least one unemployment spell.
- Typical case: Left-truncation at zero.
- Probability density function: For
Censoring:
- Example: Survey with answer category more than
” (e.g., patents per year). - Here: Right-censoring at
in original variable . - Observation mechanism:
- Thus: Observe
for . Otherwise, exploit information that . - Probability density function: Given
7.6 Hurdle and Zero-Inflated Count Data Models
7.6.1 Hurdle count data models
Idea: Account for excess (or lack of) zeros by two-part model.
- Is
equal to zero or positive? “Is the hurdle crossed?” - If
, how large is ?
Formally:
- Zero hurdle:
. Binary part given by count distribution right-censored at (or simply Bernoulli variable). - Count part:
. Count part given by count distribution left-truncated at .
Combined: Probability density function for hurdle model,
Conditional mean: Using a log-link for the count component yields
Warning: Literature has different parametrizations for
Example: Likelihood for Poisson hurdle model has two parts,
Remarks:
- Joint likelihood is
. - Both parts can be maximized separately.
- Just requires software for binary response models and truncated count response models.
In R: hurdle()
in pscl.
Formula: Two-part formula y ~ x1 + x2 | z1 + z2 + z3
. Also, y ~ x1 + x2
is short for y ~ x1 + x2 | x1 + x2
.
Distributions:
dist
specifies :"poisson"
(default),"negbin"
or"geometric"
.zero.dist
specifies :"binomial"
(default, with varying links) or"poisson"
/"negbin"
/"geometric"
.
Hurdle test: If hurdletest()
.
rd_hp0 <- hurdle(trips ~ ., data = RecreationDemand,
dist = "poisson", zero = "poisson")
summary(rd_hp0)
##
## Call:
## hurdle(formula = trips ~ ., data = RecreationDemand, dist = "poisson",
## zero.dist = "poisson")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -6.407 -0.357 -0.219 -0.169 13.331
##
## Count model coefficients (truncated poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.15040 0.11178 19.24 < 2e-16
## quality 0.04426 0.02385 1.86 0.063
## skiyes 0.46702 0.05879 7.94 2.0e-15
## income -0.09770 0.02057 -4.75 2.1e-06
## userfeeyes 0.60069 0.07952 7.55 4.2e-14
## costC 0.00141 0.00396 0.36 0.722
## costS -0.03661 0.00204 -17.92 < 2e-16
## costH 0.02388 0.00347 6.89 5.6e-12
## Zero hurdle model coefficients (censored poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.49906 0.27785 -8.99 < 2e-16
## quality 0.84848 0.05261 16.13 < 2e-16
## skiyes 0.44090 0.18443 2.39 0.01682
## income 0.00381 0.05039 0.08 0.93974
## userfeeyes 3.57526 73.99944 0.05 0.96147
## costC 0.00722 0.01700 0.42 0.67115
## costS -0.05582 0.00920 -6.07 1.3e-09
## costH 0.04827 0.01329 3.63 0.00028
##
## Number of iterations in BFGS optimization: 50
## Log-likelihood: -1.18e+03 on 16 Df
## Wald test for hurdle models
##
## Restrictions:
## count_((Intercept) - zero_(Intercept) = 0
## count_quality - zero_quality = 0
## count_skiyes - zero_skiyes = 0
## count_income - zero_income = 0
## count_userfeeyes - zero_userfeeyes = 0
## count_costC - zero_costC = 0
## count_costS - zero_costS = 0
## count_costH - zero_costH = 0
##
## Model 1: restricted model
## Model 2: trips ~ .
##
## Res.Df Df Chisq Pr(>Chisq)
## 1 651
## 2 643 8 449 <2e-16
rd_hp <- hurdle(trips ~ . | quality + income,
data = RecreationDemand, dist = "poisson")
rd_hnb <- hurdle(trips ~ . | quality + income,
data = RecreationDemand, dist = "negbin")
modelsummary(list("Negbin hurdle"=rd_hnb,
"Poisson hurdle" = rd_hp),
fmt=3, estimate="{estimate}{stars}")
Negbin hurdle | Poisson hurdle | |
---|---|---|
count_(Intercept) | 0.842* | 2.150*** |
(0.383) | (0.112) | |
count_quality | 0.172* | 0.044+ |
(0.072) | (0.024) | |
count_skiyes | 0.622** | 0.467*** |
(0.190) | (0.059) | |
count_income | -0.057 | -0.098*** |
(0.065) | (0.021) | |
count_userfeeyes | 0.576 | 0.601*** |
(0.385) | (0.080) | |
count_costC | 0.057** | 0.001 |
(0.022) | (0.004) | |
count_costS | -0.078*** | -0.037*** |
(0.012) | (0.002) | |
count_costH | 0.012 | 0.024*** |
(0.015) | (0.003) | |
zero_(Intercept) | -2.766*** | -2.766*** |
(0.362) | (0.362) | |
zero_quality | 1.503*** | 1.503*** |
(0.100) | (0.100) | |
zero_income | -0.045 | -0.045 |
(0.079) | (0.079) | |
Num.Obs. | 659 | 659 |
R2 | 0.998 | 0.848 |
R2 Adj. | 0.998 | 0.846 |
AIC | 1554.2 | 2398.7 |
BIC | 1608.1 | 2448.1 |
RMSE | 23.38 | 5.35 |
7.6.2 Zero-inflated count data models
Different idea: Account for excess zeros via additional zero probability
Formally: Mixture model with two components.
Conditional mean: Based on binomial GLM for unobserved probability for zero point mass
Remarks:
- Zeros have two sources: (1) Zero for sure in point mass. (2) Zero by chance in count component.
- Special cases: Zero-inflated Poisson (ZIP), zero-inflated negative binomial (ZINB).
- Likelihood cannot be separated as for hurdle models.
- ML estimation can be performed using EM (expectation-maximization) algorithm.
In R: zeroinfl()
from package pscl with analogous interface to hurdle()
.
rd_zip <- zeroinfl(trips ~ . | quality + income,
data = RecreationDemand, dist = "poisson")
rd_zinb <- zeroinfl(trips ~ . | quality + income,
data = RecreationDemand, dist = "negbin")
modelsummary(list("Zero inflated negbin"=rd_zinb,
"Zero inflated Poisson" = rd_zip),
fmt=3, estimate="{estimate}{stars}")
Zero inflated negbin | Zero inflated Poisson | |
---|---|---|
count_(Intercept) | 1.097*** | 2.099*** |
(0.257) | (0.111) | |
count_quality | 0.169** | 0.034 |
(0.053) | (0.024) | |
count_skiyes | 0.501*** | 0.472*** |
(0.134) | (0.058) | |
count_income | -0.069 | -0.100*** |
(0.044) | (0.021) | |
count_userfeeyes | 0.543+ | 0.610*** |
(0.283) | (0.079) | |
count_costC | 0.040** | 0.002 |
(0.015) | (0.004) | |
count_costS | -0.066*** | -0.038*** |
(0.008) | (0.002) | |
count_costH | 0.021* | 0.025*** |
(0.010) | (0.003) | |
zero_(Intercept) | 5.743*** | 3.292*** |
(1.556) | (0.516) | |
zero_quality | -8.307* | -1.914*** |
(3.682) | (0.206) | |
zero_income | -0.258 | -0.045 |
(0.282) | (0.108) | |
Num.Obs. | 659 | 659 |
R2 | 0.986 | 0.851 |
R2 Adj. | 0.986 | 0.849 |
AIC | 1467.9 | 2383.6 |
BIC | 1521.8 | 2433.0 |
RMSE | 14.87 | 5.33 |
## df AIC
## rd_hp 11 2448
## rd_hnb 12 1608
## rd_zip 11 2433
## rd_zinb 12 1522
round(c(
"Observed" = sum(RecreationDemand$trips < 1),
"Poisson" = sum(dpois(0, fitted(rd_pois))),
"NB" = sum(dnbinom(0, mu = fitted(rd_nb), size = rd_nb$theta)),
"Hurdle-NB"= sum(predict(rd_hnb, type = "prob")[,1]),
"ZINB" = sum(predict(rd_zinb, type = "prob")[,1])
))
## Observed Poisson NB Hurdle-NB ZINB
## 417 277 423 417 433

Figure 7.16: Hanging rootogram: Hurdle and zero inflated negative binomial model

Figure 7.17: Effect of quality on trips: Hurdle and negative binomial model
7.7 Finite Mixture Count Data Models
Idea: Population is a mixture of
Formally: For prior weights/probabilities
with
Interpretation: Clustering based on model, i.e., attractive for separating groups (such as healthy/ill individuals).
Terminology: Varies with community and basis model
Problems:
- No generally accepted criterion/heuristic for choosing
. - The latent weights
are only defined up to permutation. Sometimes ordering is induced by requiring . - Estimation would be easy if “hard” cluster/class memberships were known (corresponding to models of type
y ~ class * x
in R).
Task: Need to optimize full log-likelihood
Solution: Typically achieved using expectation-maximization (EM) algorithm, a general strategy for optimizing complex likelihoods with latent variables.
E step: Expectation of posterior class probabilities
M step: Maximization of log-likelihood in each component
Iterate until convergence.
In R:
- Package flexmix provides flexible mixture model framework.
- Package provides E step, leverages available model fitting functions for M step.
- Note that M step corresponds to fitting model
y ~ x
withweights = w_j
( ). - Alternatively, “hard” (instead of “soft”) classification can be used by setting weights to 0 or 1 (for class
with highest weight). - New drivers providing M~steps can be plugged into package flexmix.
Example: Try to separate “high” users from “low” users in RecreationDemand
data.
library("flexmix")
set.seed(4002)
rd_fm <- flexmix(trips ~ ., data = RecreationDemand, k = 2,
model = FLXMRglm(trips ~ ., family = "poisson"))
summary(rd_fm)
##
## Call:
## flexmix(formula = trips ~ ., data = RecreationDemand, k = 2,
## model = FLXMRglm(trips ~ ., family = "poisson"))
##
## prior size post>0 ratio
## Comp.1 0.81 589 640 0.920
## Comp.2 0.19 70 606 0.116
##
## 'log Lik.' -949.9 (df=17)
## AIC: 1934 BIC: 2010

Figure 7.18: Rootogram of posterior probabilities
## $Comp.1
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.87856 0.25400 -7.40 1.4e-13
## quality 0.79767 0.04712 16.93 < 2e-16
## skiyes 0.23247 0.13995 1.66 0.09669
## income -0.01074 0.04259 -0.25 0.80088
## userfeeyes 1.21230 0.26737 4.53 5.8e-06
## costC -0.03888 0.01108 -3.51 0.00045
## costS -0.01446 0.00367 -3.94 8.0e-05
## costH 0.05146 0.00990 5.20 2.0e-07
##
## $Comp.2
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.9311 0.3403 5.67 1.4e-08
## quality -0.1178 0.0525 -2.24 0.02491
## skiyes 1.3298 0.1799 7.39 1.4e-13
## income 0.0115 0.0497 0.23 0.81737
## userfeeyes -0.5547 0.1682 -3.30 0.00097
## costC 0.0813 0.0112 7.23 4.8e-13
## costS -0.1612 0.0191 -8.44 < 2e-16
## costH 0.0531 0.0119 4.48 7.5e-06
7.8 Summary and Outlook
7.8.1 Summary
Available strategies:
- Basic model: Poisson. Often too restrictive as conditional probability model (in microeconomic applications).
- Conditional mean model: Keep mean equation, give up remaining likelihood, employ robust standard errors.
- Overdispersion: Negative binomial (as conditional probability model) or quasi-Poisson (as conditional mean model).
- Excess zeros: Hurdle model or zero-inflation model (based on Poisson or negative binomial).
- Unobserved clusters: Finite mixture models.
7.8.2 Outlook
Further methods:
- Models without zeros: zero-truncation models. (In R: package countreg on R-Forge.)
- Alternative continuous mixtures: Poisson-lognormal, PIG, etc.
- Generalizations of negative binomial distribution: NegBin-
, NegBin-H (multiple index model with conditional dispersion parameter). - Other generalized count distributions, e.g., generalized Poisson etc.
- Multivariate count response models.
- Other semiparametric models, …