Chapter 11 Testing principles

Question: What is a general strategy for constructing a statistical test for a given problem?

In particular: What is a general strategy for constructing a statistical test for a 2-sample problem (including tests for location differences)?

11.1 Testing procedures

General strategy:

  1. Establish a null hypothesis and alternative.
  2. Compute a test statistic (based on the data) which captures deviations from the null hypothesis: It should be small under the null hypothesis but large or extreme under the alternative.
  3. Derive the distribution of the test statistic under the null hypothesis (based on assumptions about the theoretical distribution of the data) for computing critical values and \(p\)-values, respectively.

Example: Location difference in 2 samples.

  1. Null hypothesis: \(\mu_A = \mu_B\) vs. alternative \(\mu_A \neq \mu_B\).
  2. Test statistic: \((\overline y_A - \overline y_B)/\widehat{SD}\) (scaled difference of means).
  3. Null distribution: \(t\)-distribution or standard normal distribution (CLT).

In particular:

  • Null hypothese/alternative: The null hypothesis that 2 samples come from the same distribution can be tested against many alternatives.
    Example: Differences in mean or dispersion or skewness etc.
  • Test statistic: Various choices are possible.
    Example: Difference of location measures: (trimmed) means, medians, etc.
    Example: Ratio of dispersion measures.
    Example: Difference of skewness coefficients.
  • Null distribution: Only for some test statistics a CLT is available. And even if there is an asymptotic approximation, an exact test would be preferable.

Wanted: General strategy for obtaining the exact null distribution of a given test statistic.

Solution: Employ the permutation distribution.

11.2 Robust statistics

Problem: If the (empirical) distributions have “heavy” tails or if there are individual outliers, this can severely affect the empirical mean and it may become a poor location measure.

Example: Although the location in the center of the distribution differs clearly (\(Q_{A, 0.5} - Q_{B, 0.5} = -1.96\)), the corresponding means differ far less clearly (\(\overline y_A - \overline y_B = -0.96\)).

Solution: Employ a robust test statistic such as the median difference. \(p\)-values can be obtained from the permutation distribution (\(p = 0.047\)).

Comparison: Due to the outliers the \(t\)-test has little power (\(p = 0.412\)). Using the permutation distribution for the \(t\)-statistic yields almost equivalent results (\(p = 0.417\)).

Thus: The following two concepts should be distinguished.

  • Permutation distribution: The null distribution (for computing critical values and \(p\)-values) can be obtained from the permutation principle even if no CLT is available or appropriate.
  • Robust statistics: Test statistics on which outliers only have limited effects often have higher power if the data are very non-normal.

Note: Both concepts are often discussed jointly because robust tests are often conducted as permutation tests. However, both concepts can be applied separately.

11.3 \(t\)-test under different assumptions

Motivation: To introduce the general permutation principle for obtaining the null distribution, it is applied to the \(t\)-statistic \[ t \quad = \quad \frac{\overline y_A - \overline y_B}{\widehat{SD}} \] and compared to previous approaches based on different assumptions about the empirical data.

  • Normal distribution: Asymptotic.
  • \(t\)-distribution: Exact under normality.
  • Permutation distribution: Exact given the observed data.

11.3.1 Asymptotic \(t\)-test

Previous assumptions: Two samples (of size \(n_A\) and \(n_B\)) from the random variables \(Y_A\) and \(Y_B\). No further assumptions about type and shape of the distributions of \(Y_A\) and \(Y_B\) are needed for the central limit theorem (CLT).

CLT: Under the null hypothesis the \(t\)-statistic is asymptotically standard normal.

Advantages: Few assumptions, very versatile.

Disadvantages: As \(n_A \rightarrow \infty\) and \(n_B \rightarrow \infty\) the approximation through the standard normal distribution keeps improving. However, it is unclear how well the approximation works in (small) finite samples.

11.3.2 \(t\)-test under normality

Moreover: If both random variables \(Y_A\) and \(Y_B\) are exactly normally distributed with equal variances \(\sigma_A^2 = \sigma_B^2\) and (under the null hypothesis) equal expectations \(\mu_A = \mu_B\), then the \(t\)-statistic is exactly \(t\)-distributed.

Advantages: Exact distribution.

Disadvantages: Only valid under normality. Whether this is actually the case, is typically unclear and hard to assess.

11.3.3 Practical remarks regarding \(t\)-tests

Remark: \(t\)-distribution and normal distribution already lead to very similar results for moderately large samples. Both work well when the samples “look normal”, i.e.,

  • (approximately) symmetric,
  • unimodal,
  • few extreme observations or outliers.

But: If these properties do not hold, then the \(t\)-statistic itself is not well-suited. Furthermore, the significance level might be violated and/or the test might have low power under the alternative.

Transformation: For asymmetric samples, a log or square-root transformation can often help to transform the sample to approximately symmetric.

11.4 Permutation tests

11.4.1 Permutation distribution for the \(t\)-test

Idea:

  • All computations are conditioned on the observed data.
  • The corresponding empirical distribution is employed for obtaining the null distribution of the \(t\)-statistic.

Assumption: Under the null hypothesis the distributions of \(Y_A\) and \(Y_B\) are identical.

Thus: The assignment of labels \(A\) vs. \(B\) is completely random and without information. Labels can be permuted or re-randomized without changing the null distribution of the test statistic.

Name: Permutation principle leading to the permutation test.

Permutations: For sample sizes of \(n_A = 6\) and \(n_B = 8\) there are \({ 14 \choose 6 } = 3003\) possible permutations of the response variable \(Y\) onto the explanatory variable \(X\) (i.e., its labels \(A\) and \(B\)).

Permutation distribution: For each of these possible permutations, evaluate the test statistic.

\(p\)-value: Proportion of statistics that deviate at least as much from \(0\) as the observed statistic. Here: \(p = 708/3003 = 0.236\). This \(p\)-value is virtually the same as the \(p\)-value based on the asymptotic normal distribution (\(p = 0.240\)) or \(t\)-distribution (\(p = 0.262\)), respectively.

In general: The same principle can be applied to any statistic that contrasts two distributions: median difference, ratio of interquartile ranges, etc. In all cases the test statistic is evaluated on the observed data and all possible permutations to assess whether the observed statistic is extreme within the permutation distribution.

Advantage: Versatile, easy to interpret, exact distribution, independent of the underlying distribution of the data (even for finite \(n\)).

Disadvantage: Computationally intensive. The number of possible permutations grows exponentially with the number of observations.

Solution: Do not evaluate all possible evaluations but just a very large number, say 10,000 or similar.

In R: With package coin for “\(\underline{co}\)nditional \(\underline{in}\)ference”.

library("coin")
oneway_test(y ~ x, data = dat1, distribution = approximate(10000))
## 
##  Approximative Two-Sample Fisher-Pitman Permutation Test
## 
## data:  y by x (A, B)
## Z = 1.159, p-value = 0.253
## alternative hypothesis: true mu is not equal to 0
oneway_test(y ~ x, data = dat2, distribution = approximate(10000))
## 
##  Approximative Two-Sample Fisher-Pitman Permutation Test
## 
## data:  y by x (A, B)
## Z = -2.474, p-value = 0.0078
## alternative hypothesis: true mu is not equal to 0

Remark: Instead of the pooled estimator for \(\widehat{SD}\) this uses the 1-sample estimator.

11.5 Tutorial

11.5.1 Setup

Load the MLA data set. (The data set can be downloaded as MLA.csv or MLA.rda.)

R
load("MLA.rda")
head(MLA, 3)
##    invest      gender male     age treatment grade arrangement
## 1 70.0000 female/male  yes 11.7533      long   6-8        team
## 2 28.3333 female/male  yes 12.0866      long   6-8        team
## 3 50.0000 female/male  yes 11.7950      long   6-8        team
Python
# Make sure that the required libraries are installed.
# Import the necessary libraries and classes:
import pandas as pd 
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

pd.set_option("display.precision", 4) # Set display precision to 4 digits in pandas and numpy.
# Load the MLA dataset
MLA = pd.read_csv("MLA.csv", index_col=False, header=0) 

11.5.2 Permutation tests with coin

An alternative approach for obtaining the null distribution for the \(t\)-test is the permutation principle. Thus, rather than assuming that the 2 samples come from normal distributions or relying on the central limit theorem, we can employ the permutation distribution conditional on the observed samples. The simple underlying idea is that under the null hypothesis it is random which observation falls into which of the 2 samples and may thus be permuted (i.e., re-randomized). First, we show how this can be done very conveniently using the coin package (for conditional inference) before implementing an entire permutation test from scratch in the next section.

The central workhorse function of the coin package is independence_test(y ~ x) which conducts an independence test for any combination of y and x variable. The function does so by employing transformations that are suitable for the scale of measurement of y and x, respectively.

Doing so the framework encompasses \(t\)-test, \(F\)-test, \(\chi^2\)-test, Wilcoxon-Mann-Whitney test, Spearman test and many others. Typical transformations include the identity, ranks, or dummy variables, among others. For all of the resulting independence test different null distributions can be used. The default is the asymptotic distribution which is either standard normal or the \(\chi^2\)-distribution. Alternatively, the null distribution can be approximated by sampling a (large) number of permutations. For certain tests - such as the Wilcoxon-Mann-Whitney test - the coin package also provides efficient algorithms to compute the permutation distribution exactly.

In addition to the core independence_test() function the coin package also comes with a number of convenience interfaces that implement important special cases. For example, oneway_test() is the counterpart to oneway.test() in base R or wilcox_test() is the counterpart to wilcox.test() and so on.

Here, we employ the function oneway_test() with 10,000 permutations for illustration. To make results exactly reproducible a random seed is specified. The result is very similar to the asymptotic tests above.

R

The coin package can be easily installed from CRAN “as usual”, i.e., using the packages menu in RStudio or generally with the command install.packages("coin") and subsequently loaded in the current session with library("coin").

library("coin")
oneway_test(invest ~ treatment, data = MLA, distribution = approximate(10000))
## 
##  Approximative Two-Sample Fisher-Pitman Permutation Test
## 
## data:  invest by treatment (long, short)
## Z = -1.268, p-value = 0.206
## alternative hypothesis: true mu is not equal to 0
oneway_test(invest ~ arrangement, data = MLA, distribution = approximate(10000))
## 
##  Approximative Two-Sample Fisher-Pitman Permutation Test
## 
## data:  invest by arrangement (single, team)
## Z = -6.894, p-value <0.0001
## alternative hypothesis: true mu is not equal to 0
Python

Unfortunately, there is no Python equivalent of the R package coin that implements several permutation tests.

Hence, we provide a Python interface to coin based on the Python package r_functions.

Setup

  1. Install R (https://discdown.org/rprogramming/installation.html).
  2. Install the R packages coin and jsonlite via Rscript -e "install.packages(c('coin', 'jsonlite'), repos='https://cloud.r-project.org/', lib = .Library)" in your shell/prompt.
  3. Install the Python package r_functions via pip install r-functions.
  4. Download the coin.R file in the same folder of your notebook.

Usage

from r_functions import create

# create Python functions bound to R functions
oneway_test = create("coin.R", "oneway_test") # oneway ANOVA

###
#    oneway_test can be used as a Python function to compute oneway ANOVA using the permutation principle
#    
#    :param formula: a string containing the R formula. For example:
#        formula ="invest ~ treatment" is the same as running anova_oneway with 
#        data=MLA["invest"] and groups=MLA["treatment"]
#    
#    :param data: a string containing the path to your dataset
#    
#    :param distribution: string ("Asymptotic" - default, "Approximative", or "Exact") or integer (number of permutations)##
#
#    :returns: a dictionary with 'method', 'alternative', 'variables', 'statistic', and 'pvalue'
###

res = oneway_test(formula="invest ~ treatment", data="MLA.csv", distribution=10000)
print(res)
## {'method': 'Approximative Two-Sample Fisher-Pitman Permutation Test', 'data': 'invest by treatment (long, short)', 'statistic': -1.2681, 'pvalue': 0.2095, 'alternative': 'true mean is not equal to 0'}
res = oneway_test(formula="invest ~ arrangement", data="MLA.csv", distribution=10000)
print(res)
## {'method': 'Approximative Two-Sample Fisher-Pitman Permutation Test', 'data': 'invest by arrangement (single, team)', 'statistic': -6.8935, 'pvalue': 0, 'alternative': 'true mean is not equal to 0'}

11.5.3 Permutation tests by hand

Instead of using the coin package it is also rather straightforward to conduct a permutation test by hand. For illustration, a permutation \(t\)-test is implemented from scratch below.

Three steps are used:

  1. The function t_stat(y, x) computes the standard 2-sample \(t\)-statistic based on a numeric “response” y and binary explanatory factor x.
  2. In addition to evaluating t_stat(y, x) on the original y and x, it is also evaluated on a large number of cases (e.g., 10,000) where the response is permuted: t_stat(sample(y), x).
  3. From the resulting permutation distribution of the \(t\)-statistic \(p\)-values and critical values can be computed.
R

The function for step 1 employs tapply(y, x, function) for various functions to compute mean, sd (standard deviation), and length (sample size) in each of the 2 samples. These are then simply plugged into the equation for the \(t\)-statistic (using the pooled standard deviation).

t_stat <- function(y, x)
{
  ## observed n, mean, sd
  n <- tapply(y, x, length)
  m <- tapply(y, x, mean)
  s <- tapply(y, x, sd)
  
  ## pooled variance and t-statistic
  pvar <- (n[1] + n[2]) / (n[1] * n[2]) *
    ((n[1] - 1) * s[1]^2 + (n[2] - 1) * s[2]^2) / (n[1] + n[2] - 2)
  tstat <- (m[1] - m[2]) / sqrt(pvar)

  return(tstat)
}
Python
def t_stat(y, x):
    data = pd.DataFrame({"y": y, "x": x})
    group_data = data.groupby("x")["y"]
    
    # observed n, mean, sd
    n = group_data.size().to_numpy()
    m = group_data.mean().to_numpy()
    s = group_data.std().to_numpy()
    
    # pooled variance and t-statistic
    pvar = (n[0] + n[1]) / (n[0] * n[1]) * \
    ((n[0] - 1) * s[0]**2 + (n[1] - 1) * s[1]**2) / (n[0] + n[1] - 2)
    tstat = (m[0] - m[1]) / np.sqrt(pvar)
    
    return tstat

In step 2 the function t_stat() is applied in nperm permutations, defaulting to nperm = 10000. As already stated above the permutation that breaks up the association of y and x (if any) can be easily carried out by the basic sample() function.

R
t_test <- function(y, x, nperm = 10000)
{
  ## observed statistic
  stat <- t_stat(y, x)
  
  ## statistic for permutations
  perm <- rep(0, nperm)  
  for(i in 1:nperm) perm[i] <- t_stat(sample(y), x)
  
  ## return everything in a list
  ## (with class htest for automatic printing)
  rval <- list(
    statistic = c(t = stat),
    p.value = mean(abs(perm) >= abs(stat)),
    method = "Permutation t-test",
    permutations = perm
  )
  class(rval) <- "htest"
  return(rval)
}
Python
def t_test(y, x, nperm = 10000):
    # observed statistic
    stat = t_stat(y, x)
      
    ## statistic for permutations
    perm = np.zeros(nperm)
    for i in range(1, nperm): 
        perm[i] = t_stat(np.random.permutation(y), x)
        
    # return everything in a dictionary
    tperm = {"statistic":[], "pvalue":[], "method":[], "permutations":[]}
    tperm["statistic"].append(stat)
    tperm["pvalue"].append(np.mean(np.abs(perm) >= np.abs(stat)))
    tperm["method"].append("Permutation t-test")
    tperm["permutations"].append(perm)
    
    return tperm

Finally, in step 3, the permutations are leveraged to compute a 2-sided \(p\)-value, which corresponds to the proportion of more extreme permutations that are at least as high in absolute value as the originally-observed statistic. The class "htest" is assigned to the list of results which has a print() method displaying results of hypothesis tests in a standardized format.

Applying the t_test() function to the assessment of the treatment in the MLA experiment, yields virtually the same result as the oneway.test() and oneway_test() functions above. (Replicating the 1-sided t.test() result would require implementation of an alternative = argument in t_test() which was kept deliberately simple here.)

R
set.seed(403)
tperm <- t_test(MLA$invest, MLA$treatment)
tperm
## 
##  Permutation t-test
## 
## data:  
## t.long = -1.269, p-value = 0.201
Python
np.random.seed(403)

tperm = t_test(y=MLA["invest"], x=MLA["treatment"], nperm=10000)

print(tperm)
## {'statistic': [np.float64(-1.268777972414684)], 'pvalue': [np.float64(0.2001)], 'method': ['Permutation t-test'], 'permutations': [array([ 0.        , -0.30267799, -1.1674391 , ..., -0.07128435,
##        -1.1528186 , -0.53832334])]}

The resulting non-significant \(p\)-value is very close to the \(p\)-values from the \(t\)-distribution and asymptotic standard normal distribution:

R
2 * pnorm(-abs(tperm$statistic))
##  t.long 
## 0.20452
2 * pt(-abs(tperm$statistic), df = nrow(gsa) - 2)
##   t.long 
## 0.226768
Python
from scipy import stats

print(2 * stats.norm.cdf(-np.abs(tperm["statistic"]))) 
## [0.20452026]

The results for the arrangement test are:

R
set.seed(403)
tperm2 <- t_test(MLA$invest, MLA$arrangement)
tperm2
## 
##  Permutation t-test
## 
## data:  
## t.single = -7.194, p-value <2e-16
Python
np.random.seed(403)

tperm2 = t_test(y=MLA["invest"], x=MLA["arrangement"], nperm=10000)

print(tperm2)
## {'statistic': [np.float64(-7.194441325422449)], 'pvalue': [np.float64(0.0)], 'method': ['Permutation t-test'], 'permutations': [array([ 0.        ,  1.26605982, -0.46202587, ..., -1.30663328,
##        -0.2600691 ,  0.08704257])]}

This shows that for a rather large sample with unimodal distribution that is not very skewed all three approaches for the null distribution (permutation distribution, normal approximation, \(t\)-distribution) yield very similar results and lead to qualitatively the same conclusions. Graphically, the three distributions can be compared in a histogram/density as shown below.

R
hist(tperm$permutations, freq = FALSE, col = "lightgray", main = "")
abline(v = tperm$statistic, col = 2, lwd = 2)
s <- seq(-8, 8, by = 0.1)
lines(s, dnorm(s), col = 4)
lines(s, dt(s, nrow(MLA) - 2), col = 4, lty = 2)
hist(tperm2$permutations, xlim = c(-7.5, 4.5), freq = FALSE, col = "lightgray", main = "")
abline(v = tperm2$statistic, col = 2, lwd = 2)
lines(s, dnorm(s), col = 4)
lines(s, dt(s, nrow(MLA) - 2), col = 4, lty = 2)

Python
fig, ax = plt.subplots(1, 2)

ax[0].hist(tperm["permutations"], density=True, color = 'lightgrey', edgecolor = 'black', bins=14)
ax[0].axvline(x=tperm["statistic"], color='r')
ax[0].set_xlabel('tperm - statistic')
ax[0].set_ylabel('Density')
density = stats.gaussian_kde(tperm["permutations"])
xs = np.linspace(-5,5,1000)
ax[0].plot(xs, density(xs))

ax[1].hist(tperm2["permutations"], density=True, color = 'lightgrey', edgecolor = 'black', bins=14)
ax[1].axvline(x=tperm2["statistic"], color='r')
ax[1].set_xlabel('tperm2 - statistic')
ax[1].set_ylabel('Density')
density = stats.gaussian_kde(tperm2["permutations"])
xs = np.linspace(-5,5,1000)
ax[1].plot(xs, density(xs))


fig.subplots_adjust(wspace=0.5)
plt.show()

The permutation test is so appealing because it can be applied very easily and also works for other samples (small and/or multimodal and/or skewed). To obtain a powerful test it may be necessary, though, to employ a more robust test statistic than the \(t\)-statistic.

In terms of code there is not much in the t_test() function (except the method label) that is specific to \(t\)-tests. Simply by re-defining the t_stat() function other test statistics can also be used, e.g., a difference in means (without scaling by the standard deviation), a difference in medians or trimmed means, etc. A slightly more elegant solution compared to re-defining the t_stat() function would be to add an argument stat = t_stat that takes a function for computing the test statistic.

Readers are encouraged to try out different versions of such a test by themselves. For example, they can explore different scalings for the \(t\)-statistic (pooled vs. 2-sample vs. no scaling) to see that this makes almost no difference for the permutation \(p\)-value.