Chapter 8 \(k\)-means clustering

8.1 Motivation

Idea: Instead of constructing a hierarchy of partitions, construct only one partition with a given number of \(k\) clusters directly.

Question: Which partition to choose out of all possible partitions into \(k\) clusters for \(n\) objects?

Answer: Adopt an objective function and choose that partition for which this is optimized.

Follow-up questions:

Which objective function should be used?
How should it be optimized? (Exhaustive search will only be feasible for small \(n\).)

Question: Which objective function should be used?

Answer: If data are quantitative, it is natural to use error sums of squares \(\mathit{SS}\). This is the sum of squared Euclidean distances of the individual observations from the corresponding cluster means.

Formally: Mean in cluster \(j\):

\[ \overline x_j \quad = \quad \frac{1}{|C_j|} \sum_{x \in C_j} x \]

Formally: Within sum of squares in cluster \(j\) and total within sum of squares.

\[\begin{eqnarray*} \mathit{WSS}(C_j) & = & \sum_{x \in C_j} d_2(x, \overline x_j)^2 \\ \mathit{SS}(\mathcal{C}) & = & \sum_{j = 1}^k \mathit{WSS}(C_j) \end{eqnarray*}\]

Jargon:

The \(k\) means \(\overline x_1, \dots, \overline x_k\) defining partition \(\mathcal{C}\) are also known as prototypes.
The partition \(\mathcal{C}\) (or equivalently: the corresponding \(k\) means), that minimizes the error sum of squares \(\mathit{SS}(\mathcal{C})\), is also known as \(k\)-means partition.

8.2 Algorithm

Question: In principle one go through all \(n \choose k\) possible partitions into \(k\) clusters for \(n\) objects. However, this is not feasible for large \(n\). Are there faster algorithms?

Answer: Yes, there are various algorithms for obtaining an approximative solution for the \(k\)-means problem such as the following.

Start with \(k\) randomly-selected means \(\overline x_j\).
Assign each observation \(x_i\) to the cluster \(j\) whose corresponding mean \(\overline x_j\) is closest.
Compute the new means \(\overline x_j\) by taking averages in each cluster \(j\).
Repeat steps 2 and 3 until the clusters do not change.

Problem: This algorithm is not guaranteed to find the global minimum for \(SS(\mathcal{C})\) but only a local minimum.

Example: Two different runs of \(k\)-means with different starting values for the scaled GSA data.

Remark: The following visualization of the partition uses the first two principal components. The sums of squares are computed in \(\mathbb{R}^8\) but are reflected well in the projection to \(\mathbb{R}^2\).

8.3 Interpretation

Due to different starting values different partitions are obtained. At least one of these has to be only a local minimum.

Here: The second partition \(\mathcal{C}_2\) is to be preferred because

\[\begin{eqnarray*} \mathit{SS}(\mathcal{C}_1) & = & 30.723 \mathit{SS}(\mathcal{C}_2) & = & 24.565 \end{eqnarray*}\]

In practice: Compute \(20\) (or \(50\) or \(\dots\)) partitions with different starting values and only keep the one with the smallest total within sum of squares.

Question: Which number of clusters \(k\) should be used?

Answer: Keep increasing the number of clusters \(k = 1, 2, 3, \dots\) as long as the total within sum of squares drops “clearly” for an additional cluster.

Follow-up question: What is “clearly”?

Answer: No universally accepted answer. Here, we judge this graphically based on the bar plot for the sums of squares.

Here: Choose at least \(k = 3\) clusters, maybe even \(4\).

Alternatively: There is a wide range of cluster indices that has been proposed in the literature. All of these make some form of trade-off between the number of clusters \(k\) (complexity) and homogeneity within / heterogeneity between the clusters (goodness of fit).

Then: Choose the number of clusters \(k\) that optimizes a given cluster index.

However: No universally accepted “best” cluster index. Hence not pursued in detail here.

8.4 Illustration

\(3\)-means solution for GSA data:

## K-means clustering with 3 clusters of sizes 5, 2, 8
## 
## Cluster means:
##   SA01.tennis SA02.cycle SA03.ride  SA05.swim SA17.shop SA18.concert SA19.sight
## 1    0.993413   1.131318  1.153165  0.8135672 -1.091354    -0.832119 -0.8122326
## 2   -1.305580  -1.421533 -0.833282 -1.8568762  1.188556     1.611043  1.9283186
## 3   -0.294488  -0.351691 -0.512408 -0.0442605  0.384957     0.117313  0.0255657
##   SA21.museum
## 1   -0.569453
## 2    1.868646
## 3   -0.111253
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

\(4\)-means solution for GSA data:

## K-means clustering with 4 clusters of sizes 5, 2, 5, 3
## 
## Cluster means:
##   SA01.tennis SA02.cycle SA03.ride SA05.swim SA17.shop SA18.concert SA19.sight
## 1  -0.4356596  -0.340845 -0.550756  0.161007  0.509226    -0.357827  -0.241757
## 2  -1.3055803  -1.421533 -0.833282 -1.856876  1.188556     1.611043   1.928319
## 3   0.9934126   1.131318  1.153165  0.813567 -1.091354    -0.832119  -0.812233
## 4  -0.0592015  -0.369766 -0.448494 -0.386373  0.177842     0.909215   0.471104
##   SA21.museum
## 1   -0.617494
## 2    1.868646
## 3   -0.569453
## 4    0.732482
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

8.6 Tutorial

As in previous chapters, the Guest Survey Austria data are used for illustration, considering the interest in \(p = 8\) summer activities aggregated within \(n = 15\) countries. See Chapter 3 for the data preprocessing, or download the data set directly: gsa-country.csv, gsa-country.rda.

8.6.1 Setup

R

load("gsa-country.rda")

Python

# Make sure that the required libraries are installed.
# Import the necessary libraries and classes:
import pandas as pd 
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from pca import pca

from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans

pd.set_option("display.precision", 4) # Set display precision to 4 digits in pandas and numpy.

Load the gsa-country.csv dataset. See Chapter 3 for the data preprocessing and the explanations for handling the index column.

gsa = pd.read_csv("gsa-country.csv", index_col = "Country", header = 0) 
gsa.head()

##                   SA01.tennis  SA02.cycle  ...  SA19.sight  SA21.museum
## Country                                    ...                         
## Austria (Vienna)       0.0380      0.1876  ...      0.1638       0.0575
## Austria (other)        0.0399      0.1995  ...      0.2112       0.0680
## Belgium                0.0134      0.1000  ...      0.3045       0.0866
## Denmark                0.0192      0.0806  ...      0.3608       0.0787
## France                 0.0190      0.0711  ...      0.4218       0.1754
## 
## [5 rows x 8 columns]

As argued before, the different popularity levels of the summer activities need to be accounted for, e.g., by explicit scaling to zero means and unit variance. Here, we do so by overwriting the gsa data with the scaled version for simplicity. Alternatively, an additional copy of the unscaled data could be kept as well, e.g., for characterizing the clusters resulting from the \(k\)-means analysis.

R

gsa <- scale(gsa)

Python

# Normalize data
gsa = (gsa-gsa.mean())/gsa.std()

8.6.2 Clustering

R

\(k\)-means clustering can be easily carried out in base R using the kmeans() function. It requires at least two arguments: The data matrix x (which needs to be scaled already if this is desired) and a specification of the centers at which the algorithm is started. This can either be a \(k \times p\) matrix with the initial cluster centers or simply the integer \(k\) with the desired number of clusters. In the latter case \(k\) distinct rows are chosen randomly from x. In addition some details of the iterative algorithm can be specified, see below for more details.

Here, we choose the initial cluster centers randomly and hence set a random seed first in order to make the results in this tutorial exactly reproducible.

set.seed(22)

Then we start out by 2-means clustering:

km2 <- kmeans(gsa, 2)
km2

## K-means clustering with 2 clusters of sizes 10, 5
## 
## Cluster means:
##   SA01.tennis SA02.cycle SA03.ride SA05.swim SA17.shop SA18.concert SA19.sight
## 1   -0.496706  -0.565659 -0.576582 -0.406784  0.545677     0.416059   0.406116
## 2    0.993413   1.131318  1.153165  0.813567 -1.091354    -0.832119  -0.812233
##   SA21.museum
## 1    0.284727
## 2   -0.569453
## 
## Clustering vector:
## Austria (Vienna)  Austria (other)          Belgium          Denmark           France 
##                2                2                1                1                1 
##          Germany          Hungary            Italy      Netherlands            Spain 
##                2                2                1                1                1 
##           Sweden      Switzerland               UK              USA            other 
##                1                2                1                1                1 
## 
## Within cluster sum of squares by cluster:
## [1] 48.89445  9.66128
##  (between_SS / total_SS =  47.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Python

k-means clustering in python using the scikit-learn library.

km2 = KMeans(n_clusters=2, init='random', random_state=22).fit(gsa)

gsa.reset_index(inplace=True);
gsa.insert(loc=1, column="Labels", value=km2.labels_)

gsa.set_index(["Country", "Labels"], inplace=True) # Set "Country"and "Labels" as index
gsa.head()

##                          SA01.tennis  SA02.cycle  ...  SA19.sight  SA21.museum
## Country          Labels                           ...                         
## Austria (Vienna) 1            1.4187      1.3994  ...     -1.3495      -0.8244
## Austria (other)  1            1.5783      1.6138  ...     -1.0711      -0.7235
## Belgium          1           -0.6094     -0.1900  ...     -0.5242      -0.5451
## Denmark          1           -0.1330     -0.5416  ...     -0.1936      -0.6207
## France           0           -0.1526     -0.7144  ...      0.1639       0.3076
## 
## [5 rows x 8 columns]

print("Cluster means: \n", gsa.groupby(["Labels"]).mean())

## Cluster means: 
##          SA01.tennis  SA02.cycle  ...  SA19.sight  SA21.museum
## Labels                           ...                         
## 0           -0.5987     -0.8214  ...      0.9111       0.9056
## 1            0.3992      0.5476  ...     -0.6074      -0.6037
## 
## [2 rows x 8 columns]

This shows that one cluster with Austria/Germany/Hungary/Switzerland is found that was also already detected in with Hierarchical clustering and visualized in the Principal component analysis. All remaining countries form the second cluster.

Note that the reported cluster means pertain to the scaled data. Thus, for the Austria etc. cluster there is an above-average interest in the sports activities (around 1 standard deviation above zero) and a below-average interest in cultural activities (0.5-1 standard deviations below zero).

Finally, almost half (47.7%) of the variance of the overall data is captured by splitting the data into two clusters. The corresponding sums of squares are also accessible as components of the returned model: As these data are scaled to unit variances, the total sum of squares (totss) is \((n - 1) \cdot p = 14 \cdot 8 = 112\). The within sum of squares (withinss) in each of the two clusters is 48.894 and 9.661, yielding a total within sum of squares (tot.withinss) of 58.556.

R

km2$totss

## [1] 112

km2$withinss

## [1] 48.89445  9.66128

km2$tot.withinss

## [1] 58.5557

km2$betweenss

## [1] 53.4443

Python

Total sum of squares

We show 3 alternative ways of computing the totss.

# The total sum of square coincide with the total (within) sum of squares for 1 cluster
km2.totss = ((gsa-gsa.mean())**2).values.sum() # sqeuclidean -> squared euclidean distance
km1 = KMeans(n_clusters=1, init='random', random_state=22).fit(gsa)

print(km2.totss, km1.inertia_)

## 112.00000000000001 112.0

# The following only works if data are scaled to zero mean and unit variances 
totss = (len(gsa) - 1) * (len(gsa.columns))
print(totss)

## 112

Between cluster sum of squares

print(km2.totss - km2.inertia_) # betweenss

## 54.25916754120245

Number of observations for each cluster

counts = np.bincount(km2.labels_)

print(counts)

## [6 9]

Within SS for each cluster

for i in np.unique(km2.labels_):
    withinss = ((gsa.iloc[km2.labels_==i,:]-km2.cluster_centers_[i])**2).values.sum()
    print("withinss of cluster {}: {}".format(i, withinss))

## withinss of cluster 0: 23.72087018019008
## withinss of cluster 1: 34.01996227860748

print(km2.inertia_) # total (within) sum of squares (ss)

## 57.74083245879756

Thus, the total within sum of squares (tot.withinss) is a bit more than half of the original totss, whereas the remaining betweenss is captured by splitting up the data into two clusters.

R

However, recall that the \(k\)-means algorithm is an iterative approach that tries to minimize the total within sum of squares but may yield a local minimum only. In other words we have to check whether 58.556 really is the minimum sum of squares we can obtain for a 2-means solution. Therefore, we run a couple of further 2-means clusterings and extract the total within sum of squares:

kmeans(gsa, 2)$tot.withinss

## [1] 57.7408

kmeans(gsa, 2)$tot.withinss

## [1] 57.7408

kmeans(gsa, 2)$tot.withinss

## [1] 57.7408

This shows that our initial solution of 58.556 was, in fact, a local optimum only. To avoid such cases, one often runs a sufficiently large number (say 50 or 100 or …) of \(k\)-means clusterings with random initializations and only retains the best solution in terms of the within sum of squares. The kmeans() function offers the nstart argument for this so that we can easily do:

km2 <- kmeans(gsa, 2, nstart = 50)
km2$tot.withinss

## [1] 57.7408

There is no guarantee that this is indeed the global minimum. In this small sample, though, it would be possible to run the exhaustive search over all \(15 \choose 2 = 105\) binary partitions and show that there indeed is no better solution. However, more generally if \(n\) is larger, it quickly becomes prohibitively expensive (or even impossible) to search all possible partitions. Then using the best out of nstart trials is typically the best that can be done and one only assures that further runs don’t yield substantial changes anymore.

Choose a suitable number \(k\) of clusters

R

In R, run kmeans() for \(k = 1, \dots, 7\):

km1 <- kmeans(gsa, 1, nstart = 50)
km3 <- kmeans(gsa, 3, nstart = 50)
km4 <- kmeans(gsa, 4, nstart = 50)
km5 <- kmeans(gsa, 5, nstart = 50)
km6 <- kmeans(gsa, 6, nstart = 50)
km7 <- kmeans(gsa, 7, nstart = 50)

Python

km1 = KMeans(n_clusters=1, init='random', n_init=50, random_state=22).fit(gsa)
km3 = KMeans(n_clusters=3, init='random', n_init=50, random_state=22).fit(gsa)
km4 = KMeans(n_clusters=4, init='random', n_init=50, random_state=22).fit(gsa)
km5 = KMeans(n_clusters=5, init='random', n_init=50, random_state=22).fit(gsa)
km6 = KMeans(n_clusters=6, init='random', n_init=50, random_state=22).fit(gsa)
km7 = KMeans(n_clusters=7, init='random', n_init=50, random_state=22).fit(gsa)

Then we collect the associated total within sum of squares and visualize them in a simple bar plot:

R

wss <- c(km1$tot.withinss, km2$tot.withinss, km3$tot.withinss,
  km4$tot.withinss, km5$tot.withinss, km6$tot.withinss, km7$tot.withinss)
names(wss) <- 1:7
barplot(wss)

Python

withinss = [km1.inertia_, km2.inertia_, km3.inertia_, km4.inertia_, km5.inertia_, km6.inertia_, km7.inertia_]
names = [*range(1,8)]
wss = pd.DataFrame({'withinss': withinss, 'names': names})

wss.plot.bar(x='names', y='withinss', rot=0)
plt.show()

This shows that in relative terms the total within sum of squares drops to about a half for the 2-means solutions and less than a third for the 3-means solutions. For higher \(k\) the total within sum of squares drops only relatively little.

R

wss/wss[1]

##        1        2        3        4        5        6        7 
## 1.000000 0.515543 0.294635 0.219326 0.174473 0.142557 0.118122

Python

print(wss['withinss']/wss['withinss'][0])

## 0    1.0000
## 1    0.5155
## 2    0.2946
## 3    0.2193
## 4    0.1772
## 5    0.1426
## 6    0.1260
## Name: withinss, dtype: float64

The relative changes compared to the total sum of squares can also be computed as:

R

100 * (wss[2:7] - wss[1:6])/wss[1]

##         2         3         4         5         6         7 
## -48.44569 -22.09081  -7.53090  -4.48528  -3.19161  -2.44350

Python

print(100 * (wss['withinss'][1:7].values - wss['withinss'][0:6].values)/wss['withinss'][0])

## [-48.4456853  -22.09081116  -7.53089987  -4.20953199  -3.46735335
##   -1.65349234]

Thus, a 3-means clustering appears to be sufficient here as changes for higher \(k\) are rather small.

8.6.3 Interpreting the clusters

The 3-means result in fact yields the same clusters as the Hierarchical clustering in Chapter 7.

R

km3$cluster

## Austria (Vienna)  Austria (other)          Belgium          Denmark           France 
##                3                3                2                2                2 
##          Germany          Hungary            Italy      Netherlands            Spain 
##                3                3                2                2                1 
##           Sweden      Switzerland               UK              USA            other 
##                2                3                2                1                2

Python

gsa.reset_index(inplace=True);
gsa["Labels"] = km3.labels_
gsa.set_index(["Country", "Labels"], inplace=True)

gsa

##                          SA01.tennis  SA02.cycle  ...  SA19.sight  SA21.museum
## Country          Labels                           ...                         
## Austria (Vienna) 2            1.4187      1.3994  ...     -1.3495      -0.8244
## Austria (other)  2            1.5783      1.6138  ...     -1.0711      -0.7235
## Belgium          1           -0.6094     -0.1900  ...     -0.5242      -0.5451
## Denmark          1           -0.1330     -0.5416  ...     -0.1936      -0.6207
## France           1           -0.1526     -0.7144  ...      0.1639       0.3076
## Germany          2           -0.2285      0.8247  ...     -0.5490      -0.5745
## Hungary          2            0.3238      0.8900  ...     -0.0167       0.1558
## Italy            1            0.6322      0.1634  ...      0.3931       1.0689
## Netherlands      1            0.0794      0.6803  ...     -0.5231      -0.6763
## Spain            0           -1.3853     -1.4897  ...      2.4629       2.5119
## Sweden           1           -0.7117     -0.6766  ...     -0.1642      -0.7441
## Switzerland      2            1.8748      0.9287  ...     -1.0749      -0.8807
## UK               1           -0.8036     -0.9762  ...      0.1964      -0.5013
## USA              0           -1.2259     -1.3534  ...      1.3937       1.2254
## other            1           -0.6573     -0.5584  ...      0.8563       0.8209
## 
## [15 rows x 8 columns]

Therefore, the same analysis of these clusters as in Chapter 7.5.3 could be carried out here. However, here we only point out one additional nice aspect: The color-coded plot of the first principal components shown in Chapter 7.5.3 can be complemented by a projection of the cluster means:

R

In R, by using the predict() method of prcomp objects.

gsa.pca <- prcomp(gsa, scale = TRUE)
plot(predict(gsa.pca)[,1:2], type = "n",
  xlim = c(-4.5, 4.5), ylim = c(-4.5, 4.5))
text(predict(gsa.pca)[,1:2], rownames(gsa), col = km3$cluster)
points(predict(gsa.pca, km3$centers)[,1:2], col = 1:3, pch = 3, cex = 3)

Python

gsa_pca = pca(n_components=8, normalize=False)
pca_results = gsa_pca.fit_transform(gsa.reset_index(drop=True))

## [pca] >Extracting column labels from dataframe.
## [pca] >Extracting row labels from dataframe.
## [pca] >The PCA reduction is performed on the [8] columns of the input dataframe.
## [pca] >Fit using PCA.
## [pca] >Compute loadings and PCs.
## [pca] >Compute explained variance.
## [pca] >Outlier detection using Hotelling T2 test with alpha=[0.05] and n_components=[8]
## [pca] >Multiple test correction applied for Hotelling T2 test: [fdr_bh]
## [pca] >Outlier detection using SPE/DmodX with n_std=[3]

pc = gsa_pca.results['PC'].iloc[:, 0:2]

fig = plt.figure()
ax = plt.axis([-5.5, 5.5, -5.5, 5.5])
colors = plt.cm.get_cmap("rainbow", 3)

## <string>:1: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed in 3.11. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap()`` or ``pyplot.get_cmap()`` instead.

countries = gsa.index.levels[0]
for index, row in pc.iterrows():
    color = colors(km3.labels_[index])
    plt.text(row["PC1"], row["PC2"], countries[index], {"color": color})

print(km3.cluster_centers_)

## [[-1.30558032 -1.42153324 -0.83328176 -1.85687623  1.18855647  1.61104305
##    1.92831865  1.86864579]
##  [-0.29448778 -0.35169052 -0.5124076  -0.04426046  0.38495692  0.11731339
##    0.02556571 -0.11125325]
##  [ 0.99341258  1.13131813  1.15316487  0.81356723 -1.09135367 -0.83211864
##   -0.81223259 -0.56945311]]

proj_cluster_means = gsa_pca.transform(np.array(km3.cluster_centers_))

## [pca] >Column labels are auto-completed.
## [pca] >Outlier detection using Hotelling T2 test with alpha=[0.05] and n_components=[8]
## [pca] >Multiple test correction applied for Hotelling T2 test: [fdr_bh]
## [pca] >Outlier detection using SPE/DmodX with n_std=[3]

for index in range(0, proj_cluster_means["PC1"].values.shape[0]):    
    m1 = proj_cluster_means["PC1"].values[index]
    m2 = proj_cluster_means["PC2"].values[index]
    plt.plot(m1, m2, color=colors(index), marker="+", markersize=16)

plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()

This shows that also in the projection on the first two principal components the cluster means are nicely in the center of their respective cluster.

To facilitate the creation of these plots, we took the code above, and made it slightly more flexible.

R

Additionally, in R we integrated it into a plot() method for kmeans objects. Download the plot.kmeans.R file in your working directory.

Our plot() method for kmeans objects requires two arguments: the fitted kmeans() output and the corresponding original data (in our case the scaled Guest Survey Austria data). Using this function we can easily compare the optimal 3-means to the optimal 4-means result:

source("plot.kmeans.R")
plot(km3, gsa)
plot(km4, gsa)

Python

def plot_kmeans(km, data):
    """
    @ToDo
    """
    gsa_pca = pca(n_components=None, normalize=False)
    pca_results = gsa_pca.fit_transform(data.iloc[:, 2:])
    pc = gsa_pca.results['PC'].iloc[:, 0:2]
    
    n_clusters = km.get_params()["n_clusters"]
    
    fig = plt.figure()
    plt.axis([-5.5, 5.5, -5.5, 5.5])
    colors = plt.cm.get_cmap("rainbow", n_clusters)
    
    for index, row in pc.iterrows():
        color = colors(km.labels_[index])
        plt.text(row["PC1"], row["PC2"], data["Country"][index], {"color": color})

    proj_cluster_means = gsa_pca.transform(np.array(km.cluster_centers_)) 
    for index in range(0, n_clusters):    
        m1 = proj_cluster_means["PC1"].values[index]
        m2 = proj_cluster_means["PC2"].values[index]
        plt.plot(m1, m2, color=colors(index), marker="+", markersize=16)

    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.show()

plot_kmeans(km3, gsa.reset_index())

## [pca] >n_components is set to 7
## [pca] >Extracting column labels from dataframe.
## [pca] >Extracting row labels from dataframe.
## [pca] >The PCA reduction is performed on the [8] columns of the input dataframe.
## [pca] >Fit using PCA.
## [pca] >Compute loadings and PCs.
## [pca] >Compute explained variance.
## [pca] >Outlier detection using Hotelling T2 test with alpha=[0.05] and n_components=[7]
## [pca] >Multiple test correction applied for Hotelling T2 test: [fdr_bh]
## [pca] >Outlier detection using SPE/DmodX with n_std=[3]
## <string>:13: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed in 3.11. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap()`` or ``pyplot.get_cmap()`` instead.
## [pca] >Column labels are auto-completed.
## [pca] >Outlier detection using Hotelling T2 test with alpha=[0.05] and n_components=[7]
## [pca] >Multiple test correction applied for Hotelling T2 test: [fdr_bh]
## [pca] >Outlier detection using SPE/DmodX with n_std=[3]

plot_kmeans(km4, gsa.reset_index())

## [pca] >n_components is set to 7
## [pca] >Extracting column labels from dataframe.
## [pca] >Extracting row labels from dataframe.
## [pca] >The PCA reduction is performed on the [8] columns of the input dataframe.
## [pca] >Fit using PCA.
## [pca] >Compute loadings and PCs.
## [pca] >Compute explained variance.
## [pca] >Outlier detection using Hotelling T2 test with alpha=[0.05] and n_components=[7]
## [pca] >Multiple test correction applied for Hotelling T2 test: [fdr_bh]
## [pca] >Outlier detection using SPE/DmodX with n_std=[3]
## <string>:13: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed in 3.11. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap()`` or ``pyplot.get_cmap()`` instead.
## [pca] >Column labels are auto-completed.
## [pca] >Outlier detection using Hotelling T2 test with alpha=[0.05] and n_components=[7]
## [pca] >Multiple test correction applied for Hotelling T2 test: [fdr_bh]
## [pca] >Outlier detection using SPE/DmodX with n_std=[3]

Up to the cluster assignment of the “other” country level, this is again essentially what hierarchical clustering yielded. The corresponding means of the scaled data are:

R

km3$centers

##   SA01.tennis SA02.cycle SA03.ride  SA05.swim SA17.shop SA18.concert SA19.sight
## 1   -1.305580  -1.421533 -0.833282 -1.8568762  1.188556     1.611043  1.9283186
## 2   -0.294488  -0.351691 -0.512408 -0.0442605  0.384957     0.117313  0.0255657
## 3    0.993413   1.131318  1.153165  0.8135672 -1.091354    -0.832119 -0.8122326
##   SA21.museum
## 1    1.868646
## 2   -0.111253
## 3   -0.569453

km4$centers

##   SA01.tennis SA02.cycle SA03.ride SA05.swim SA17.shop SA18.concert SA19.sight
## 1   0.9934126   1.131318  1.153165  0.813567 -1.091354    -0.832119  -0.812233
## 2  -1.3055803  -1.421533 -0.833282 -1.856876  1.188556     1.611043   1.928319
## 3  -0.0592015  -0.369766 -0.448494 -0.386373  0.177842     0.909215   0.471104
## 4  -0.4356596  -0.340845 -0.550756  0.161007  0.509226    -0.357827  -0.241757
##   SA21.museum
## 1   -0.569453
## 2    1.868646
## 3    0.732482
## 4   -0.617494

Python

print(km3.cluster_centers_)

## [[-1.30558032 -1.42153324 -0.83328176 -1.85687623  1.18855647  1.61104305
##    1.92831865  1.86864579]
##  [-0.29448778 -0.35169052 -0.5124076  -0.04426046  0.38495692  0.11731339
##    0.02556571 -0.11125325]
##  [ 0.99341258  1.13131813  1.15316487  0.81356723 -1.09135367 -0.83211864
##   -0.81223259 -0.56945311]]

print(km4.cluster_centers_)

## [[-0.43565956 -0.3408453  -0.55075583  0.16100713  0.50922588 -0.35782735
##   -0.24175749 -0.61749425]
##  [-0.05920148 -0.36976589 -0.44849388 -0.3863731   0.177842    0.90921461
##    0.47110438  0.73248174]
##  [ 0.99341258  1.13131813  1.15316487  0.81356723 -1.09135367 -0.83211864
##   -0.81223259 -0.56945311]
##  [-1.30558032 -1.42153324 -0.83328176 -1.85687623  1.18855647  1.61104305
##    1.92831865  1.86864579]]

Chapter 8 \(k\)-means clustering

8.1 Motivation

8.2 Algorithm

8.3 Interpretation

8.4 Illustration

8.5 Related methods

8.6 Tutorial

8.6.1 Setup

R

Python

R

Python

8.6.2 Clustering

R

Python

R

Python

R

R

Python

R

Python

R

Python

R

Python

8.6.3 Interpreting the clusters

R

Python

R

Python

R

Python

R

Python