Chapter 4.2 Illustrative application
readme.html
and run
4-0_ChapterSetup.R
. This will create
4-0_ChapterSetup.RData
in the sub folder
data/R
. This file contains the data required to produce the
plots shown below.legend_large_box
to
your environment in order to render the tweaked version of the legend
described below. You find this file in the source
folder of
the unzipped Chapter 4 archive.LoadInstallPackages.R
# assuming you are working within .Rproj environment
library(here)
# install (if necessary) and load other required packages
source(here("source", "load_libraries.R"))
# load environment generated in "4-0_ChapterSetup.R"
load(here("data", "R", "4-0_ChapterSetup.RData"))
In chapter 4.2, we apply hierarchical and partitional clustering to family formation sequences. The data come from a sub-sample of the German Family Panel - pairfam. For further information on the study and on how to access the full scientific use file see here.
We apply a hierarchical cluster analysis by using the command
?hclust
to the dissimilarity matrix
partner.child.year.om
for the family formation sequences,
computed based on OM with indel
=1 and sm
=2. We
use non-squared dissimilarities (see the method
option) and
weights (see the members
option, where we have to specify
to which data.frame
the vector with the weights belongs
to).
<- hclust(as.dist(partner.child.year.om),
fam.ward1 method = "ward.D",
members = family$weight40)
One can combine different heuristics to select the number of
clusters. One is the so-called “elbow” method, based on a line graph
which can be obtained by using the ?fviz_nbclust
command.
Notice that we specify here the method option (the method to be used for
estimating the optimal number of clusters) as wss
, that is
the total within sum of square. The input of the command is the
dissimilarity matrix partner.child.year.om
:
fviz_nbclust(partner.child.year.om,
FUN = hcut,
method = "wss",
barfill = "black",
barcolor = "black",
linecolor = "black")
dev.off()
As an alternative or to be used in combination with the elbow method
(please refer Chapter 4.2 in the book), we can use the
?as.clustrange
command by using the ?hclust
results (here stored in the object fam.ward1
) as input.
?as.clustrange
will return a series of cluster quality
indicators. Notice that we have to specify the diss
option
by using the dissimilarity matrix of interest and the number of clusters
we want the command to “test” (here 10 in the ncluster
option)
<- as.clustrange(fam.ward1,
fam.ward.test diss = partner.child.year.om,
weights =family$weight40,
ncluster = 10)
Let’s look at the cluster quality indicators
fam.ward.test
PBC HG HGSD ASW ASWw CH R2 CHsq R2sq HC
cluster2 0.25 0.30 0.29 0.18 0.18 323.97 0.12 565.24 0.19 0.34
cluster3 0.42 0.51 0.50 0.21 0.21 293.92 0.20 558.60 0.32 0.25
cluster4 0.53 0.69 0.68 0.23 0.23 250.58 0.24 515.74 0.40 0.17
cluster5 0.56 0.74 0.72 0.23 0.23 218.61 0.27 464.61 0.44 0.15
cluster6 0.49 0.69 0.68 0.16 0.17 193.78 0.29 407.00 0.47 0.18
cluster7 0.51 0.72 0.71 0.18 0.18 178.04 0.31 390.32 0.50 0.17
cluster8 0.53 0.77 0.75 0.18 0.18 169.58 0.34 385.53 0.54 0.15
cluster9 0.52 0.81 0.79 0.16 0.17 167.33 0.36 390.71 0.57 0.14
cluster10 0.52 0.83 0.81 0.16 0.17 159.92 0.38 383.55 0.60 0.13
We can also visualize the trends in the indicators. Here some selected cluster quality indicators, visualized as z-scores
plot(fam.ward.test, stat = c("ASW", "HG", "PBC", "CH", "R2"),
lty = 1,
norm = "zscore",
main = "CQIs, adjusted by mean and standard deviation (z-scores)",
lwd = 5,
ylab="z-scores")
dev.off()
Based on considerations presented in the book, we extract the
5-cluster solution. This can be done with the ?cutree
command, specifying the ?hclust
-generate object and the
k
number of clusters to be extracted. We store this
information into an object called fam.ward.5cl
.5cl <-cutree(fam.ward1, k = 5) fam.ward
…that we add to the main data.frame
family
:
$fam.ward.5cl<-fam.ward.5cl family
Because we want to illustrate how to compare different cluster solutions, we also extract 4 clusters and repeat the procedure
.4cl <-cutree(fam.ward1, k = 4)
fam.ward
$fam.ward.4cl<-fam.ward.4cl family
We can now cross-tabulate the 4- and 5-cluster solutions storing the table into an object…
<-table(fam.ward.5cl, fam.ward.4cl) comp.ward
… that we can print at our convenience:
comp.ward
fam.ward.4cl
fam.ward.5cl 1 2 3 4
1 289 0 0 0
2 0 353 0 0
3 0 147 0 0
4 0 0 386 0
5 0 0 0 691
We use the ?wcKMedRange
command to obtain the cluster
quality indicators for a number of clusters between 2 and 10 (see the
kvals
option) using the PAM algorithm. We also specify
weights (see weights
option). Further, we specify the
initialclust
option by using the
?hclust
-generated object we used above
<- wcKMedRange(partner.child.year.om,
fam.pam.ward weights = family$weight40,
kvals = 2:10,
initialclust = fam.ward1)
Let’s consider the cluster quality criteria for this PAM clustering
fam.pam.ward
PBC HG HGSD ASW ASWw CH R2 CHsq R2sq HC
cluster2 0.43 0.52 0.50 0.22 0.23 348.56 0.13 676.34 0.22 0.24
cluster3 0.49 0.61 0.60 0.24 0.24 318.83 0.21 642.69 0.35 0.20
cluster4 0.56 0.72 0.71 0.25 0.25 272.71 0.26 577.80 0.43 0.15
cluster5 0.57 0.77 0.76 0.25 0.25 237.22 0.29 521.18 0.47 0.13
cluster6 0.48 0.71 0.69 0.19 0.19 212.05 0.31 462.26 0.50 0.18
cluster7 0.49 0.75 0.73 0.20 0.20 200.27 0.34 450.94 0.54 0.16
cluster8 0.50 0.78 0.77 0.21 0.21 194.91 0.37 456.15 0.58 0.15
cluster9 0.50 0.81 0.79 0.20 0.20 187.59 0.39 448.61 0.61 0.14
cluster10 0.52 0.87 0.85 0.23 0.24 184.81 0.42 471.62 0.64 0.11
We first extract a 4-cluster solution
.4cl<-fam.pam.ward$clustering$cluster4 fam.pam.ward
… attach the vector with the 4-cluster solution to the main
data.frame
:
$fam.pam.ward.4cl<-fam.pam.ward.4cl family
… and re-label clusters from 1 to 4 instead of medoid identifier by
using the ?recode
command:
$fam.pam.ward.4cl<-car::recode(family$fam.pam.ward.4cl,
family"1532=1; 1664=2; 1643=3; 985=4")
We then extract the 5-cluster solution with the same procedure:
.5cl<-fam.pam.ward$clustering$cluster5
fam.pam.ward
$fam.pam.ward.5cl<-fam.pam.ward.5cl
family
$fam.pam.ward.5cl<-car::recode(family$fam.pam.ward.5cl,
family"982=1; 790=2; 373=3; 1643=4; 985=5")
We want to compare the 4 and 5-cluster solutions for PAM+Ward
clustering. We can use the ?table
command and store the
table into an object…
<-table(family$fam.pam.ward.5cl, family$fam.pam.ward.4cl) comp.pam.ward
…to be printed at our convenience
comp.pam.ward
1 2 3 4
1 235 12 2 43
2 54 240 11 21
3 7 232 14 6
4 2 4 266 0
5 0 23 0 694
We now compute the average silhouette width by cluster: we can use
the ?silhouette
command and specify the dissimilarity
matrix in the dmatrix
option. We do that for the 4- and the
5-cluster solutions. As usual, we store the results into an object…
.5cl <- silhouette(family$fam.pam.ward.5cl,
silh.pam.warddmatrix = partner.child.year.om)
.4cl <- silhouette(family$fam.pam.ward.4cl,
silh.pam.warddmatrix = partner.child.year.om)
…that can be plotted:
par(mfrow=c(1,2))
plot(silh.pam.ward.4cl, main = "(a) four-cluster solution",
col="grey", border=NA)
plot(silh.pam.ward.5cl, main = "(b) five-cluster solution",
col="grey", border=NA)
dev.off()
We now proceed with the 5-cluster solution and generate substantively meaningful label for the clusters. Notice that to do this, we need to first visualize the clusters and ideally explore each of them with the descriptive tools presented in Chapter 2
.5cl <- c("Early parenthood in cohabitation",
fam.pam.ward.lab"LAT and long-lasting cohabitation without children",
"Early marriage with 1 child",
"Long-lasting singleness and childlessness",
"Early marriage with 2+ children")
We attach the labels to the clusters by generating a factor variable:
beware of the order of the clusters (option levels
) and of
the labels (option labels
which is specified with the
object created above), as it has to correspond!
.5cl <- factor(family$fam.pam.ward.5cl,
fam.pam.ward.factorlevels = c(1,2,3,4,5),
labels=fam.pam.ward.lab.5cl)
We can now attach the cluster-vector (with labels) to the main
data.frame
:
$fam.pam.ward.factor.5cl<-fam.pam.ward.factor.5cl family
To be sure, one might want to confirm that family with the newly attached vectors is a whole data frame
<-data.frame(family) family
Among the many options for visualization introduced in Chapter 2, we
use the relative frequency plot. The standard ?seqplot.rf
command for this graph produces a combined figure: the sequences and the
distance-to-medoid plot. One can also plot only one of the two by
specifying the option which.plot
. In the archive
Chapter04.zip in the download area of this webpage, you can find a code
to efficiently plot only the sequences part of the relative frequency
plot especially if you want to plot the clusters in the same figure.
Assuming you are working within .Rproj
environment, you can
use the following code
source(here("source", "rfplotsleft.R"))
Notice that we recommend to always display the distance-to-medoid
plot for each cluster at least in the appendix of your article: an
example of this can be found in Struffolino and Van Winkle 2021
(open-access version here).
For a nice display of the clusters using the sourced code
rfplotsleft.R
, we first generate separate objects
containing the sequences allocated to each clusters:
<-(partner.child.year.seq[family$fam.pam.ward.factor.5cl=="Early parenthood in cohabitation",1:22])
cl1_5cl
<-(partner.child.year.seq[family$fam.pam.ward.factor.5cl=="LAT and long-lasting cohabitation without children",1:22])
cl2_5cl
<-(partner.child.year.seq[family$fam.pam.ward.factor.5cl=="Early marriage with 1 child",1:22])
cl3_5cl
<-(partner.child.year.seq[family$fam.pam.ward.factor.5cl=="Long-lasting singleness and childlessness",1:22])
cl4_5cl
<-(partner.child.year.seq[family$fam.pam.ward.factor.5cl=="Early marriage with 2+ children",1:22]) cl5_5cl
We then compute the dissimilarity matrix for each separate cluster:
<- seqdist(cl1_5cl, method="OM", indel=1, sm= "CONSTANT")
cl1_5cl.om
<- seqdist(cl2_5cl, method="OM", indel=1, sm= "CONSTANT")
cl2_5cl.om
<- seqdist(cl3_5cl, method="OM", indel=1, sm= "CONSTANT")
cl3_5cl.om
<- seqdist(cl4_5cl, method="OM", indel=1, sm= "CONSTANT")
cl4_5cl.om
<- seqdist(cl5_5cl, method="OM", indel=1, sm= "CONSTANT") cl5_5cl.om
In preparation for the graph, we generate labels for x-axis, 22 time-points in steps of 2, labeled with numbers from 18 to 40, that is the age span our sequences cover
<- seq(from = 0, to = 22, by = 2)
count <- seq(from = 18, to = 40, by = 2) years
We now code the combined relative frequency plot (only the sequences
part) to display all clusters at once: for each cluster, we set 50
medoid sequences. Notice that here we use the seqplot.rf.l
command that is included in the sourced code rfplotsleft.R
.
Note that the line below the x-axis label “Age” of each cluster plot
reports the R2 and F-stat values generated by running the standard
?seqplot.rf
command. It is your choice to report them, but
we recommend to do so. Unfortunately, for now our solution is to impute
this information manually as you can see in the code below:
<- par(no.readonly = TRUE)
def.par
par(oma = c(0, 2, 2, 0))
<- matrix(c(1,2,3,4,5,6), 3, 2, byrow = TRUE)
m <- layout(mat = m, heights = c(0.8,0.8,0.8))
nf layout.show(nf)
par(mar = c(5, 3, 3, 3))
seqplot.rf.l(cl1_5cl, diss=cl1_5cl.om,
k=50, xlab="",cex.main=1.6,
title="1.Early parenthood in cohabitation (15.7%)",
ylab=FALSE,
cex.lab=1.4, axes=FALSE)
mtext("Age",side = 1, line = 2.5, cex=0.9)
mtext("Medoid sequences",side = 2, line = 1.5, cex=0.9)
mtext("R2=0.27 / F-stat=1.79",side = 1, line = 4, cex=0.7)
axis(1, at = count, labels = years, font = 1, cex.axis = 1, lwd = 1)
seqplot.rf.l(cl2_5cl, diss=cl2_5cl.om,
k=50, xlab="",cex.main=1.6,
title="2.LAT/long-lasting cohabitation without children (17.5%)",
ylab=FALSE,
cex.lab=1.4, axes=FALSE)
mtext("Age",side = 1, line = 2.5, cex=0.9)
mtext("Medoid sequences",side = 2, line = 1.5, cex=0.9)
mtext("R2=0.30 / F-stat=2.40",side = 1, line = 4, cex=0.7)
axis(1, at = count, labels = years, font = 1, cex.axis = 1, lwd = 1)
seqplot.rf.l(cl3_5cl, diss=cl3_5cl.om,
k=50, xlab="",cex.main=1.6,
title="3.Early marriage with 1 child (13.9%)",
ylab=FALSE,
cex.lab=1.4, axes=FALSE)
mtext("Age",side = 1, line = 2.5, cex=0.9)
mtext("Medoid sequences",side = 2, line = 1.5, cex=0.9)
mtext("R2=0.45 / F-stat=3.47",side = 1, line = 4, cex=0.7)
axis(1, at = count, labels = years, font = 1, cex.axis = 1, lwd = 1)
seqplot.rf.l(cl4_5cl, diss=cl4_5cl.om,
k=50, xlab="",cex.main=1.6,
title="4.Long-lasting singleness and childlessness (14.6%)",
ylab=FALSE,
cex.lab=1.4, axes=FALSE)
mtext("Age",side = 1, line = 2.5, cex=0.9)
mtext("Medoid sequences",side = 2, line = 1.5, cex=0.9)
mtext("R2=0.32 / F-stat=2.13",side = 1, line = 4, cex=0.7)
axis(1, at = count, labels = years, font = 1, cex.axis = 1, lwd = 1)
seqplot.rf.l(cl5_5cl, diss=cl5_5cl.om,
k=50, xlab="",cex.main=1.6,
title="5.Early marriage with 2+ children (38.4%)",
ylab=FALSE,
cex.lab=1.4, axes=FALSE)
mtext("Age",side = 1, line = 2.5, cex=0.9)
mtext("Medoid sequences",side = 2, line = 1.5, cex=0.9)
mtext("R2=0.35 / F-stat=7.26",side = 1, line = 4, cex=0.7)
axis(1, at = count, labels = years, font = 1, cex.axis = 1, lwd = 1)
plot(1, type = "n", axes=FALSE, xlab="", ylab="")
legend(x = "center",inset = 0,
legend = longlab.partner.child,
col=colspace.partner.child,
lwd=10, cex=1.3, ncol=1)
dev.off()
We use the ?cmdscale
to calculate the multidimensional
scaling of the dissimilarity matrix partner.child.year.om
.
We the k
option (the maximum dimension of the space which
the data are to be represented in) at 2
<-cmdscale(partner.child.year.om, k = 2) mds.year.om
For illustrative purposes, we extract 3 clusters based on the hierarchical clustering (Ward) and the PAM clustering following the procedure presented in detail above
# Ward
<- hclust(as.dist(partner.child.year.om),
mds_ward method = "ward.D",
members = family$weight40)
# Cut the clustering at 3
<-cutree(mds_ward, k = 3)
mds_ward_3
#PAM
<- wcKMedoids(partner.child.year.om, k = 3)
mds_pam
# Extract the 3 clusters
<- mds_pam$clustering
mds_pam
# Re-label the clusters 1 to 3 instead of medois id
<- car::recode(mds_pam, "156=1; 985=2; 1735=3") mds_pam
We can now use the objects mds.year.om
,
mds_ward
, mds_pam
to visualize the
distribution of the sequences in the samples by cluster (extracted by
using two different clustering method) in a bi-dimensional space
par(mfrow=c(1,2))
par(mar = c(7, 7, 3, 3))
plot(mds.year.om, type = "n",
main="Hierachical clustering (Ward's linkage)",
ylab="Coordinate 2",
xlab="Coordinate 1",
cex.lab=1.7,
cex.main=1.7)
points(mds.year.om[mds_ward_3 == 1, ], pch = 19, col = "black")
points(mds.year.om[mds_ward_3 == 2, ], pch = 21, col = "black")
points(mds.year.om[mds_ward_3 == 3, ], pch = 24, col = "black")
legend("bottomright", pch = c(19,21,24), legend = c("cluster 1", "cluster 2", "cluster 3"), cex=1.6)
plot(silh.pam.ward.5cl, main = "(b) MDS PAM", col="grey", border=NA)
par(mar = c(7, 7, 3, 3))
plot(mds.year.om, type = "n",
main="Partitional clustering (k-medoid)",
ylab="Coordinate 2",
xlab="Coordinate 1",
cex.lab=1.7,
cex.main=1.7)
points(mds.year.om[mds_pam == 1, ], pch = 19, col = "black")
points(mds.year.om[mds_pam == 2, ], pch = 21, col = "black")
points(mds.year.om[mds_pam == 3, ], pch = 24, col = "black")
legend("bottomright", pch = c(19,21,24), legend = c("cluster 1", "cluster 2", "cluster 3"), cex=1.6)
dev.off()
We adopt the same strategy used above for yearly sequences to the case of monthly sequences.
Hierarchical cluster analysis, non-squared dissimilarities, weighted
<- hclust(as.dist(partner.child.month.om),
fam.ward1.month method = "ward.D",
members = family$weight40)
PAM cluster analysis initialized with Ward, weighted
<- wcKMedRange(partner.child.month.om,
fam.pam.ward.month weights = family$weight40,
kvals = 2:10,
initialclust = fam.ward1.month)
Print the quality test for different cluster solutions
fam.pam.ward.month
PBC HG HGSD ASW ASWw CH R2 CHsq R2sq HC
cluster2 0.41 0.47 0.47 0.22 0.22 336.85 0.13 645.83 0.22 0.25
cluster3 0.51 0.61 0.60 0.23 0.24 304.42 0.21 613.04 0.34 0.19
cluster4 0.49 0.62 0.62 0.21 0.21 265.61 0.25 548.78 0.41 0.20
cluster5 0.51 0.68 0.68 0.21 0.21 234.84 0.29 495.70 0.46 0.17
cluster6 0.51 0.72 0.72 0.21 0.21 219.85 0.32 487.01 0.51 0.16
cluster7 0.52 0.76 0.76 0.22 0.22 208.00 0.35 473.79 0.55 0.14
cluster8 0.50 0.76 0.76 0.20 0.20 195.64 0.37 450.95 0.57 0.15
cluster9 0.51 0.80 0.80 0.22 0.23 190.03 0.39 458.18 0.61 0.13
cluster10 0.53 0.85 0.85 0.24 0.24 185.71 0.42 474.29 0.65 0.10
Extract the 5-cluster solution
.5cl<-fam.pam.ward.month$clustering$cluster5 fam.pam.ward.month
Attach the vector with the 5-cluster solution to the main
data.frame
$fam.pam.ward.month.5cl<-fam.pam.ward.month.5cl family
Re-label clusters from 1 to 5 instead of medoid identifiers
$fam.pam.ward.month.5cl<-car::recode(family$fam.pam.ward.month.5cl,
family"1652=1; 1032=2; 412=3; 869=4; 927=5")
We can now compare the cluster assignment in the case of yearly and monthly data. In both cases we extracted 5 clusters from the PAM+Ward option…
<-table(family$fam.pam.ward.5cl, family$fam.pam.ward.month.5cl) comp.y.m
…and print the resulting table
comp.y.m
1 2 3 4 5
1 222 38 7 3 22
2 91 108 115 11 1
3 0 35 212 11 1
4 5 29 2 236 0
5 2 217 5 0 493
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. Source code is available at https://github.com/sa-book/sa-book.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".