Chapter 3.7 Beyond the standard full-sample pairwise sequence comparison
readme.html
and run
3-0_ChapterSetup.R
. This will create
3-0_ChapterSetup.RData
in the sub folder
data/R
. This file contains the data required to produce the
plots shown below.legend_large_box
to
your environment in order to render the tweaked version of the legend
described below. You find this file in the source
folder of
the unzipped Chapter 3 archive.LoadInstallPackages.R
# assuming you are working within .Rproj environment
library(here)
# install (if necessary) and load other required packages
source(here("source", "LoadInstallPackages.R"))
# load environment generated in "3-0_ChapterSetup.R"
load(here("data", "R", "3-0_ChapterSetup.RData"))
In chapter 3.7, we introduce techniques to compare all sequences in a dataset to a reference sequences instead of conducting a full-sample pairwise comparison. The data come from a sub-sample of the German Family Panel - pairfam. For further information on the study and on how to access the full scientific use file see here.
First, we generate an object that contains the theoretical sequence id the SPS format of the same length of the sequences in the sample:
<-as.matrix("(S,5)-(MAR,2)-(MARc1,3)-(MARc2+,12)") theo.seq
…and print it
theo.seq
[,1]
[1,] "(S,5)-(MAR,2)-(MARc1,3)-(MARc2+,12)"
We then recode the theoretical sequence from SPS to STS and overwrite the object that contains it
<- seqformat(theo.seq, from = "SPS", to = "STS") theo.seq
We overwrite it again, by formally define it as a sequence using the
TraMineR
?seqdef
command:
<-seqdef(theo.seq,
theo.seqstates = shortlab.partner.child,
alphabet = shortlab.partner.child,
xtstep = 1)
We can now compute the dissimilarity between the theoretical sequence
and all sequences in the data. Here we use OM with constant substitution
costs and indel
=1. The theo.seq
object has to
be specified as argument to the option refseq
<-seqdist(partner.child.year.seq,
dist.theomethod = "OM",
indel = 1,
sm = "CONSTANT",
refseq = theo.seq)
We can identify the most frequent sequence by using the
?seqtab
command and specify the idxs
as
follows and store it in a separate object, here called
mostfreq.seq
<-seqtab(partner.child.year.seq,
mostfreq.seqidxs = 1,
weighted = FALSE,
format = "SPS")
We print the most frequent sequence
mostfreq.seq
Freq Percent
S/22 14 0.75
… and compute the dissimilarity between the most frequent sequence
and all sequences in the data. The mostfreq.seq
object has
to be specified as argument to the option refseq
<-seqdist(partner.child.year.seq,
dist.mostfreqmethod = "OM",
indel = 1,
sm = "CONSTANT",
refseq = mostfreq.seq)
First, we have to compute a dissimilarity matrix, here we use OM with
constant substitution costs of 2 and indel
=1
<-seqdist(partner.child.year.seq,
om.s2.i1method = "OM",
indel = 1,
sm = "CONSTANT")
We can then identify the medoid by using the ?disscenter
command and specify the medoids.index
options as
follows
<- disscenter(om.s2.i1,
medoid.seq medoids.index="first")
The medoid can be also printed directly by using the following code
print(partner.child.year.seq[medoid.seq,],
format="SPS")
Sequence
1544 (S,2)-(LAT,1)-(S,3)-(LAT,3)-(COH,2)-(MAR,2)-(MARc1,4)-(MARc2+,5)
We can now compute the dissimilarity between the medoid sequence and
all sequences in the data. The medoid.seq
object has to be
specified as argument to the option refseq
<-seqdist(partner.child.year.seq,
dist.medoidmethod = "OM",
indel = 1,
sm = "CONSTANT",
refseq = medoid.seq[1])
We display the dissimilarity values for the first three sequences in the sample and the three reference sequences:
1:3] dist.theo[
1 2 3
42 24 36
1:3] dist.mostfreq[
1 2 3
44 32 42
1:3] dist.medoid[
1 2 3
40 16 30
For a visual comparison, here is the version in colors of Figure 3.3, displayed in black and white in the book:
Here the code to generate the figure above
<- layout(matrix(c(1,2,3,4,5), 5, 1, byrow = TRUE),
layout.fig1 heights = c(.20,.20,.20,.42,.20))
layout.show(layout.fig1)
par(mar=c(3, 3, 3, 2))
#theoretical
seqiplot(theo.seq,
with.legend=FALSE,
border = TRUE,
axes = FALSE,
yaxis = FALSE, ylab = NA,
main="",
cex.main = 2,
cpal = colspace.partner.child)
mtext(text = "(a) Theoretical traditional family formation sequence",
side = 3,#side 1 = bottom
line = 1,
las=1)
#most freqent
seqiplot(mostfreq.seq,
with.legend=FALSE,
border = TRUE,
axes = FALSE,
yaxis = FALSE, ylab = NA,
main="",
cex.main = 2,
cpal = colspace.partner.child)
mtext(text = "(b) Most frequent family formation sequence in the example data",
side = 3,#side 1 = bottom
line = 1,
las=1)
# medoid
seqiplot(partner.child.year.seq[medoid.seq,],
with.legend=FALSE,
border = TRUE,
axes = FALSE,
yaxis = FALSE, ylab = NA,
main="",
cex.main = 2,
cpal = colspace.partner.child)
mtext(text = "(c) Medoid family formation sequence in the example data",
side = 3,#side 1 = bottom
line = 1,
las=1)
#example 3 seq
seqiplot(partner.child.year.seq [1:3, ],
with.legend=FALSE,
border = TRUE,
axes = FALSE,
yaxis = FALSE, ylab = NA,
main="",
cex.main = 2,
cpal = colspace.partner.child,
weighted=FALSE)
par(mgp=c(3,1,-0.5)) # adjust parameters for x-axis
axis(1, at=(seq(0,22, 2)), labels = seq(18,40, by = 2))
par(mgp=c(3,1,0.3), las=1)
axis(2, at=c(0.7,1.9,3), labels = c(1,2,3))
mtext(text = "(d) Three illustrative sequences of family formation",
side = 3,#side 1 = bottom
line = 1)
mtext(text = "Age",
side = 1,#side 1 = bottom
line = 2)
#legend
plot(NULL ,xaxt='n',yaxt='n',bty='n',ylab='',xlab='', xlim=0:1, ylim=0:1)
legend(x = "top",inset = 0,
legend = longlab.partner.child,
col=colspace.partner.child,
lwd=12,
cex=1.2,
ncol=3,lty = 7)
par(mar=c(1, 1, 4, 1))
title(main = "",
line = 2, font.main = 1)
dev.off()
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. Source code is available at https://github.com/sa-book/sa-book.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".