Chapter 3.3 Alignment-based extensions of OM
readme.html
and run
3-0_ChapterSetup.R
. This will create
3-0_ChapterSetup.RData
in the sub folder
data/R
. This file contains the data required to produce the
plots shown below.legend_large_box
to
your environment in order to render the tweaked version of the legend
described below. You find this file in the source
folder of
the unzipped Chapter 3 archive.LoadInstallPackages.R
# assuming you are working within .Rproj environment
library(here)
# install (if necessary) and load other required packages
source(here("source", "load_libraries.R"))
# load environment generated in "3-0_ChapterSetup.R"
load(here("data", "R", "3-0_ChapterSetup.RData"))
In chapter 3.3, we introduce alignment-based extensions of OM. The data come from a sub-sample of the German Family Panel - pairfam. For further information on the study and on how to access the full scientific use file see here.
We generate a substitution costs matrix to be used in the next step based on the following theory-based costs:
substitution=1 if one if different number of children but same partnership status
substitution=2 any substitution of different partnership statuses
<- matrix(
theo c(0,1,2,2,2,2,2,2,2,
1,0,2,2,2,2,2,2,2,
2,2,0,1,2,2,2,2,2,
2,2,1,0,2,2,2,2,2,
2,2,2,2,0,1,2,2,2,
2,2,2,2,1,0,2,2,2,
2,2,2,2,2,2,0,1,1,
2,2,2,2,2,2,1,0,1,
2,2,2,2,2,2,1,1,0),
nrow = 9,
ncol = 9,
byrow = TRUE,
dimnames = list(longlab.partner.child,
shortlab.partner.child))
and display the resulting matrix:
theo
S Sc LAT LATc COH COHc MAR MARc1 MARc2+
Single, no child 0 1 2 2 2 2 2 2 2
Single, child(ren) 1 0 2 2 2 2 2 2 2
LAT, no child 2 2 0 1 2 2 2 2 2
LAT, child(ren) 2 2 1 0 2 2 2 2 2
Cohabiting, no child 2 2 2 2 0 1 2 2 2
Cohabiting, child(ren) 2 2 2 2 1 0 2 2 2
Married, no child 2 2 2 2 2 2 0 1 1
Married, 1 child 2 2 2 2 2 2 1 0 1
Married, 2+ children 2 2 2 2 2 2 1 1 0
We use this matrix to calculate pairwise dissimilarities:
<-seqdist(partner.child.year.seq, method = "OM", indel = 1,sm = theo) om.theo
Let’s consider the first three sequences in the initial sample:
In the book, this corresponds to Figure 3.1: if you want to generate this figure in color as we do here, you can use the following code
<- layout(matrix(seq(1:2), 1, 2, byrow = TRUE),
layout.fig1 widths = c(.7, .3))
layout.show(layout.fig1)
par(mar = c(5, 4, 4, 0) + 0.1, las = 1,
mgp=c(2,1,-.5))
seqiplot(partner.child.year.seq[1:3, ],
ylab = "Sequences",
with.legend = "FALSE",
border = TRUE,
axes = FALSE,
cpal = colspace.partner.child,
main = "",
weighted=FALSE)
par(mgp=c(3,1,0.25))
axis(1, at=(seq(0,22, 2)), labels = seq(18,40, by = 2))
mtext(text = "Age", cex = .8,
side = 1,#side 1 = bottom
line = 2.5)
par(mar = c(5, 0, 4, 0) + 0.1)
seqlegend(partner.child.year.seq,
cpal = colspace.partner.child,
cex = .95, position = "center")
dev.off()
We can now inspect a selection of the dissimilarity matrix between the three sequences above:
1:3, 1:3] om.theo[
1 2 3
1 0 37 40
2 37 0 18
3 40 18 0
Generate two vectors with the following properties for each possible state:
partner: 0=single, 1=couple
child: 0=no, 1=yes
<- c(0, 0, 1, 1, 1, 1, 1,1,1)
partner <- c(0,1,0,1,0,1,0,1,1) child
Create a data.frame
with the vectors:
<- data.frame(partner = partner, child = child) alphabetprop
Label the rows of the data.frame
after the states’
names:
rownames(alphabetprop) <- alphabet(partner.child.year.seq)
… and display the data.frame
alphabetprop
partner child
S 0 0
Sc 0 1
LAT 1 0
LATc 1 1
COH 1 0
COHc 1 1
MAR 1 0
MARc1 1 1
MARc2+ 1 1
Generate a substitution costs matrix based on the properties with “Gower” algorithm and print it
<- as.matrix(daisy(alphabetprop,
prop metric = "gower"))
prop
S Sc LAT LATc COH COHc MAR MARc1 MARc2+
S 0.0 0.5 0.5 1.0 0.5 1.0 0.5 1.0 1.0
Sc 0.5 0.0 1.0 0.5 1.0 0.5 1.0 0.5 0.5
LAT 0.5 1.0 0.0 0.5 0.0 0.5 0.0 0.5 0.5
LATc 1.0 0.5 0.5 0.0 0.5 0.0 0.5 0.0 0.0
COH 0.5 1.0 0.0 0.5 0.0 0.5 0.0 0.5 0.5
COHc 1.0 0.5 0.5 0.0 0.5 0.0 0.5 0.0 0.0
MAR 0.5 1.0 0.0 0.5 0.0 0.5 0.0 0.5 0.5
MARc1 1.0 0.5 0.5 0.0 0.5 0.0 0.5 0.0 0.0
MARc2+ 1.0 0.5 0.5 0.0 0.5 0.0 0.5 0.0 0.0
Alternatively, this code can be used within the ?seqcost
command (where FEATURES
corresponds to the properties)
<- seqcost(partner.child.year.seq,
prop2 method="FEATURES",
state.features = alphabetprop,
feature.weights = NULL)
prop2
$indel
[1] 0.5
$sm
S Sc LAT LATc COH COHc MAR MARc1 MARc2+
S 0.0 0.5 0.5 1.0 0.5 1.0 0.5 1.0 1.0
Sc 0.5 0.0 1.0 0.5 1.0 0.5 1.0 0.5 0.5
LAT 0.5 1.0 0.0 0.5 0.0 0.5 0.0 0.5 0.5
LATc 1.0 0.5 0.5 0.0 0.5 0.0 0.5 0.0 0.0
COH 0.5 1.0 0.0 0.5 0.0 0.5 0.0 0.5 0.5
COHc 1.0 0.5 0.5 0.0 0.5 0.0 0.5 0.0 0.0
MAR 0.5 1.0 0.0 0.5 0.0 0.5 0.0 0.5 0.5
MARc1 1.0 0.5 0.5 0.0 0.5 0.0 0.5 0.0 0.0
MARc2+ 1.0 0.5 0.5 0.0 0.5 0.0 0.5 0.0 0.0
We then use the matrix to calculate the pairwise dissimilarities, and display a selection of the dissimilarity matrix between sequences in the figure above (Fig. 3.1):
<-seqdist(partner.child.year.seq,
om.propmethod = "OM",
indel = 1,
sm = prop)
1:3, 1:3] om.prop[
1 2 3
1 0.0 15.5 9.0
2 15.5 0.0 6.5
3 9.0 6.5 0.0
Here an alternative code to use the matrix to calculate pairwise dissimilarities and display the dissimilarity matrix between the usual sequences (Fig. 3.1)
<-seqdist(partner.child.year.seq,
om.prop2method = "OM",
indel = 1,
sm = prop2$sm)
1:3, 1:3] om.prop2[
1 2 3
1 0.0 15.5 9.0
2 15.5 0.0 6.5
3 9.0 6.5 0.0
For illustrative purpose, we use three example sequences (6 time-points, 3 states: A, B, C)
<- c("A-B-B-C-C-C", "A-B-B-B-B-B", "B-C-C-C-B-B")
ch3.ex2
<- seqdef(ch3.ex2) ch3.ex2.seq
The command ?seqcost
generates substitution costs, in
this case based on transition rates. The command returns also the value
of indel
.
<-seqcost(ch3.ex2.seq, method="TRATE") trate1
We can display the substitution costs matrix based on transition rates
trate1
$indel
[1] 1
$sm
A B C
A 0 1.00 2.00
B 1 0.00 1.55
C 2 1.55 0.00
As an alternative code, ?seqsubm
generates substitution
costs (here based on transition rates as above), but does not return
indel
<- seqsubm(ch3.ex2.seq, method="TRATE") trate2
We can display the substitution costs matrix based on transition rates
trate2
A B C
A 0 1.00 2.00
B 1 0.00 1.55
C 2 1.55 0.00
We now want to extract the transition rates between states for the three example sequences:
<-seqtrate(ch3.ex2.seq, sel.states = NULL,
tr.ratestime.varying = FALSE,
weighted = TRUE,
lag = 1,
with.missing = FALSE,
count = FALSE)
…and use the ?seqalign
command to get a summary of the
optimal number and costs of OM operations with data-driven (transition
rates) substitution costs:
<- seqalign(ch3.ex2.seq, 1:2,
al1.OM.dd indel=1,
sm=trate2)
print(al1.OM.dd)
1 2 3 4 5 6
Seq 1 A B B C C C
Operations E E E S S S
Costs 0.00 0.00 0.00 1.55 1.55 1.55
Seq 2 A B B B B B
OM distance between sequences: 4.65
<- seqalign(ch3.ex2.seq, c
al2.OM.dd 1,3),
(indel=1,
sm=trate2)
print(al2.OM.dd)
1 2 3 4 5 6 7 8
Seq 1 A B B C C C - -
Operations D D E E E E I I
Costs 1 1 0 0 0 0 1 1
Seq 2 - - B C C C B B
OM distance between sequences: 4
<- seqalign(ch3.ex2.seq,
al3.OM.dd 2:3,
indel=1,
sm=trate2)
print(al3.OM.dd)
1 2 3 4 5 6 7
Seq 1 A B - B B B B
Operations D E I S S E E
Costs 1.00 0.00 1.00 1.55 1.55 0.00 0.00
Seq 2 - B C C C B B
OM distance between sequences: 5.1
We proceed by using the three example sequences and compute the
dissimilarity matrix using the following costs: indel
=1,
sm
=dynamic Hamming distance (DHD)
<-seqdist(ch3.ex2.seq,
dhd.dissmethod="DHD")
…and display the resulting DHD-based dissimilarity matrix:
dhd.diss
[1] [2] [3]
[1] 0.0 10.5 16.0
[2] 10.5 0.0 11.5
[3] 16.0 11.5 0.0
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. Source code is available at https://github.com/sa-book/sa-book.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".