Alternative OM-bases metrics to align sequences

Chapter 3.3 Alignment-based extensions of OM

Click here to get instructions…
# assuming you are working within .Rproj environment
library(here)

# install (if necessary) and load other required packages
source(here("source", "load_libraries.R"))

# load environment generated in "3-0_ChapterSetup.R"
load(here("data", "R", "3-0_ChapterSetup.RData"))


In chapter 3.3, we introduce alignment-based extensions of OM. The data come from a sub-sample of the German Family Panel - pairfam. For further information on the study and on how to access the full scientific use file see here.

Theory-based costs

We generate a substitution costs matrix to be used in the next step based on the following theory-based costs:

substitution=1 if one if different number of children but same partnership status

substitution=2 any substitution of different partnership statuses

theo <- matrix(
  c(0,1,2,2,2,2,2,2,2,
    1,0,2,2,2,2,2,2,2,
    2,2,0,1,2,2,2,2,2,
    2,2,1,0,2,2,2,2,2,
    2,2,2,2,0,1,2,2,2,
    2,2,2,2,1,0,2,2,2,
    2,2,2,2,2,2,0,1,1,
    2,2,2,2,2,2,1,0,1,
    2,2,2,2,2,2,1,1,0),
  nrow = 9, 
  ncol = 9, 
  byrow = TRUE,
  dimnames = list(longlab.partner.child, 
                  shortlab.partner.child))

and display the resulting matrix:

theo
                       S Sc LAT LATc COH COHc MAR MARc1 MARc2+
Single, no child       0  1   2    2   2    2   2     2      2
Single, child(ren)     1  0   2    2   2    2   2     2      2
LAT, no child          2  2   0    1   2    2   2     2      2
LAT, child(ren)        2  2   1    0   2    2   2     2      2
Cohabiting, no child   2  2   2    2   0    1   2     2      2
Cohabiting, child(ren) 2  2   2    2   1    0   2     2      2
Married, no child      2  2   2    2   2    2   0     1      1
Married, 1 child       2  2   2    2   2    2   1     0      1
Married, 2+ children   2  2   2    2   2    2   1     1      0

We use this matrix to calculate pairwise dissimilarities:

om.theo<-seqdist(partner.child.year.seq, method = "OM", indel = 1,sm = theo)

Let’s consider the first three sequences in the initial sample:

In the book, this corresponds to Figure 3.1: if you want to generate this figure in color as we do here, you can use the following code

layout.fig1 <- layout(matrix(seq(1:2), 1, 2, byrow = TRUE),
                      widths = c(.7, .3))
layout.show(layout.fig1)

par(mar = c(5, 4, 4, 0) + 0.1, las = 1, 
    mgp=c(2,1,-.5))
seqiplot(partner.child.year.seq[1:3, ], 
         ylab = "Sequences",
         with.legend = "FALSE", 
         border = TRUE, 
         axes = FALSE,
         cpal = colspace.partner.child, 
         main = "",
         weighted=FALSE)
par(mgp=c(3,1,0.25))
axis(1, at=(seq(0,22, 2)), labels = seq(18,40, by = 2))
mtext(text = "Age", cex = .8,
      side = 1,#side 1 = bottom
      line = 2.5)

par(mar = c(5, 0, 4, 0) + 0.1)
seqlegend(partner.child.year.seq, 
          cpal = colspace.partner.child, 
          cex = .95, position = "center")

dev.off()

We can now inspect a selection of the dissimilarity matrix between the three sequences above:

om.theo[1:3, 1:3]
   1  2  3
1  0 37 40
2 37  0 18
3 40 18  0

Costs based on state properties

Generate two vectors with the following properties for each possible state:

 partner: 0=single, 1=couple 
 child: 0=no, 1=yes
partner <- c(0, 0, 1, 1, 1, 1, 1,1,1)
child <- c(0,1,0,1,0,1,0,1,1)

Create a data.frame with the vectors:

alphabetprop <- data.frame(partner = partner, child = child)

Label the rows of the data.frame after the states’ names:

rownames(alphabetprop) <- alphabet(partner.child.year.seq)

… and display the data.frame

alphabetprop
       partner child
S            0     0
Sc           0     1
LAT          1     0
LATc         1     1
COH          1     0
COHc         1     1
MAR          1     0
MARc1        1     1
MARc2+       1     1

Generate a substitution costs matrix based on the properties with “Gower” algorithm and print it

prop <- as.matrix(daisy(alphabetprop, 
                        metric = "gower"))
prop
         S  Sc LAT LATc COH COHc MAR MARc1 MARc2+
S      0.0 0.5 0.5  1.0 0.5  1.0 0.5   1.0    1.0
Sc     0.5 0.0 1.0  0.5 1.0  0.5 1.0   0.5    0.5
LAT    0.5 1.0 0.0  0.5 0.0  0.5 0.0   0.5    0.5
LATc   1.0 0.5 0.5  0.0 0.5  0.0 0.5   0.0    0.0
COH    0.5 1.0 0.0  0.5 0.0  0.5 0.0   0.5    0.5
COHc   1.0 0.5 0.5  0.0 0.5  0.0 0.5   0.0    0.0
MAR    0.5 1.0 0.0  0.5 0.0  0.5 0.0   0.5    0.5
MARc1  1.0 0.5 0.5  0.0 0.5  0.0 0.5   0.0    0.0
MARc2+ 1.0 0.5 0.5  0.0 0.5  0.0 0.5   0.0    0.0

Alternatively, this code can be used within the ?seqcost command (where FEATURES corresponds to the properties)

prop2 <- seqcost(partner.child.year.seq, 
                 method="FEATURES",
                 state.features = alphabetprop, 
                 feature.weights = NULL)
prop2
$indel
[1] 0.5

$sm
         S  Sc LAT LATc COH COHc MAR MARc1 MARc2+
S      0.0 0.5 0.5  1.0 0.5  1.0 0.5   1.0    1.0
Sc     0.5 0.0 1.0  0.5 1.0  0.5 1.0   0.5    0.5
LAT    0.5 1.0 0.0  0.5 0.0  0.5 0.0   0.5    0.5
LATc   1.0 0.5 0.5  0.0 0.5  0.0 0.5   0.0    0.0
COH    0.5 1.0 0.0  0.5 0.0  0.5 0.0   0.5    0.5
COHc   1.0 0.5 0.5  0.0 0.5  0.0 0.5   0.0    0.0
MAR    0.5 1.0 0.0  0.5 0.0  0.5 0.0   0.5    0.5
MARc1  1.0 0.5 0.5  0.0 0.5  0.0 0.5   0.0    0.0
MARc2+ 1.0 0.5 0.5  0.0 0.5  0.0 0.5   0.0    0.0

We then use the matrix to calculate the pairwise dissimilarities, and display a selection of the dissimilarity matrix between sequences in the figure above (Fig. 3.1):

om.prop<-seqdist(partner.child.year.seq, 
                 method = "OM", 
                 indel = 1,
                 sm = prop)
om.prop[1:3, 1:3]
     1    2   3
1  0.0 15.5 9.0
2 15.5  0.0 6.5
3  9.0  6.5 0.0

Here an alternative code to use the matrix to calculate pairwise dissimilarities and display the dissimilarity matrix between the usual sequences (Fig. 3.1)

om.prop2<-seqdist(partner.child.year.seq, 
                  method = "OM", 
                  indel = 1, 
                  sm = prop2$sm)
om.prop2[1:3, 1:3]
     1    2   3
1  0.0 15.5 9.0
2 15.5  0.0 6.5
3  9.0  6.5 0.0

Data driven-costs

For illustrative purpose, we use three example sequences (6 time-points, 3 states: A, B, C)

ch3.ex2 <- c("A-B-B-C-C-C", "A-B-B-B-B-B", "B-C-C-C-B-B")

ch3.ex2.seq <- seqdef(ch3.ex2)

The command ?seqcost generates substitution costs, in this case based on transition rates. The command returns also the value of indel.

trate1<-seqcost(ch3.ex2.seq, method="TRATE")

We can display the substitution costs matrix based on transition rates

trate1
$indel
[1] 1

$sm
  A    B    C
A 0 1.00 2.00
B 1 0.00 1.55
C 2 1.55 0.00

As an alternative code, ?seqsubm generates substitution costs (here based on transition rates as above), but does not return indel

trate2 <- seqsubm(ch3.ex2.seq, method="TRATE")

We can display the substitution costs matrix based on transition rates

trate2
  A    B    C
A 0 1.00 2.00
B 1 0.00 1.55
C 2 1.55 0.00

We now want to extract the transition rates between states for the three example sequences:

tr.rates<-seqtrate(ch3.ex2.seq, sel.states = NULL, 
                   time.varying = FALSE, 
                   weighted = TRUE,
                   lag = 1, 
                   with.missing = FALSE, 
                   count = FALSE)

…and use the ?seqalign command to get a summary of the optimal number and costs of OM operations with data-driven (transition rates) substitution costs:

al1.OM.dd <- seqalign(ch3.ex2.seq, 1:2, 
                      indel=1, 
                      sm=trate2)

print(al1.OM.dd)
           1    2    3    4    5    6   
Seq 1      A    B    B    C    C    C   
Operations E    E    E    S    S    S   
Costs      0.00 0.00 0.00 1.55 1.55 1.55
Seq 2      A    B    B    B    B    B   

OM distance between sequences: 4.65 
al2.OM.dd <- seqalign(ch3.ex2.seq, c
                      (1,3), 
                      indel=1, 
                      sm=trate2)

print(al2.OM.dd)
           1 2 3 4 5 6 7 8
Seq 1      A B B C C C - -
Operations D D E E E E I I
Costs      1 1 0 0 0 0 1 1
Seq 2      - - B C C C B B

OM distance between sequences: 4 
al3.OM.dd <- seqalign(ch3.ex2.seq, 
                      2:3, 
                      indel=1, 
                      sm=trate2)

print(al3.OM.dd)
           1    2    3    4    5    6    7   
Seq 1      A    B    -    B    B    B    B   
Operations D    E    I    S    S    E    E   
Costs      1.00 0.00 1.00 1.55 1.55 0.00 0.00
Seq 2      -    B    C    C    C    B    B   

OM distance between sequences: 5.1 

Dynamic Hamming distance

We proceed by using the three example sequences and compute the dissimilarity matrix using the following costs: indel=1, sm=dynamic Hamming distance (DHD)

dhd.diss<-seqdist(ch3.ex2.seq, 
                  method="DHD")

…and display the resulting DHD-based dissimilarity matrix:

dhd.diss
     [1]  [2]  [3]
[1]  0.0 10.5 16.0
[2] 10.5  0.0 11.5
[3] 16.0 11.5  0.0

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. Source code is available at https://github.com/sa-book/sa-book.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".