Create example sequences & calculate the pairwise dissimilarity matrix using different aligment techniques

Chapter 3.2 Alignment techniques

Click here to get instructions…
# assuming you are working within .Rproj environment
library(here)

# install (if necessary) and load other required packages
source(here("source", "load_libraries.R"))

# load environment generated in "3-0_ChapterSetup.R"
load(here("data", "R", "3-0_ChapterSetup.RData"))


In chapter 3.2, we introduces techniques to compare whole sequences.

Create example sequences

We first create two vectors 6-element-long (of equal length). Each element is coded with a letter: A, B, or C.

ch3.ex1 <- c("A-B-B-C-C-C", "A-B-B-B-B-B")

We then display the two vectors:

[1] "A-B-B-C-C-C" "A-B-B-B-B-B"

We create sequence objects from the three vectors:

ch3.ex1.seq <- seqdef(ch3.ex1)

The three resulting sequences are then displayed:

ch3.ex1.seq
    Sequence   
[1] A-B-B-C-C-C
[2] A-B-B-B-B-B

Optimal matching (OM)

In this case sequences are short and few in number, so that we immediately recognize that they are not the identical. In case of longer and more complex sequences, one might want to know if two sequences are the identical but doing that visually might not be feasible. The following code allows you to do that as it returns [TRUE] if the sequences are the same and [FALSE] if they are not. In this case we compare sequence 1 and 2 above:

seqcomp(ch3.ex1.seq[1,],ch3.ex1.seq[2,])
[1] FALSE

However, the two sequences might have some similarity even if they are not exactly the same. To obtain the number of matching positions between the two sequences, we can use the code:

seqmpos(ch3.ex1.seq[1,],ch3.ex1.seq[2,])
[1] 3

We now know that sequence 1 and sequence 2 share 3 matching positions.

Alternatively, we can compute the length of the longest common subsequence (elements composed of tokens - states and combinations of subsequent states - that occur in the same order along the sequence) of two sequences:

seqLLCS(ch3.ex1.seq[1,],ch3.ex1.seq[2,])
[1] 3

We now know that the longest common subsequence between sequence 1 and sequence 2 is 3-element-long.

Aligning them pairwise to make them the same. Optimal Matching (OM) does so by changing the order of the states or the length of a spell.

In the simplest example possible, both indel and substitution operations can be assigned the same cost of 1 (Levenshtein I). We first generate a substitution costs matrix to be used in the next steps:

costs1 <- matrix(
  c(0,1,1,
    1,0,1,
    1,1,0
    ),
  nrow = 3, ncol = 3, byrow = TRUE)

We can display it:

     [,1] [,2] [,3]
[1,]    0    1    1
[2,]    1    0    1
[3,]    1    1    0

We then use the following code to obtain a summary of the optimal number and costs of operations the OM identifies as the cheapest to align the two sequences. indel and sm) require the indel cost and the substitution cost matrix respectively to be specified. In what follows, we use ?seqalign command:

al.ch3.1.2 <- seqalign(ch3.ex1.seq, 1:2, indel=1, sm=costs1)
print(al.ch3.1.2)
           1 2 3 4 5 6
Seq 1      A B B C C C
Operations E E E S S S
Costs      0 0 0 1 1 1
Seq 2      A B B B B B

OM distance between sequences: 3 

Assigning costs to the alignment operations

Let’s consider 3 example sequences

ch3.ex2 <- c("A-B-B-C-C-C", "A-B-B-B-B-B", "B-C-C-C-B-B")

…define them as sequences objects…

ch3.ex2.seq <- seqdef(ch3.ex2)

…and display them:

ch3.ex2.seq
    Sequence   
[1] A-B-B-C-C-C
[2] A-B-B-B-B-B
[3] B-C-C-C-B-B

We now generate a substitution costs matrix with the following costs to be used in the next step: indel=1, sm=1 - which corresponds to Levenshtein I. Note that in this case the pairwise dissimilarity equals to the number of non-matched positions

costsL1 <- matrix(
  c(0,1,1,
    1,0,1,
    1,1,0
  ),
  nrow = 3, ncol = 3, byrow = TRUE)

One can display the newly-generated matrix:

costsL1
     [,1] [,2] [,3]
[1,]    0    1    1
[2,]    1    0    1
[3,]    1    1    0

Using the ?seqalign command we obtain a summary of the optimal number and costs of operations. The OM identifies the following as the cheapest to align the three sequences pairwise:

al1.L1 <- seqalign(ch3.ex2.seq, 1:2, indel=1, sm=costsL1)
print(al1.L1)
           1 2 3 4 5 6
Seq 1      A B B C C C
Operations E E E S S S
Costs      0 0 0 1 1 1
Seq 2      A B B B B B

OM distance between sequences: 3 
al2.L1 <- seqalign(ch3.ex2.seq, c(1,3), indel=1, sm=costsL1)
print(al2.L1)
           1 2 3 4 5 6 7
Seq 1      A B B C C - C
Operations D E S E E I S
Costs      1 0 1 0 0 1 1
Seq 2      - B C C C B B

OM distance between sequences: 4 
al3.L1 <- seqalign(ch3.ex2.seq, 2:3, indel=1, sm=costsL1)
print(al3.L1)
           1 2 3 4 5 6
Seq 1      A B B B B B
Operations S S S S E E
Costs      1 1 1 1 0 0
Seq 2      B C C C B B

OM distance between sequences: 4 

We now generate an alternative substitution costs matrix with sm=4. Combined with indel=1 this corresponds to Levenshtein II. Note that indel costs are lower than half of the minimum substitution cost, so that only indel operations will be used by the OM.

costsL2 <- matrix(
  c(0,4,4,
    4,0,4,
    4,4,0
  ),
  nrow = 3, ncol = 3, byrow = TRUE)

One can display the newly-generated matrix:

costsL2
     [,1] [,2] [,3]
[1,]    0    4    4
[2,]    4    0    4
[3,]    4    4    0

Using the ?seqalign command, we obtain a summary of the optimal number and costs of operations. The OM identifies the following as the cheapest to align the three sequences pairwise:

al1.L2  <- seqalign(ch3.ex2.seq, 1:2, indel=1, sm=costsL2)
print(al1.L2)
           1 2 3 4 5 6 7 8 9
Seq 1      A B B C C C - - -
Operations E E E D D D I I I
Costs      0 0 0 1 1 1 1 1 1
Seq 2      A B B - - - B B B

OM distance between sequences: 6 
al2.L2 <- seqalign(ch3.ex2.seq, c(1,3), indel=1, sm=costsL2)
print(al2.L2)
           1 2 3 4 5 6 7 8
Seq 1      A B B C C C - -
Operations D D E E E E I I
Costs      1 1 0 0 0 0 1 1
Seq 2      - - B C C C B B

OM distance between sequences: 4 
al3.L2 <- seqalign(ch3.ex2.seq, 2:3, indel=1, sm=costsL2)
print(al3.L2)
           1 2 3 4 5 6 7 8 9
Seq 1      A B B B - - - B B
Operations D D D E I I I E E
Costs      1 1 1 0 1 1 1 0 0
Seq 2      - - - B C C C B B

OM distance between sequences: 6 

The next test corresponds to the Hamming distance setting, that is, only substitution costs are used. The costs setting will be: indel=NA, sm=1. As above, we generate a substitution costs matrix:

costsHAM <- matrix(
  c(0,1,1,
    1,0,1,
    1,1,0
  ),
  nrow = 3, ncol = 3, byrow = TRUE)

…and display it:

costsHAM
     [,1] [,2] [,3]
[1,]    0    1    1
[2,]    1    0    1
[3,]    1    1    0

Using the ?seqalign command, we obtain a summary of the optimal number and costs of operations: the OM identifies the following as the cheapest to align the three sequences pairwise. Note that here we have to specify indel=4 (very high) just to make sure that the algorithm does not use them as with seqalign is not possible to specify directly dist=HAM

al1.HAM <- seqalign(ch3.ex2.seq, 1:2, indel=4, sm=costsHAM)
print(al1.HAM)
           1 2 3 4 5 6
Seq 1      A B B C C C
Operations E E E S S S
Costs      0 0 0 1 1 1
Seq 2      A B B B B B

OM distance between sequences: 3 
al2.HAM <- seqalign(ch3.ex2.seq, c(1,3), indel=4, sm=costsHAM)
print(al2.HAM)
           1 2 3 4 5 6
Seq 1      A B B C C C
Operations S S S E S S
Costs      1 1 1 0 1 1
Seq 2      B C C C B B

OM distance between sequences: 5 
al3.HAM <- seqalign(ch3.ex2.seq, 2:3, indel=4, sm=costsHAM)
print(al3.HAM) 
           1 2 3 4 5 6
Seq 1      A B B B B B
Operations S S S S E E
Costs      1 1 1 1 0 0
Seq 2      B C C C B B

OM distance between sequences: 4 

We now try the option with the following costs: indel=1 , substitution=2, generating the substitution costs matrix first:

costsOMcl <- matrix(
  c(0,2,2,
    2,0,2,
    2,2,0
  ),
  nrow = 3, ncol = 3, byrow = TRUE)

We display the resulting matrix of substitution costs:

costsOMcl
     [,1] [,2] [,3]
[1,]    0    2    2
[2,]    2    0    2
[3,]    2    2    0

Using the ?seqalign command, we obtain a summary of the optimal number and costs of operation. The OM identifies the following as the cheapest to align the three sequences pairwise.

al1.OMcl <- seqalign(ch3.ex2.seq, 1:2, indel=1, sm=costsOMcl)
print(al1.OMcl)
           1 2 3 4 5 6
Seq 1      A B B C C C
Operations E E E S S S
Costs      0 0 0 2 2 2
Seq 2      A B B B B B

OM distance between sequences: 6 
al2.OMcl <- seqalign(ch3.ex2.seq, c(1,3), indel=1, sm=costsOMcl)
print(al2.OMcl)
           1 2 3 4 5 6 7 8
Seq 1      A B B C C C - -
Operations D D E E E E I I
Costs      1 1 0 0 0 0 1 1
Seq 2      - - B C C C B B

OM distance between sequences: 4 
al3.OMcl <- seqalign(ch3.ex2.seq, 2:3, indel=1, sm=costsOMcl)
print(al3.OMcl)
           1 2 3 4 5 6 7
Seq 1      A B - B B B B
Operations D E I S S E E
Costs      1 0 1 2 2 0 0
Seq 2      - B C C C B B

OM distance between sequences: 6 

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. Source code is available at https://github.com/sa-book/sa-book.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".