Chapter 3.2 Alignment techniques
readme.html and run
3-0_ChapterSetup.R. This will create
3-0_ChapterSetup.RData in the sub folder
data/R. This file contains the data required to produce the
plots shown below.legend_large_box to
your environment in order to render the tweaked version of the legend
described below. You find this file in the source folder of
the unzipped Chapter 3 archive.LoadInstallPackages.R# assuming you are working within .Rproj environment
library(here)
# install (if necessary) and load other required packages
source(here("source", "load_libraries.R"))
# load environment generated in "3-0_ChapterSetup.R"
load(here("data", "R", "3-0_ChapterSetup.RData"))
In chapter 3.2, we introduces techniques to compare whole sequences.
We first create two vectors 6-element-long (of equal length). Each element is coded with a letter: A, B, or C.
ch3.ex1 <- c("A-B-B-C-C-C", "A-B-B-B-B-B")We then display the two vectors:
[1] "A-B-B-C-C-C" "A-B-B-B-B-B"
We create sequence objects from the three vectors:
ch3.ex1.seq <- seqdef(ch3.ex1)The three resulting sequences are then displayed:
ch3.ex1.seq Sequence
[1] A-B-B-C-C-C
[2] A-B-B-B-B-B
In this case sequences are short and few in number, so that we immediately recognize that they are not the identical. In case of longer and more complex sequences, one might want to know if two sequences are the identical but doing that visually might not be feasible. The following code allows you to do that as it returns [TRUE] if the sequences are the same and [FALSE] if they are not. In this case we compare sequence 1 and 2 above:
seqcomp(ch3.ex1.seq[1,],ch3.ex1.seq[2,])[1] FALSE
However, the two sequences might have some similarity even if they are not exactly the same. To obtain the number of matching positions between the two sequences, we can use the code:
seqmpos(ch3.ex1.seq[1,],ch3.ex1.seq[2,])[1] 3
We now know that sequence 1 and sequence 2 share 3 matching positions.
Alternatively, we can compute the length of the longest common subsequence (elements composed of tokens - states and combinations of subsequent states - that occur in the same order along the sequence) of two sequences:
seqLLCS(ch3.ex1.seq[1,],ch3.ex1.seq[2,])[1] 3
We now know that the longest common subsequence between sequence 1 and sequence 2 is 3-element-long.
Aligning them pairwise to make them the same. Optimal Matching (OM) does so by changing the order of the states or the length of a spell.
In the simplest example possible, both indel and substitution operations can be assigned the same cost of 1 (Levenshtein I). We first generate a substitution costs matrix to be used in the next steps:
costs1 <- matrix(
c(0,1,1,
1,0,1,
1,1,0
),
nrow = 3, ncol = 3, byrow = TRUE)We can display it:
[,1] [,2] [,3]
[1,] 0 1 1
[2,] 1 0 1
[3,] 1 1 0
We then use the following code to obtain a summary of the optimal
number and costs of operations the OM identifies as the cheapest to
align the two sequences. indel and sm) require
the indel cost and the substitution cost matrix respectively to be
specified. In what follows, we use ?seqalign command:
al.ch3.1.2 <- seqalign(ch3.ex1.seq, 1:2, indel=1, sm=costs1)
print(al.ch3.1.2) 1 2 3 4 5 6
Seq 1 A B B C C C
Operations E E E S S S
Costs 0 0 0 1 1 1
Seq 2 A B B B B B
OM distance between sequences: 3
Let’s consider 3 example sequences
ch3.ex2 <- c("A-B-B-C-C-C", "A-B-B-B-B-B", "B-C-C-C-B-B")…define them as sequences objects…
ch3.ex2.seq <- seqdef(ch3.ex2)…and display them:
ch3.ex2.seq Sequence
[1] A-B-B-C-C-C
[2] A-B-B-B-B-B
[3] B-C-C-C-B-B
We now generate a substitution costs matrix with the following costs
to be used in the next step: indel=1, sm=1 -
which corresponds to Levenshtein I. Note that in this case the pairwise
dissimilarity equals to the number of non-matched positions
costsL1 <- matrix(
c(0,1,1,
1,0,1,
1,1,0
),
nrow = 3, ncol = 3, byrow = TRUE)One can display the newly-generated matrix:
costsL1 [,1] [,2] [,3]
[1,] 0 1 1
[2,] 1 0 1
[3,] 1 1 0
Using the ?seqalign command we obtain a summary of the
optimal number and costs of operations. The OM identifies the following
as the cheapest to align the three sequences pairwise:
al1.L1 <- seqalign(ch3.ex2.seq, 1:2, indel=1, sm=costsL1)
print(al1.L1) 1 2 3 4 5 6
Seq 1 A B B C C C
Operations E E E S S S
Costs 0 0 0 1 1 1
Seq 2 A B B B B B
OM distance between sequences: 3
al2.L1 <- seqalign(ch3.ex2.seq, c(1,3), indel=1, sm=costsL1)
print(al2.L1) 1 2 3 4 5 6 7
Seq 1 A B B C C - C
Operations D E S E E I S
Costs 1 0 1 0 0 1 1
Seq 2 - B C C C B B
OM distance between sequences: 4
al3.L1 <- seqalign(ch3.ex2.seq, 2:3, indel=1, sm=costsL1)
print(al3.L1) 1 2 3 4 5 6
Seq 1 A B B B B B
Operations S S S S E E
Costs 1 1 1 1 0 0
Seq 2 B C C C B B
OM distance between sequences: 4
We now generate an alternative substitution costs matrix with
sm=4. Combined with indel=1 this corresponds
to Levenshtein II. Note that indel costs are lower than half of the
minimum substitution cost, so that only indel operations will be used by
the OM.
costsL2 <- matrix(
c(0,4,4,
4,0,4,
4,4,0
),
nrow = 3, ncol = 3, byrow = TRUE)One can display the newly-generated matrix:
costsL2 [,1] [,2] [,3]
[1,] 0 4 4
[2,] 4 0 4
[3,] 4 4 0
Using the ?seqalign command, we obtain a summary of the
optimal number and costs of operations. The OM identifies the following
as the cheapest to align the three sequences pairwise:
al1.L2 <- seqalign(ch3.ex2.seq, 1:2, indel=1, sm=costsL2)
print(al1.L2) 1 2 3 4 5 6 7 8 9
Seq 1 A B B C C C - - -
Operations E E E D D D I I I
Costs 0 0 0 1 1 1 1 1 1
Seq 2 A B B - - - B B B
OM distance between sequences: 6
al2.L2 <- seqalign(ch3.ex2.seq, c(1,3), indel=1, sm=costsL2)
print(al2.L2) 1 2 3 4 5 6 7 8
Seq 1 A B B C C C - -
Operations D D E E E E I I
Costs 1 1 0 0 0 0 1 1
Seq 2 - - B C C C B B
OM distance between sequences: 4
al3.L2 <- seqalign(ch3.ex2.seq, 2:3, indel=1, sm=costsL2)
print(al3.L2) 1 2 3 4 5 6 7 8 9
Seq 1 A B B B - - - B B
Operations D D D E I I I E E
Costs 1 1 1 0 1 1 1 0 0
Seq 2 - - - B C C C B B
OM distance between sequences: 6
The next test corresponds to the Hamming distance setting, that is,
only substitution costs are used. The costs setting will be:
indel=NA, sm=1. As above, we generate a
substitution costs matrix:
costsHAM <- matrix(
c(0,1,1,
1,0,1,
1,1,0
),
nrow = 3, ncol = 3, byrow = TRUE)…and display it:
costsHAM [,1] [,2] [,3]
[1,] 0 1 1
[2,] 1 0 1
[3,] 1 1 0
Using the ?seqalign command, we obtain a summary of the
optimal number and costs of operations: the OM identifies the following
as the cheapest to align the three sequences pairwise. Note that here we
have to specify indel=4 (very high) just to make sure that
the algorithm does not use them as with seqalign is not
possible to specify directly dist=HAM
al1.HAM <- seqalign(ch3.ex2.seq, 1:2, indel=4, sm=costsHAM)
print(al1.HAM) 1 2 3 4 5 6
Seq 1 A B B C C C
Operations E E E S S S
Costs 0 0 0 1 1 1
Seq 2 A B B B B B
OM distance between sequences: 3
al2.HAM <- seqalign(ch3.ex2.seq, c(1,3), indel=4, sm=costsHAM)
print(al2.HAM) 1 2 3 4 5 6
Seq 1 A B B C C C
Operations S S S E S S
Costs 1 1 1 0 1 1
Seq 2 B C C C B B
OM distance between sequences: 5
al3.HAM <- seqalign(ch3.ex2.seq, 2:3, indel=4, sm=costsHAM)
print(al3.HAM) 1 2 3 4 5 6
Seq 1 A B B B B B
Operations S S S S E E
Costs 1 1 1 1 0 0
Seq 2 B C C C B B
OM distance between sequences: 4
We now try the option with the following costs: indel=1
, substitution=2, generating the substitution costs matrix
first:
costsOMcl <- matrix(
c(0,2,2,
2,0,2,
2,2,0
),
nrow = 3, ncol = 3, byrow = TRUE)We display the resulting matrix of substitution costs:
costsOMcl [,1] [,2] [,3]
[1,] 0 2 2
[2,] 2 0 2
[3,] 2 2 0
Using the ?seqalign command, we obtain a summary of the
optimal number and costs of operation. The OM identifies the following
as the cheapest to align the three sequences pairwise.
al1.OMcl <- seqalign(ch3.ex2.seq, 1:2, indel=1, sm=costsOMcl)
print(al1.OMcl) 1 2 3 4 5 6
Seq 1 A B B C C C
Operations E E E S S S
Costs 0 0 0 2 2 2
Seq 2 A B B B B B
OM distance between sequences: 6
al2.OMcl <- seqalign(ch3.ex2.seq, c(1,3), indel=1, sm=costsOMcl)
print(al2.OMcl) 1 2 3 4 5 6 7 8
Seq 1 A B B C C C - -
Operations D D E E E E I I
Costs 1 1 0 0 0 0 1 1
Seq 2 - - B C C C B B
OM distance between sequences: 4
al3.OMcl <- seqalign(ch3.ex2.seq, 2:3, indel=1, sm=costsOMcl)
print(al3.OMcl) 1 2 3 4 5 6 7
Seq 1 A B - B B B B
Operations D E I S S E E
Costs 1 0 1 2 2 0 0
Seq 2 - B C C C B B
OM distance between sequences: 6
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. Source code is available at https://github.com/sa-book/sa-book.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".