Chapter 3.2 Alignment techniques
readme.html
and run
3-0_ChapterSetup.R
. This will create
3-0_ChapterSetup.RData
in the sub folder
data/R
. This file contains the data required to produce the
plots shown below.legend_large_box
to
your environment in order to render the tweaked version of the legend
described below. You find this file in the source
folder of
the unzipped Chapter 3 archive.LoadInstallPackages.R
# assuming you are working within .Rproj environment
library(here)
# install (if necessary) and load other required packages
source(here("source", "load_libraries.R"))
# load environment generated in "3-0_ChapterSetup.R"
load(here("data", "R", "3-0_ChapterSetup.RData"))
In chapter 3.2, we introduces techniques to compare whole sequences.
We first create two vectors 6-element-long (of equal length). Each element is coded with a letter: A, B, or C.
<- c("A-B-B-C-C-C", "A-B-B-B-B-B") ch3.ex1
We then display the two vectors:
[1] "A-B-B-C-C-C" "A-B-B-B-B-B"
We create sequence objects from the three vectors:
<- seqdef(ch3.ex1) ch3.ex1.seq
The three resulting sequences are then displayed:
ch3.ex1.seq
Sequence
[1] A-B-B-C-C-C
[2] A-B-B-B-B-B
In this case sequences are short and few in number, so that we immediately recognize that they are not the identical. In case of longer and more complex sequences, one might want to know if two sequences are the identical but doing that visually might not be feasible. The following code allows you to do that as it returns [TRUE] if the sequences are the same and [FALSE] if they are not. In this case we compare sequence 1 and 2 above:
seqcomp(ch3.ex1.seq[1,],ch3.ex1.seq[2,])
[1] FALSE
However, the two sequences might have some similarity even if they are not exactly the same. To obtain the number of matching positions between the two sequences, we can use the code:
seqmpos(ch3.ex1.seq[1,],ch3.ex1.seq[2,])
[1] 3
We now know that sequence 1 and sequence 2 share 3 matching positions.
Alternatively, we can compute the length of the longest common subsequence (elements composed of tokens - states and combinations of subsequent states - that occur in the same order along the sequence) of two sequences:
seqLLCS(ch3.ex1.seq[1,],ch3.ex1.seq[2,])
[1] 3
We now know that the longest common subsequence between sequence 1 and sequence 2 is 3-element-long.
Aligning them pairwise to make them the same. Optimal Matching (OM) does so by changing the order of the states or the length of a spell.
In the simplest example possible, both indel and substitution operations can be assigned the same cost of 1 (Levenshtein I). We first generate a substitution costs matrix to be used in the next steps:
<- matrix(
costs1 c(0,1,1,
1,0,1,
1,1,0
),nrow = 3, ncol = 3, byrow = TRUE)
We can display it:
[,1] [,2] [,3]
[1,] 0 1 1
[2,] 1 0 1
[3,] 1 1 0
We then use the following code to obtain a summary of the optimal
number and costs of operations the OM identifies as the cheapest to
align the two sequences. indel
and sm
) require
the indel cost and the substitution cost matrix respectively to be
specified. In what follows, we use ?seqalign
command:
1.2 <- seqalign(ch3.ex1.seq, 1:2, indel=1, sm=costs1)
al.ch3.print(al.ch3.1.2)
1 2 3 4 5 6
Seq 1 A B B C C C
Operations E E E S S S
Costs 0 0 0 1 1 1
Seq 2 A B B B B B
OM distance between sequences: 3
Let’s consider 3 example sequences
<- c("A-B-B-C-C-C", "A-B-B-B-B-B", "B-C-C-C-B-B") ch3.ex2
…define them as sequences objects…
<- seqdef(ch3.ex2) ch3.ex2.seq
…and display them:
ch3.ex2.seq
Sequence
[1] A-B-B-C-C-C
[2] A-B-B-B-B-B
[3] B-C-C-C-B-B
We now generate a substitution costs matrix with the following costs
to be used in the next step: indel=1
, sm=1
-
which corresponds to Levenshtein I. Note that in this case the pairwise
dissimilarity equals to the number of non-matched positions
<- matrix(
costsL1 c(0,1,1,
1,0,1,
1,1,0
),nrow = 3, ncol = 3, byrow = TRUE)
One can display the newly-generated matrix:
costsL1
[,1] [,2] [,3]
[1,] 0 1 1
[2,] 1 0 1
[3,] 1 1 0
Using the ?seqalign
command we obtain a summary of the
optimal number and costs of operations. The OM identifies the following
as the cheapest to align the three sequences pairwise:
<- seqalign(ch3.ex2.seq, 1:2, indel=1, sm=costsL1)
al1.L1 print(al1.L1)
1 2 3 4 5 6
Seq 1 A B B C C C
Operations E E E S S S
Costs 0 0 0 1 1 1
Seq 2 A B B B B B
OM distance between sequences: 3
<- seqalign(ch3.ex2.seq, c(1,3), indel=1, sm=costsL1)
al2.L1 print(al2.L1)
1 2 3 4 5 6 7
Seq 1 A B B C C - C
Operations D E S E E I S
Costs 1 0 1 0 0 1 1
Seq 2 - B C C C B B
OM distance between sequences: 4
<- seqalign(ch3.ex2.seq, 2:3, indel=1, sm=costsL1)
al3.L1 print(al3.L1)
1 2 3 4 5 6
Seq 1 A B B B B B
Operations S S S S E E
Costs 1 1 1 1 0 0
Seq 2 B C C C B B
OM distance between sequences: 4
We now generate an alternative substitution costs matrix with
sm=4
. Combined with indel=1
this corresponds
to Levenshtein II. Note that indel costs are lower than half of the
minimum substitution cost, so that only indel operations will be used by
the OM.
<- matrix(
costsL2 c(0,4,4,
4,0,4,
4,4,0
),nrow = 3, ncol = 3, byrow = TRUE)
One can display the newly-generated matrix:
costsL2
[,1] [,2] [,3]
[1,] 0 4 4
[2,] 4 0 4
[3,] 4 4 0
Using the ?seqalign
command, we obtain a summary of the
optimal number and costs of operations. The OM identifies the following
as the cheapest to align the three sequences pairwise:
<- seqalign(ch3.ex2.seq, 1:2, indel=1, sm=costsL2)
al1.L2 print(al1.L2)
1 2 3 4 5 6 7 8 9
Seq 1 A B B C C C - - -
Operations E E E D D D I I I
Costs 0 0 0 1 1 1 1 1 1
Seq 2 A B B - - - B B B
OM distance between sequences: 6
<- seqalign(ch3.ex2.seq, c(1,3), indel=1, sm=costsL2)
al2.L2 print(al2.L2)
1 2 3 4 5 6 7 8
Seq 1 A B B C C C - -
Operations D D E E E E I I
Costs 1 1 0 0 0 0 1 1
Seq 2 - - B C C C B B
OM distance between sequences: 4
<- seqalign(ch3.ex2.seq, 2:3, indel=1, sm=costsL2)
al3.L2 print(al3.L2)
1 2 3 4 5 6 7 8 9
Seq 1 A B B B - - - B B
Operations D D D E I I I E E
Costs 1 1 1 0 1 1 1 0 0
Seq 2 - - - B C C C B B
OM distance between sequences: 6
The next test corresponds to the Hamming distance setting, that is,
only substitution costs are used. The costs setting will be:
indel=NA
, sm=1
. As above, we generate a
substitution costs matrix:
<- matrix(
costsHAM c(0,1,1,
1,0,1,
1,1,0
),nrow = 3, ncol = 3, byrow = TRUE)
…and display it:
costsHAM
[,1] [,2] [,3]
[1,] 0 1 1
[2,] 1 0 1
[3,] 1 1 0
Using the ?seqalign
command, we obtain a summary of the
optimal number and costs of operations: the OM identifies the following
as the cheapest to align the three sequences pairwise. Note that here we
have to specify indel=4
(very high) just to make sure that
the algorithm does not use them as with seqalign
is not
possible to specify directly dist=HAM
<- seqalign(ch3.ex2.seq, 1:2, indel=4, sm=costsHAM)
al1.HAM print(al1.HAM)
1 2 3 4 5 6
Seq 1 A B B C C C
Operations E E E S S S
Costs 0 0 0 1 1 1
Seq 2 A B B B B B
OM distance between sequences: 3
<- seqalign(ch3.ex2.seq, c(1,3), indel=4, sm=costsHAM)
al2.HAM print(al2.HAM)
1 2 3 4 5 6
Seq 1 A B B C C C
Operations S S S E S S
Costs 1 1 1 0 1 1
Seq 2 B C C C B B
OM distance between sequences: 5
<- seqalign(ch3.ex2.seq, 2:3, indel=4, sm=costsHAM)
al3.HAM print(al3.HAM)
1 2 3 4 5 6
Seq 1 A B B B B B
Operations S S S S E E
Costs 1 1 1 1 0 0
Seq 2 B C C C B B
OM distance between sequences: 4
We now try the option with the following costs: indel=1
, substitution=2
, generating the substitution costs matrix
first:
<- matrix(
costsOMcl c(0,2,2,
2,0,2,
2,2,0
),nrow = 3, ncol = 3, byrow = TRUE)
We display the resulting matrix of substitution costs:
costsOMcl
[,1] [,2] [,3]
[1,] 0 2 2
[2,] 2 0 2
[3,] 2 2 0
Using the ?seqalign
command, we obtain a summary of the
optimal number and costs of operation. The OM identifies the following
as the cheapest to align the three sequences pairwise.
<- seqalign(ch3.ex2.seq, 1:2, indel=1, sm=costsOMcl)
al1.OMcl print(al1.OMcl)
1 2 3 4 5 6
Seq 1 A B B C C C
Operations E E E S S S
Costs 0 0 0 2 2 2
Seq 2 A B B B B B
OM distance between sequences: 6
<- seqalign(ch3.ex2.seq, c(1,3), indel=1, sm=costsOMcl)
al2.OMcl print(al2.OMcl)
1 2 3 4 5 6 7 8
Seq 1 A B B C C C - -
Operations D D E E E E I I
Costs 1 1 0 0 0 0 1 1
Seq 2 - - B C C C B B
OM distance between sequences: 4
<- seqalign(ch3.ex2.seq, 2:3, indel=1, sm=costsOMcl)
al3.OMcl print(al3.OMcl)
1 2 3 4 5 6 7
Seq 1 A B - B B B B
Operations D E I S S E E
Costs 1 0 1 2 2 0 0
Seq 2 - B C C C B B
OM distance between sequences: 6
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. Source code is available at https://github.com/sa-book/sa-book.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".