Chapter 2.3 Description of Sequence Data I: The Basics
We use a simple alphabet differentiating four partnership states to illustrate how {TraMineR} produces simple descriptive statistics on our sequence data. The sequences cover the partnership biographies from age 18 to 40 (monthly data) and are stored in the object partner.month.seq.
State | Short Label |
---|---|
Single | S |
LAT | LAT |
Cohabiting | COH |
Married | MAR |
The function seqmeant computes the mean total time spent in each state of the alphabet. Setting serr = TRUE additionally reports standard errors, while prop = TRUE returns relative frequencies rather than the average number of months spent in each state.
seqmeant(partner.month.seq, serr = TRUE)
seqmeant(partner.month.seq, prop = TRUE)
We can also use seqmeant to identify the average number of episodes for each state of the alphabet by applying it to a sequence data object in the DSS (distinct successive states) format.
seqmeant(seqdss(partner.month.seq), serr = TRUE)
Taken together these commands yield the results shown in Table 2.3.
State | Mean time (months) | SD (months) | Relative freq. | Mean episodes | SD (episodes) |
---|---|---|---|---|---|
S | 72.5 | 69.8 | 0.27 | 1.6 | 1.2 |
LAT | 48.0 | 43.9 | 0.18 | 1.8 | 1.3 |
COH | 48.6 | 53.3 | 0.18 | 1.0 | 0.8 |
MAR | 95.0 | 78.9 | 0.36 | 0.8 | 0.5 |
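To make the two uses of seqmeant more concrete, the following base R sketch reproduces the underlying arithmetic on a small made-up matrix. The object toy and all values in it are hypothetical, and the code is only an illustration of the idea, not TraMineR's implementation (it ignores weights and standard errors).

```r
# Hypothetical toy data (not from the chapter): 3 persons, 6 months each.
toy <- rbind(
  c("S", "S",   "LAT", "LAT", "MAR", "MAR"),
  c("S", "LAT", "S",   "LAT", "COH", "COH"),
  c("S", "S",   "S",   "S",   "S",   "MAR")
)
states <- c("S", "LAT", "COH", "MAR")

# Months each person spent in each state (rows = persons, columns = states).
time.in.state <- sapply(states, function(s) rowSums(toy == s))

# Mean total time per state, and the same quantity as a share of sequence length.
mean.time <- colMeans(time.in.state)
mean.prop <- mean.time / ncol(toy)

# DSS: collapse runs of identical states, then count episodes per state.
dss <- lapply(seq_len(nrow(toy)), function(i) rle(toy[i, ])$values)
episodes <- sapply(states, function(s) sapply(dss, function(x) sum(x == s)))
mean.episodes <- colMeans(episodes)

print(round(rbind(mean.time, mean.prop, mean.episodes), 2))
```

The mean times add up to the sequence length (here 6 months), which is why the relative frequencies sum to one.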
The person-specific number of transitions between states can be obtained with seqtransn. However, we are usually interested in the average number of transitions. We use wtd.mean and wtd.var from the {Hmisc} package to compute the weighted mean and standard deviation. The weights are stored in the variable weight40 of the data frame family, which served as the source for generating the sequence object partner.month.seq.
wtd.mean(seqtransn(partner.month.seq), family$weight40)
sqrt(wtd.var(seqtransn(partner.month.seq), family$weight40))
We did the same computation for the sequence object partner.year.seq, which has a yearly instead of a monthly granularity (see Chapter 2.2).
Granularity | Mean | SD |
---|---|---|
Monthly data | 4.3 | 2.6 |
Yearly data | 3.3 | 1.9 |
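The logic behind counting transitions and weighting them can be sketched in base R. The toy sequences and weights below are hypothetical. The variance uses frequency weights with the denominator sum(w) - 1, which we assume corresponds to the default behaviour of wtd.var; other weighting conventions give different results.

```r
# Hypothetical toy data: state sequences (rows = persons) and survey weights.
toy <- rbind(
  c("S", "S",   "LAT", "LAT", "MAR", "MAR"),
  c("S", "LAT", "S",   "LAT", "COH", "COH"),
  c("S", "S",   "S",   "S",   "S",   "MAR")
)
w <- c(1.5, 0.5, 1.0)  # hypothetical weights, one per person

# A transition occurs wherever a state differs from its predecessor.
n.trans <- rowSums(toy[, -1] != toy[, -ncol(toy)])

# Weighted mean and frequency-weighted variance (assumed denominator: sum(w) - 1).
w.mean <- sum(w * n.trans) / sum(w)
w.var  <- sum(w * (n.trans - w.mean)^2) / (sum(w) - 1)
w.sd   <- sqrt(w.var)

print(c(mean = w.mean, sd = w.sd))
```

Because a change of state is only visible when it falls between two observed positions, coarser (yearly) data can hide short spells, which is consistent with the lower mean for yearly data in the table above.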
Transition rates between states can be computed with seqtrate. Again, we display the transition rates using sequences with monthly and yearly granularity (partner.month.seq and partner.year.seq). It is also possible to calculate transition rates at specific positions of the sequence by setting time.varying = TRUE. This option is also used for generating the animated illustration below (Table 2.4).
seqtrate(partner.month.seq)
seqtrate(partner.year.seq)
State at t (monthly data) | S | LAT | COH | MAR |
---|---|---|---|---|
S | 0.98 | 0.02 | 0.00 | 0.00 |
LAT | 0.02 | 0.96 | 0.02 | 0.00 |
COH | 0.00 | 0.00 | 0.98 | 0.01 |
MAR | 0.00 | 0.00 | 0.00 | 1.00 |
State at t (yearly data) | S | LAT | COH | MAR |
---|---|---|---|---|
S | 0.81 | 0.14 | 0.04 | 0.01 |
LAT | 0.12 | 0.68 | 0.16 | 0.04 |
COH | 0.04 | 0.02 | 0.80 | 0.14 |
MAR | 0.01 | 0.01 | 0.00 | 0.98 |
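A transition rate is the share of cases in a given state at position t that are found in each state at position t + 1. The following base R sketch computes such rates for a hypothetical toy matrix. It is unweighted and pools all positions, so it only illustrates the principle rather than replicating seqtrate.

```r
# Hypothetical toy data: 3 persons observed at 6 positions.
toy <- rbind(
  c("S", "S",   "LAT", "LAT", "MAR", "MAR"),
  c("S", "LAT", "S",   "LAT", "COH", "COH"),
  c("S", "S",   "S",   "S",   "S",   "MAR")
)
states <- c("S", "LAT", "COH", "MAR")

# Pair every state with its successor (positions 1..5 with positions 2..6).
from <- factor(toy[, -ncol(toy)], levels = states)
to   <- factor(toy[, -1], levels = states)

# Cross-tabulate and convert counts to row-wise proportions.
counts <- table(from, to)
rates  <- prop.table(counts, margin = 1)

print(round(rates, 2))
```

Each row sums to one. With fine-grained data most cases stay in their current state from one position to the next, which is why the diagonal dominates in the monthly table above.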
Additional insights can be gained by using sequence data stored in the DSS format. A computation based on this format provides transition rates between episodes of distinct states. Note that we only use monthly sequence data for this exercise in order to keep track of short-lasting spells, which might be obscured in the yearly data (Table 2.5).
seqtrate(seqdss(partner.month.seq))
State at t | S | LAT | COH | MAR |
---|---|---|---|---|
S | 0.00 | 0.91 | 0.07 | 0.02 |
LAT | 0.42 | 0.00 | 0.50 | 0.08 |
COH | 0.20 | 0.12 | 0.00 | 0.68 |
MAR | 0.44 | 0.46 | 0.11 | 0.00 |
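The zeros on the diagonal follow directly from the DSS format: once consecutive repetitions are collapsed, a state can never be followed by itself. A minimal base R sketch with hypothetical toy sequences shows this; rle is used here as a stand-in for seqdss.

```r
# Hypothetical toy data: 3 persons observed at 6 positions.
toy <- rbind(
  c("S",   "S",   "LAT", "LAT", "MAR", "MAR"),
  c("S",   "LAT", "LAT", "COH", "COH", "MAR"),
  c("MAR", "MAR", "S",   "S",   "S",   "LAT")
)
states <- c("S", "LAT", "COH", "MAR")

# Collapse each sequence to its distinct successive states.
dss <- lapply(seq_len(nrow(toy)), function(i) rle(toy[i, ])$values)

# Pair each episode with the episode that follows it.
from <- unlist(lapply(dss, function(x) x[-length(x)]))
to   <- unlist(lapply(dss, function(x) x[-1]))

counts <- table(from = factor(from, levels = states),
                to   = factor(to,   levels = states))
rates  <- prop.table(counts, margin = 1)

print(round(rates, 2))
```

The resulting rates describe where an episode leads once it ends, irrespective of how long it lasted.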
Among other things, seqstatd computes the distribution of states at each position in the sequence. Usually this information is displayed graphically (state distribution plot) rather than as a descriptive table.
A tabular presentation of the state distribution usually requires extracting the distribution for a selection of (meaningful) positions of the sequence. In the example below we display the distribution at ages 18, 20, 24, 28, 32, 36, and 40. Note that time is measured in months. Hence, we do not extract the descriptives at positions 1 and 3 but at positions 1 and 24 to obtain the state distribution at ages 18 and 20 (Table 2.6).
seqstatd(partner.month.seq)$Frequencies[, c(1, seq(24, 264, by = 48))]
State | 18 | 20 | 24 | 28 | 32 | 36 | 40 |
---|---|---|---|---|---|---|---|
S | 0.65 | 0.52 | 0.36 | 0.25 | 0.18 | 0.15 | 0.14 |
LAT | 0.31 | 0.32 | 0.25 | 0.17 | 0.12 | 0.10 | 0.05 |
COH | 0.03 | 0.11 | 0.23 | 0.25 | 0.22 | 0.15 | 0.13 |
MAR | 0.01 | 0.05 | 0.17 | 0.33 | 0.48 | 0.60 | 0.68 |
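The mapping between ages and monthly positions used for this table can be checked with a few lines of base R. The helper names start.age, positions, and chapter.positions are introduced here for illustration only.

```r
start.age <- 18
ages <- c(18, 20, 24, 28, 32, 36, 40)

# Months elapsed since the start of observation; age 18 itself maps to position 1.
positions <- pmax(1, (ages - start.age) * 12)

# The same positions written as in the chapter's call.
chapter.positions <- c(1, seq(24, 264, by = 48))

print(data.frame(age = ages, position = positions))
```

Both expressions yield the same seven positions, the last of which (264) is the final month of the 22-year observation window.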
The following code extracts the corresponding cross-sectional values of the Shannon entropy for the same time points (Table 2.7).
seqstatd(partner.month.seq)$Entropy[c(1, seq(24, 264, by = 48))]
18 | 20 | 24 | 28 | 32 | 36 | 40 |
---|---|---|---|---|---|---|
0.58 | 0.78 | 0.97 | 0.98 | 0.9 | 0.8 | 0.69 |
At age 28 the states are most evenly distributed. As a result, the entropy value is highest at this age.
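The link between the state distribution and the entropy can be retraced by hand. The sketch below recomputes a normalized Shannon entropy from the rounded frequencies in Table 2.6, assuming that the reported values are divided by the logarithm of the alphabet size. Because the inputs are rounded, the results only approximate Table 2.7.

```r
# Rounded state frequencies copied from Table 2.6 (rows = states, columns = ages).
freq <- matrix(
  c(0.65, 0.52, 0.36, 0.25, 0.18, 0.15, 0.14,
    0.31, 0.32, 0.25, 0.17, 0.12, 0.10, 0.05,
    0.03, 0.11, 0.23, 0.25, 0.22, 0.15, 0.13,
    0.01, 0.05, 0.17, 0.33, 0.48, 0.60, 0.68),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("S", "LAT", "COH", "MAR"),
                  c("18", "20", "24", "28", "32", "36", "40"))
)

# Rounding leaves some columns slightly off 1, so rescale each column.
freq <- prop.table(freq, margin = 2)

# Shannon entropy of one distribution; zero shares contribute nothing.
shannon <- function(p) {
  p <- p[p > 0]
  -sum(p * log(p))
}

# Normalize by the maximum possible entropy, log(number of states).
entropy <- apply(freq, 2, shannon) / log(nrow(freq))

print(round(entropy, 2))
print(names(which.max(entropy)))
```

A value of 1 would mean that all four states are equally frequent, and a value of 0 that everyone is in the same state.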
Finally, the seqistatd function, which returns the time each case spent in each state, can also be used to figure out how many cases never spent any time in specific states.
as_tibble(seqistatd(partner.month.seq)) %>%
  mutate_all(~case_when(. == 0 ~ 1,
                        TRUE ~ 0)) %>%
  summarise_all(~(weighted.mean(., w = family$weight40)))
S | LAT | COH | MAR |
---|---|---|---|
0.12 | 0.07 | 0.22 | 0.25 |
According to the monthly partnership data, 25% of the sample did not spend a single month in wedlock. A further 12% were never observed as single, that is, outside some sort of partnership.
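The same computation can be expressed in base R without the tidyverse helpers. The matrix time.in.state and the weights w below are hypothetical stand-ins for the output of seqistatd and for family$weight40.

```r
# Hypothetical months spent in each state by 4 persons (stand-in for seqistatd output).
time.in.state <- rbind(
  c(2, 2, 0, 2),
  c(0, 1, 2, 3),
  c(5, 0, 0, 1),
  c(6, 0, 0, 0)
)
colnames(time.in.state) <- c("S", "LAT", "COH", "MAR")
w <- c(2, 1, 1, 1)  # hypothetical survey weights, one per person

# Indicator: TRUE if a person never spent any time in the state.
never <- time.in.state == 0

# Weighted share of persons who never visited each state.
share.never <- colSums(never * w) / sum(w)

print(share.never)
```

The multiplication never * w recycles the weights down each column, so every person's indicator is scaled by that person's weight before the column sums are taken.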
The state distribution in Table 2.6 indicates the dominance of the partnership states “Single” at the beginning and “Married” at the end of the sequences. This is also reflected in the sequence of modal states. The following commands extract the modal sequences using sequence data with monthly and yearly granularity (see Chapter 2.2).
modal.month.seq <- seqdef(as_tibble(seqmodst(partner.month.seq)))
print(modal.month.seq, format = "SPS")
modal.year.seq <- seqdef(as_tibble(seqmodst(partner.year.seq)))
print(modal.year.seq, format = "SPS")
Granularity | Modal Sequence |
---|---|
Monthly data | (S,102)-(MAR,162) |
Yearly data | (S,9)-(MAR,13) |
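A modal sequence is built position by position: at each position the most frequent state is taken, and the result is then compressed into spells. The following base R sketch mimics this on hypothetical toy data. Ties are resolved in favour of the state listed first in states, which is an arbitrary choice made for this illustration.

```r
# Hypothetical toy data: 3 persons observed at 6 positions.
toy <- rbind(
  c("S", "S",   "LAT", "LAT", "MAR", "MAR"),
  c("S", "S",   "S",   "MAR", "MAR", "MAR"),
  c("S", "LAT", "LAT", "MAR", "MAR", "MAR")
)
states <- c("S", "LAT", "COH", "MAR")

# Most frequent state at each position (column).
modal <- apply(toy, 2, function(x) {
  states[which.max(table(factor(x, levels = states)))]
})

# Compress into state-permanence (SPS) notation: (state,duration)-...
runs <- rle(modal)
sps  <- paste0("(", runs$values, ",", runs$lengths, ")", collapse = "-")

print(sps)
# Is the modal sequence identical to any observed sequence?
print(any(apply(toy, 1, function(x) all(x == modal))))
```

In this toy example the modal sequence does not coincide with any of the three observed sequences.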
Usually the modal sequence is a hypothetical sequence that is not actually observed in the data. In contrast, seqrep aims at identifying those sequences that represent the data best. Note that this approach requires the computation of a distance matrix. The following commands illustrate the identification of representative sequences using yearly sequences (partner.year.seq). For details on computing sequence distances (seqdist), see Chapter 3.
partner.year.om <- seqdist(partner.year.seq,
                           method = "OM", sm = "CONSTANT")
partner.year.rep <- seqrep(partner.year.seq,
                           diss = partner.year.om,
                           criterion = "density")
The following command prints the sequences in the more accessible SPS format.
print(partner.year.seq[attributes(partner.year.rep)$Index,], format = "SPS")
Descriptive statistics on the quality of the representative sequences are stored in attributes(partner.year.rep)$Statistics. They can be easily accessed by typing:
The table below presents a set of representative sequences (SPS format) and the corresponding coverage statistics. In addition, it shows how many cases are (more or less) represented by each of the extracted sequences (Table 2.8):
Sequence | Coverage (in %) | Assigned (in %) |
---|---|---|
(S,1)-(LAT,2)-(MAR,19) | 5.7 | 6.5 |
(S,20)-(MAR,2) | 4.4 | 25.2 |
(S,4)-(LAT,1)-(COH,1)-(MAR,16) | 3.8 | 5.3 |
(LAT,3)-(COH,2)-(MAR,17) | 3.1 | 11.4 |
(S,2)-(LAT,2)-(COH,3)-(MAR,15) | 2.7 | 17.1 |
(S,5)-(LAT,2)-(COH,2)-(MAR,13) | 2.7 | 23.5 |
(COH,2)-(MAR,20) | 2.6 | 3.0 |
(S,1)-(LAT,5)-(MAR,16) | 2.3 | 8.0 |
Total Coverage | 27.5 | 100.0 |
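The ideas of neighbourhood density and coverage can be illustrated with a strongly simplified base R sketch. It uses hypothetical toy sequences, a simple Hamming distance in place of optimal matching, an arbitrary radius of 1, and selects only a single representative. It is therefore a rough illustration of the concepts and not a re-implementation of seqrep.

```r
# Hypothetical toy data: 5 persons observed over 4 years.
toy <- rbind(
  c("S",   "S",   "MAR", "MAR"),
  c("S",   "S",   "MAR", "MAR"),
  c("S",   "LAT", "MAR", "MAR"),
  c("S",   "S",   "S",   "S"),
  c("COH", "COH", "COH", "MAR")
)
n <- nrow(toy)

# Hamming distance (swapped in for OM): number of positions that differ.
d <- sapply(seq_len(n), function(i) {
  sapply(seq_len(n), function(j) sum(toy[i, ] != toy[j, ]))
})

radius <- 1  # hypothetical neighbourhood radius

# Density: how many sequences (including itself) lie within the radius.
density <- rowSums(d <= radius)

# Pick the sequence with the densest neighbourhood (first one in case of ties).
rep.idx <- which.max(density)

# Coverage: share of all sequences within the radius of the representative.
coverage <- mean(d[, rep.idx] <= radius)

print(density)
print(c(representative = rep.idx, coverage = coverage))
```

In practice further representatives are added until a desired share of the data is covered, which is why the table above lists several sequences with individually modest coverage.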
Finally, we conclude this chapter by identifying the medoid sequences of women and men using {TraMineRextras}’s seqrep.grp function, which extracts representative sequences for different subgroups.
partner.year.sex.rep <- seqrep.grp(partner.year.seq,
                                   group = family$sex,
                                   diss = partner.year.om,
                                   criterion = "dist",
                                   nrep = 1,
                                   ret = "both")
# Medoid & coverage - men (family$sex = 0 = male)
print(partner.year.sex.rep[[1]]$` 0`, format = "SPS")
summary(partner.year.sex.rep[[1]]$` 0`)
# Medoid & coverage - women (family$sex = 1 = female)
print(partner.year.sex.rep[[1]]$` 1`, format = "SPS")
summary(partner.year.sex.rep[[1]]$` 1`)
Sex | Sequence | Coverage (in %) |
---|---|---|
Female | (S,3)-(LAT,2)-(COH,4)-(MAR,13) | 3.38 |
Male | (S,7)-(LAT,4)-(COH,3)-(MAR,8) | 0.99 |
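A medoid is the observed sequence with the smallest total distance to all other sequences in its group. The base R sketch below finds one medoid per group for hypothetical toy data. The grouping variable sex, the helper medoid.index, and the Hamming distance (used here instead of optimal matching) are assumptions made for illustration.

```r
# Hypothetical toy data: 6 persons observed over 4 years.
toy <- rbind(
  c("S",   "S",   "MAR", "MAR"),
  c("S",   "LAT", "MAR", "MAR"),
  c("S",   "LAT", "LAT", "MAR"),
  c("COH", "COH", "MAR", "MAR"),
  c("COH", "MAR", "MAR", "MAR"),
  c("MAR", "MAR", "MAR", "MAR")
)
sex <- c(0, 0, 0, 1, 1, 1)  # hypothetical grouping: 0 = male, 1 = female
n <- nrow(toy)

# Hamming distance (swapped in for OM): number of positions that differ.
d <- sapply(seq_len(n), function(i) {
  sapply(seq_len(n), function(j) sum(toy[i, ] != toy[j, ]))
})

# Row index of the sequence with the smallest summed distance within a group.
medoid.index <- function(idx) {
  idx[which.min(rowSums(d[idx, idx, drop = FALSE]))]
}

# One medoid per group.
medoids <- sapply(split(seq_len(n), sex), medoid.index)

print(medoids)
print(toy[medoids, ])
```

Restricting the distance matrix to d[idx, idx] ensures that each medoid is chosen relative to members of its own group only.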
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. Source code is available at https://github.com/sa-book/sa-book.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".