Describing & summarising sequence data

Chapter 2.3 Description of Sequence Data I: The Basics

We use a simple alphabet that distinguishes four partnership states to illustrate how {TraMineR} produces basic descriptive statistics for sequence data. The sequences cover the partnership biographies from age 18 to 40 in monthly resolution, that is, 264 positions per sequence. They are stored in the object partner.month.seq.

State Short Label
Single S
LAT LAT
Cohabiting COH
Married MAR

 

Time spent in different states & occurrence of episodes

The function seqmeant computes the mean total time spent in each state of the alphabet. With serr = TRUE it additionally returns measures of dispersion (variance, standard deviation, and standard error). With prop = TRUE it returns the relative share of time spent in each state instead of the average number of months.

seqmeant(partner.month.seq, serr = TRUE)
seqmeant(partner.month.seq, prop = TRUE)
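To make explicit what these two calls summarise, the following base-R sketch computes the same kind of quantities by hand. The toy matrix, its dimensions, and all object names are hypothetical and only serve as illustration; it does not reproduce the internals of seqmeant.

```r
# Hypothetical toy data: 3 persons observed for 6 months each
states <- c("S", "LAT", "COH", "MAR")
toy <- rbind(
  c("S", "S",   "LAT", "LAT", "COH", "MAR"),
  c("S", "S",   "S",   "S",   "S",   "S"),
  c("S", "LAT", "S",   "LAT", "MAR", "MAR")
)

# Months spent in each state: one row per person, one column per state
time_in_state <- vapply(states,
                        function(s) rowSums(toy == s),
                        numeric(nrow(toy)))

mean_time <- colMeans(time_in_state)   # average number of months per state
mean_prop <- mean_time / ncol(toy)     # the same as a share of the sequence length

print(round(mean_time, 2))
print(round(mean_prop, 2))
```

Because every position is occupied by exactly one state, the mean durations add up to the sequence length and the proportions add up to 1.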

We can also use seqmeant to obtain the average number of episodes for each state of the alphabet by applying it to a sequence object in the DSS (distinct successive states) format, in which every episode is reduced to a single entry.

seqmeant(seqdss(partner.month.seq), serr = TRUE)

Taken together these commands yield the results shown in Table 2.3.

         Time spent in months             Number of episodes
State    Mean    SD      Relative freq.   Mean    SD
S        72.5    69.8    0.27             1.6     1.2
LAT      48.0    43.9    0.18             1.8     1.3
COH      48.6    53.3    0.18             1.0     0.8
MAR      95.0    78.9    0.36             0.8     0.5
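The episode counts in the right-hand columns of Table 2.3 rest on the DSS representation. As a rough illustration of the idea, base R's rle collapses runs of identical states in a single sequence. The example sequence is hypothetical.

```r
states <- c("S", "LAT", "COH", "MAR")

# Hypothetical sequence of 8 months
x <- c("S", "S", "LAT", "LAT", "S", "COH", "COH", "MAR")

# Distinct successive states: each spell is reduced to one entry
dss <- rle(x)$values

# Number of episodes per state in this sequence
episodes <- table(factor(dss, levels = states))

print(dss)
print(episodes)
```

Counting the entries of the collapsed sequence per state gives the number of episodes. Averaging these counts over all cases is what applying seqmeant to the DSS object amounts to.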

Number of transitions & transition matrix

The person-specific number of transitions between states can be obtained by seqtransn. However, we are usually interested in the average number of transitions. We use wtd.mean and wtd.var from the {Hmisc} package to compute the weighted mean and standard deviation. The weights are stored in the variable weight40 of the data frame family which served as source for generating the sequence object partner.month.seq.

wtd.mean(seqtransn(partner.month.seq), family$weight40)
sqrt(wtd.var(seqtransn(partner.month.seq), family$weight40))

We repeated the computation for the sequence object partner.year.seq, which stores the same biographies with yearly instead of monthly granularity (see Chapter 2.2).

Granularity Mean SD
Monthly data 4.3 2.6
Yearly data 3.3 1.9
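The difference between the two granularities can be illustrated with a small base-R sketch. The sequence, the weights, and the rule for deriving the yearly sequence (keeping the first month of each year) are assumptions made for this example and need not match the procedure of Chapter 2.2. The variance formula treats the weights as frequency weights with denominator sum(w) - 1, which is our reading of the default behaviour of wtd.var.

```r
# Count transitions: positions whose state differs from the preceding one
n_trans <- function(x) sum(x[-1] != x[-length(x)])

# Hypothetical 24-month sequence with a short LAT spell in the first year
monthly <- c(rep("S", 5), rep("LAT", 2), rep("S", 5), rep("MAR", 12))

# Assumption: the yearly version keeps the state of the first month of each year
yearly <- monthly[seq(1, length(monthly), by = 12)]

print(n_trans(monthly))  # S -> LAT -> S -> MAR
print(n_trans(yearly))   # S -> MAR

# Weighted mean and SD for hypothetical transition counts and weights
trans  <- c(3, 1, 4, 2)
w      <- c(1.2, 0.8, 1.0, 1.0)
w_mean <- sum(w * trans) / sum(w)
w_sd   <- sqrt(sum(w * (trans - w_mean)^2) / (sum(w) - 1))

print(c(mean = w_mean, sd = w_sd))
```

The short LAT spell disappears in the yearly sequence, which is why coarser data tend to show fewer transitions.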

Transition rates between states can be computed with seqtrate. Again, we display the transition rates for sequences with monthly and yearly granularity (partner.month.seq and partner.year.seq). It is also possible to calculate transition rates at specific positions of the sequence by setting time.varying = TRUE. This option is also used for generating the animated illustration below (Table 2.4).

seqtrate(partner.month.seq)
seqtrate(partner.year.seq)
Monthly granularity
State at t+1
State at t S LAT COH MAR
S 0.98 0.02 0.00 0.00
LAT 0.02 0.96 0.02 0.00
COH 0.00 0.00 0.98 0.01
MAR 0.00 0.00 0.00 1.00
Yearly granularity
State at t+1
State at t S LAT COH MAR
S 0.81 0.14 0.04 0.01
LAT 0.12 0.68 0.16 0.04
COH 0.04 0.02 0.80 0.14
MAR 0.01 0.01 0.00 0.98
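A transition rate is the share of cases in a given state at position t that are found in each state at position t+1. The following base-R sketch computes such a matrix for a hypothetical toy data set. It ignores weights and missing values and is meant as an illustration of the logic, not as a replacement for seqtrate.

```r
states <- c("S", "LAT", "COH", "MAR")

# Hypothetical toy data: 3 persons, 5 positions
toy <- rbind(
  c("S",   "S",   "LAT", "COH", "MAR"),
  c("S",   "LAT", "LAT", "S",   "S"),
  c("LAT", "COH", "COH", "MAR", "MAR")
)

# Pair every position with its successor
from <- factor(toy[, -ncol(toy)], levels = states)  # state at t
to   <- factor(toy[, -1],         levels = states)  # state at t+1

counts <- table(from, to)
rates  <- prop.table(counts, margin = 1)            # each row sums to 1

print(round(rates, 2))
```

With long spells, most pairs consist of two identical states, which explains the large diagonal entries in the monthly matrix above.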

Additional insights can be gained by using sequence data stored in the DSS format. A computation based on this format provides transition rates between episodes of distinct states, so the diagonal of the matrix is zero by construction. Note that we only use monthly sequence data for this exercise in order to keep track of short-lasting spells, which might be obscured in the yearly data (Table 2.5).

seqtrate(seqdss(partner.month.seq))
State at t+1
State at t S LAT COH MAR
S 0.00 0.91 0.07 0.02
LAT 0.42 0.00 0.50 0.08
COH 0.20 0.12 0.00 0.68
MAR 0.44 0.46 0.11 0.00
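The same logic can be sketched in base R for the DSS case: each sequence is first collapsed to its distinct successive states, and the transitions are tabulated afterwards. The three sequences are hypothetical.

```r
states <- c("S", "LAT", "COH", "MAR")

# Hypothetical sequences of unequal length
toy <- list(
  c("S", "S", "LAT", "LAT", "COH", "MAR", "MAR"),
  c("S", "LAT", "LAT", "S", "S", "LAT", "COH"),
  c("MAR", "MAR", "S")
)

# Collapse every spell to a single entry
dss <- lapply(toy, function(x) rle(x)$values)

# Pair each episode with the episode that follows it
from <- factor(unlist(lapply(dss, function(x) x[-length(x)])), levels = states)
to   <- factor(unlist(lapply(dss, function(x) x[-1])),         levels = states)

rates_dss <- prop.table(table(from, to), margin = 1)

print(round(rates_dss, 2))
```

Since two successive episodes can never belong to the same state, all diagonal entries are zero and each row shows where an episode of that state leads once it ends.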

State distribution at different positions (cross-sectional perspective)

Among other things, seqstatd computes the distribution of states at each position in the sequence. Usually this information is displayed graphically (state distribution plot) rather than as a descriptive table.

A tabular presentation of the state distribution typically requires extracting the distribution for a selection of (meaningful) positions of the sequence. In the example below we display the distribution at age 18, 20, 24, 28, 32, 36, and 40. Note that time is measured in months. Hence, we do not extract the descriptives at positions 1 and 3 but at positions 1 and 24 to obtain the state distribution at age 18 and 20 (Table 2.6).

seqstatd(partner.month.seq)$Frequencies[,c(1, seq(24, 264, by = 48))]
State distribution at age
State 18 20 24 28 32 36 40
S 0.65 0.52 0.36 0.25 0.18 0.15 0.14
LAT 0.31 0.32 0.25 0.17 0.12 0.10 0.05
COH 0.03 0.11 0.23 0.25 0.22 0.15 0.13
MAR 0.01 0.05 0.17 0.33 0.48 0.60 0.68
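The position arithmetic can be checked with a few lines of base R. The sketch assumes that position p of the monthly sequence refers to the p-th month after the 18th birthday, which is our reading of the indexing used here rather than something documented above.

```r
ages <- c(18, 20, 24, 28, 32, 36, 40)

# Positions used in the call above
pos <- c(1, seq(24, 264, by = 48))

# Under the stated assumption, age a > 18 is reached at position 12 * (a - 18)
pos_from_age <- c(1, 12 * (ages[-1] - 18))

print(setNames(pos, ages))
print(all(pos == pos_from_age))
```

The step of 48 corresponds to four years of monthly positions, so the selected columns are spaced four years apart after age 20.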

The following code extracts the corresponding cross-sectional values of the Shannon entropy for the same time points (Table 2.7).

seqstatd(partner.month.seq)$Entropy[c(1, seq(24, 264, by = 48))]
Shannon entropy at age …
18 20 24 28 32 36 40
0.58 0.78 0.97 0.98 0.90 0.80 0.69

At age 28 the states are most evenly distributed. As a result, the entropy value is highest at this age.
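As a plausibility check, the entropy values can be approximated from the rounded proportions in Table 2.6. The sketch below assumes that the reported entropy is the Shannon entropy divided by its maximum, the logarithm of the number of states. Small deviations from Table 2.7 are to be expected because the input proportions are rounded to two decimals.

```r
# State distribution from Table 2.6 (rounded values)
ages <- c(18, 20, 24, 28, 32, 36, 40)
dist <- rbind(
  S   = c(0.65, 0.52, 0.36, 0.25, 0.18, 0.15, 0.14),
  LAT = c(0.31, 0.32, 0.25, 0.17, 0.12, 0.10, 0.05),
  COH = c(0.03, 0.11, 0.23, 0.25, 0.22, 0.15, 0.13),
  MAR = c(0.01, 0.05, 0.17, 0.33, 0.48, 0.60, 0.68)
)
colnames(dist) <- ages

# Normalised Shannon entropy of one distribution
# (assumption: normalisation by log of the number of states)
entropy <- function(p, n_states = length(p)) {
  p <- p[p > 0]                       # treat 0 * log(0) as 0
  -sum(p * log(p)) / log(n_states)
}

H <- apply(dist, 2, entropy)
print(round(H, 2))
print(names(which.max(H)))            # age with the most even distribution
```

The entropy is 0 when all cases are in the same state and 1 when they are spread evenly across all four states.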

Finally, the seqistatd function, which returns the time each case spends in each state, can be used to determine the share of cases that never spent any time in a given state.

as_tibble(seqistatd(partner.month.seq)) %>%
  mutate_all(~case_when(. == 0 ~ 1,
                        TRUE ~ 0)) %>%
  summarise_all(~(weighted.mean(., w = family$weight40)))  
S LAT COH MAR
0.12 0.07 0.22 0.25

According to the monthly partnership data, 25% of the sample did not spend a single month in wedlock. Another 12% were never observed in the state “Single”, that is, they spent the entire observation period in some sort of partnership.
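The same computation can be written without the tidyverse. The sketch below uses a hypothetical matrix in the layout returned by seqistatd (one row per case, one column per state, entries are months) and hypothetical weights.

```r
# Hypothetical months spent in each state (rows = cases)
time_in_state <- rbind(
  c(S = 10, LAT = 2, COH = 0, MAR = 0),
  c(S = 0,  LAT = 3, COH = 4, MAR = 5),
  c(S = 6,  LAT = 0, COH = 0, MAR = 6),
  c(S = 4,  LAT = 4, COH = 4, MAR = 0)
)
w <- c(1, 2, 1, 1)   # hypothetical case weights

# Indicator: TRUE if a case never visited the state
never <- time_in_state == 0

# Weighted share of cases that never visited each state
share_never <- colSums(never * w) / sum(w)

print(share_never)
```

Multiplying the logical matrix by the weight vector weights each row by its case weight, so the column sums divided by the total weight are weighted proportions.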

The state distribution in Table 2.6 indicates the dominance of the partnership states “Single” at the beginning and “Married” at the end of the sequence. This is also reflected in the sequence of modal states. The following commands extract the modal sequences using sequence data with monthly and yearly granularity (see Chapter 2.2).

modal.month.seq <- seqdef(as_tibble(seqmodst(partner.month.seq)))
print(modal.month.seq, format = "SPS")

modal.year.seq <- seqdef(as_tibble(seqmodst(partner.year.seq)))
print(modal.year.seq, format = "SPS")
Granularity Modal Sequence
Monthly data (S,102)-(MAR,162)
Yearly data (S,9)-(MAR,13)
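The construction of a modal sequence can be sketched in base R: for every position we pick the most frequent state and then print the result in an SPS-like notation. The toy data are hypothetical, and ties (which do not occur here) would be resolved in favour of the first state of the alphabet.

```r
states <- c("S", "LAT", "COH", "MAR")

# Hypothetical toy data: 3 persons, 4 positions
toy <- rbind(
  c("S",   "S",   "S",   "MAR"),
  c("S",   "LAT", "MAR", "MAR"),
  c("LAT", "S",   "MAR", "COH")
)

# Most frequent state at each position
modal <- apply(toy, 2, function(x) {
  names(which.max(table(factor(x, levels = states))))
})

# SPS-like notation: (state,duration) pairs joined by "-"
r   <- rle(modal)
sps <- paste0("(", r$values, ",", r$lengths, ")", collapse = "-")

# Is the modal sequence actually observed in the data?
observed <- any(apply(toy, 1, function(x) all(x == modal)))

print(sps)
print(observed)
```

In this example none of the three cases follows the modal sequence, which illustrates why it should be read as a summary of the position-wise distributions rather than as a typical biography.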

Usually the modal sequence is a hypothetical sequence that is not actually observed in the data. In contrast, seqrep aims at identifying those sequences that represent the data best. Note that this approach requires the computation of a distance matrix. The following commands illustrate the identification of representative sequences using yearly sequences (partner.year.seq). For details on computing sequence distances (seqdist) see Chapter 3.

partner.year.om <- seqdist(partner.year.seq, 
                           method="OM", sm="CONSTANT")

partner.year.rep <- seqrep(partner.year.seq, 
                           diss = partner.year.om, 
                           criterion="density")

The following command prints the sequences in the more accessible SPS format.

print(partner.year.seq[attributes(partner.year.rep)$Index,], format = "SPS")

Descriptive statistics on the quality of the representative sequences are stored in attributes(partner.year.rep)$Statistics. They can be accessed by typing:

summary(partner.year.rep)

The table below presents a set of representative sequences (SPS format) and the corresponding coverage statistics. In addition, it shows how many cases are (more or less) represented by each of the extracted sequences (Table 2.8):

Sequence Coverage (in %) Assigned (in %)
(S,1)-(LAT,2)-(MAR,19) 5.7 6.5
(S,20)-(MAR,2) 4.4 25.2
(S,4)-(LAT,1)-(COH,1)-(MAR,16) 3.8 5.3
(LAT,3)-(COH,2)-(MAR,17) 3.1 11.4
(S,2)-(LAT,2)-(COH,3)-(MAR,15) 2.7 17.1
(S,5)-(LAT,2)-(COH,2)-(MAR,13) 2.7 23.5
(COH,2)-(MAR,20) 2.6 3.0
(S,1)-(LAT,5)-(MAR,16) 2.3 8.0
Total Coverage 27.5 100.0

We conclude this chapter by identifying the medoid sequences of women and men using the seqrep.grp function from {TraMineRextras}, which extracts representative sequences for different subgroups.

partner.year.sex.rep <- seqrep.grp(partner.year.seq,
                                   group = family$sex,
                                   diss=partner.year.om,
                                   criterion="dist",
                                   nrep=1,
                                   ret = "both")

# Medoid & coverage - men (family$sex = 0 = male)
print(partner.year.sex.rep[[1]]$` 0`, format = "SPS")
summary(partner.year.sex.rep[[1]]$` 0`)

# Medoid & coverage - women (family$sex = 1 = female)
print(partner.year.sex.rep[[1]]$` 1`, format = "SPS")
summary(partner.year.sex.rep[[1]]$` 1`)
Sex Sequence Coverage (in %)
Female (S,3)-(LAT,2)-(COH,4)-(MAR,13) 3.38
Male (S,7)-(LAT,4)-(COH,3)-(MAR,8) 0.99
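A medoid is the observed sequence with the smallest sum of distances to all other sequences of its group. The following base-R sketch illustrates this definition with hypothetical toy data. It uses the Hamming distance (the number of positions at which two sequences differ) as a simple stand-in for the optimal matching distances computed above, so it does not reproduce the results of seqrep.grp.

```r
# Hypothetical toy data: 6 persons, 4 positions
toy <- rbind(
  c("S", "S",   "LAT", "MAR"),
  c("S", "LAT", "LAT", "MAR"),
  c("S", "LAT", "COH", "MAR"),
  c("S", "S",   "S",   "S"),
  c("S", "S",   "S",   "LAT"),
  c("S", "S",   "LAT", "LAT")
)
group <- c(0, 0, 0, 1, 1, 1)   # hypothetical grouping variable

# Pairwise Hamming distances (stand-in for the OM distances)
n <- nrow(toy)
d <- outer(seq_len(n), seq_len(n),
           Vectorize(function(i, j) sum(toy[i, ] != toy[j, ])))

# Medoid of a group: row with the smallest sum of within-group distances
medoid_of <- function(idx) {
  idx[which.min(rowSums(d[idx, idx, drop = FALSE]))]
}
medoids <- tapply(seq_len(n), group, medoid_of)

print(medoids)
print(toy[medoids, ])
```

Restricting the distance matrix to the rows and columns of one group before summing ensures that each medoid is chosen with respect to its own subgroup only.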

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. Source code is available at https://github.com/sa-book/sa-book.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".