Sequence Analysis - Companion Site: Measuring Associations Between Sequences and Covariates II

Click here to get instructions…

Please download and unzip the replication files for Chapter 6 ( Chapter06.zip).
Read readme.html and run 6-0_ChapterSetup.R. This will create 6-0_ChapterSetup.RData in the sub folder data/R. This file contains the data required to produce the table shown at the bottom of this page.
We also recommend to load the libraries listed in the Chapter 6’s LoadInstallPackages.R

# assuming you are working within .Rproj environment
library(here)

# install (if necessary) and load other required packages
source(here("source", "LoadInstallPackages.R"))

# load environment generated in "6-0_ChapterSetup.R"
load(here("data", "R", "6-0_ChapterSetup.RData"))

Table 6.4 in Chapter 6.2.2 presents group comparisons based on a discrepancy analysis and an alternative approach recently proposed by Liao and Fasang (LiaoFasang2021?). In this example we examine the labor market trajectories stored in activity.year.seq. The group comparisons are using dummy variables stored in activity: east (living in East vs West Germany), sex (male vs. female), highschool (at least highschool degree: yes vs. no).

The analysis are done with {TraMineR}’s dissassoc function and the seqCompare function from the {TraMineRextras} package.

In the book we present the following table (note that we highlighted one line we are going to elborate on in the following paragraphs):

		Likelihood Ratio Test	Discrepancy Analysis
	BIC diff.	p-value	pseudo-R2	p-value	N
West vs East
Overall	0.01	0.028	0.01	0.001	1027
Men	3.69	0.013	0.00	0.019	506
Women	0.68	0.018	0.02	0.001	521
Men vs Women
Overall	24.85	0.000	0.09	0.001	1027
West	26.72	0.000	0.11	0.001	629
East	14.41	0.000	0.07	0.001	398
No highschool degree	50.67	0.000	0.12	0.001	602
West	62.98	0.000	0.17	0.001	338
East	19.91	0.000	0.09	0.001	264
Highschool degree	17.40	0.000	0.06	0.001	425
West	17.80	0.000	0.08	0.001	291
East	10.09	0.000	0.04	0.001	134

Obviously, the table includes a lot of group comparisons and it would require a lot of repetitive coding if each single comparisons would be explicitly spelled out. For that reason, we were using {purrr}’s map function to to do the comparisons within loops. We also wrote several lines of code for extracting and formatting the results to be displayed in a nice table using {knitr}’s kable and the {kableExtra} package. In the accompanying script file 6-2_Table_6-4_Discrepancies-BIC.R stored in Chapter06.zip) you can find the full code required to produce the table shown above.

On this page, however, we do not further elaborate on all these technicalities and rather focus on briefly showcasing the plain {TraMineR} commands required to obtain the results shown in the table. As you see, this does not require a lot of coding:

We illustrate the procedure by calling the functions required to obtain the figures shown in the highlighted row of table displayed above. The functions dissassoc and seqCompareare easy to use. The most “tricky” part for R novices is probably to apply the functions to the correct subset of the data. In our example we are interested in the question of whether the sequences of East German women are systematically different from the sequences of their West German counterparts. Hence, our analysis have to be restricted to women. While our sequences are stored in the object activity.year.seq any further information characterizing the individual who experience these sequences is stored in the dataframe activity. We start our analysis by inspecting how many women reside in East and West Germany. (table(activity$east[activity$sex==1])).

	Women
West Germany	325
East Germany	196

We then proceed by generating a logical vector (women) indicating which of the 1027 rows of activity is referring to a female respondent.

women <- activity$sex==1

# visual inspection
activity %>% 
  select(sex) %>% 
  mutate (women = women)

# A tibble: 1,027 × 2
          sex women
    <dbl+lbl> <lgl>
 1 1 [Female] TRUE 
 2 0 [Male]   FALSE
 3 1 [Female] TRUE 
 4 0 [Male]   FALSE
 5 0 [Male]   FALSE
 6 1 [Female] TRUE 
 7 0 [Male]   FALSE
 8 0 [Male]   FALSE
 9 1 [Female] TRUE 
10 1 [Female] TRUE 
# … with 1,017 more rows
# ℹ Use `print(n = ...)` to see more rows

We use this vector to subset women from the dissimilarity matrix activity.year.om, the sequence object activity.year.seq and the vector activity$east when we call dissassoc and seqCompare. Note that subsetting the pairwise distance matrix requires to extract the correct rows and columns [women,women], while subsetting activity.year.seq only requires to extract the desired rows [women, ].

Below we compute the values shown in the table. While the discrepancy analysis requires a dissimilarity matrix (activity.year.om[women,women]) as an input, seqCompare computes the dissimilarity matrix on its own and requires a sequence object (activity.year.seq[women,]) and instructions on how to compute the distances (method="OM", sm="CONSTANT") as an input. If you want to compare the results of the two approaches, the distances should be computed with the same method. Both functions require a vector (activity$east[women])specifying the groups to be compared. Note that seqCompare currently only accepts a dichotomous grouping indicator.

discr.region <- dissassoc(activity.year.om[women,women], 
                          group = activity$east[women])

bic.region <- seqCompare(activity.year.seq[women,], 
                         group = activity$east[women],
                         method="OM", sm="CONSTANT")

The output of seqCompare is quite parsimonious. For Table 6.4 we extracted the p-value (bic.region[,2]) and the BIC difference (bic.region[,3])

bic.region %>%
  kable(digits = c(2,3,2,2,2)) %>% 
  column_spec(2:3, background = "#8DCEA4") %>% 
  kable_styling(full_width =FALSE)

LRT	p-value	BIC diff.	Bayes Factor (Avg)	Bayes Factor (From Avg BIC)
5.98	0.018	0.68	1.67	1.41

dissassoc, on the other hand, produces quite a lot of output which can be stored as a list object. The results shown in the table are the total number of cases (discr.region$groups[3,1]), the pseudo-$R^2$ (discr.region$stat[3,1]), and the corresponding p-value (discr.region$stat[3,2]).

We close this section double checking if the subsetting worked as intended by inspecting the case numbers stored in discr.region$groups. The first row refers to East German women and the second one to West German women. The case numbers are as expected (see frequency table above). Hence, we can be confident that we achieved what we wanted.

kable(discr.region$groups) %>% 
    kable_styling(full_width =FALSE)

	n	discrepancy
0	325	11.87963
1	196	11.94612
Total	521	12.11270

For the more complex subsetting including all groups shown in Table 6.4 we refer you to the script 6-2_Table_6-4_Discrepancies-BIC.R. There we also illustrate how to extract, arrange, and format the outputs from dissassoc and seqCompare to produce the table shown at the beginning of this page.

Measuring Associations Between Sequences and Covariates II

References

Corrections

Reuse