Running sequenzo from R using reticulate
After our book was published, Yuqi Liang and her team developed Sequenzo, a package for social sequence analysis. Beyond making sequence analysis more accessible to Python users, this package offers significant advantages due to Python’s powerful computing capabilities: it is considerably faster than available R tools and better suited for larger datasets. Since we are not Python users ourselves, this brief introduction is specifically aimed at R users who want to leverage the power of Sequenzo from within R (as part of an R script) using the {reticulate} package.
If you have not installed Python and sequenzo yet you have to run the following two {reticulate} functions first:
# use (and install if necessary) pacman package
if (!require("pacman")) install.packages("pacman")
library(pacman)
# load and install (if necessary) required packages for this course
pacman::p_load(
kable, # Table
reticulate, # R interface to Python
tictoc, # for measuring the duration of distance computation
tidyverse, # universal toolkit for data wrangling and plotting
TraMineR # The sequence analysis toolkit for R
)
# Install Python and sequenzo using reticulate (needs to be done only once)
install_python()
py_install("sequenzo")Now you are ready to import the data for example application and sequenzo.
# Import Python modules
sequenzo <- import("sequenzo")
# load example data: family biographies from PAIRFAM
family <- readRDS("familybio.rds")The imported data frame family contains sequence data from 1,866 respondents of the German Family Panel (pairfam). The 264 sequence variables are numbered and start with the prefix state. They provide monthly information on family biographies - a combination of partnership status and parity—from age 18 to 40.
| # | State | Short Label |
|---|---|---|
| 1 | Single, no child | S |
| 2 | Single, child(ren) | Sc |
| 3 | LAT, no child | LAT |
| 4 | LAT, child(ren) | LATc |
| 5 | Cohabiting, no child | COH |
| 6 | Cohabiting, child(ren) | COHc |
| 7 | Married, no child | MAR |
| 8 | Married, 1 child | MARc1 |
| 9 | Married, 2+ children | MARc2+ |
# define long and short labels for sequence vars
shortlab.family <- c("S", "Sc",
"LAT", "LATc",
"COH", "COHc",
"MAR", "MARc1", "MARc2+")
longlab.family <-
c("Single, no child", "Single, child(ren)",
"LAT, no child", "LAT, child(ren)",
"Cohabiting, no child", "Cohabiting, child(ren)",
"Married, no child", "Married, 1 child", "Married, 2+ children")
# define sequence object in TraMineR
family.seq <- seqdef(data = select(family, starts_with("state")),
states = shortlab.family,
labels = longlab.family,
alphabet = 1:9,
id = family$id)sequenzo using {reticulate}# get data into recommended format
# see: https://sequenzo.yuqi-liang.tech/en/function-library/sequence-data
seqdata <- family |>
mutate(across(everything(), as.character)) |>
rename_with(~ str_remove_all(.x, "state"))
# sequence data to pyhton
df_py <- r_to_py(seqdata)
# set parameters
time_list <- as.character(1:264)
states <- as.character(1:9)
labels <- longlab.family
# define sequence data in sequenzo
dataset <- sequenzo$SequenceData(
df_py,
time = time_list,
id_col = "id",
states = states,
labels = labels
)
[>] SequenceData initialized successfully! Here's a summary:
[>] Number of sequences: 1866
[>] Number of time points: 264
[>] Min/Max sequence length: 264 / 264
[>] States: ['1', '2', '3', '4', '5', '6', '7', '8', '9']
[>] Labels: ['Single, no child', 'Single, child(ren)', 'LAT, no child', 'LAT, child(ren)', 'Cohabiting, no child', 'Cohabiting, child(ren)', 'Married, no child', 'Married, 1 child', 'Married, 2+ children']
[>] Weights: Not provided
Now we are set to compute the dissimilarity matrices. For this example we choose Optimal Matching with indel costs of 1 and constant substitution costs of 2.
# OM with TraMineR
om.const<-seqdist(family.seq,
method = "OM",
indel = 1,
sm = "CONSTANT",
norm = "none")
# OM with sequenzo
om.sequenzo <- sequenzo$get_distance_matrix(
seqdata = dataset,
method = "OM",
sm = "CONSTANT",
norm = "none",
full_matrix = TRUE
)As indicated by the developers, sequenzo’s get_distance_matrix is notably faster than TraMineR’s seqdist. On our test machine (Surface Pro 7+, 11th Gen Intel Core i7-1165G7 @ 2.80GHz, 4 cores; 16 GB RAM), TraMineR requires 5.62 minutes, whereas sequenzo needs only 2.18 minutes (39% of the time) to compute the distances.
Finally, we confirm that the results from both packages are identical.
# Visual inspection
om.sequenzo[1:5,1:5] 111000 1624000 2767000 2931000 3167000
111000 0 498 498 498 498
1624000 498 0 274 210 226
2767000 498 274 0 382 214
2931000 498 210 382 0 340
3167000 498 226 214 340 0
om.const[1:5,1:5] 111000 1624000 2767000 2931000 3167000
111000 0 498 498 498 498
1624000 498 0 274 210 226
2767000 498 274 0 382 214
2931000 498 210 382 0 340
3167000 498 226 214 340 0
# Test if distances are the same
all.equal(as.matrix(om.sequenzo), as.matrix(om.const))[1] TRUE
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. Source code is available at https://github.com/sa-book/sa-book.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".