Running sequenzo from R using reticulate
After our book was published, Yuqi Liang and her team developed Sequenzo, a package for social sequence analysis. Beyond making sequence analysis more accessible to Python users, this package offers significant advantages due to Python’s powerful computing capabilities: it is considerably faster than available R tools and better suited for larger datasets. Since we are not Python users ourselves, this brief introduction is specifically aimed at R users who want to leverage the power of Sequenzo from within R (as part of an R script) using the {reticulate} package.
If you have not installed Python and sequenzo yet, you first have to run the two {reticulate} functions at the end of the following code (install_python() and py_install()):
# use (and install if necessary) pacman package
if (!require("pacman")) install.packages("pacman")
library(pacman)

# load and install (if necessary) required packages for this course
pacman::p_load(
  knitr,       # tables via kable()
  reticulate,  # R interface to Python
  tictoc,      # for measuring the duration of distance computation
  tidyverse,   # universal toolkit for data wrangling and plotting
  TraMineR     # the sequence analysis toolkit for R
)

# Install Python and sequenzo using reticulate (needs to be done only once)
install_python()
py_install("sequenzo")
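Because the installation is needed only once, it can be convenient to guard it so that re-running the script does not reinstall anything. The following is a minimal sketch, assuming {reticulate} is already installed; `ensure_sequenzo()` is a hypothetical helper name introduced here for illustration and is not part of {reticulate} or sequenzo.

```r
# Hypothetical helper: install sequenzo only if Python cannot import it yet
ensure_sequenzo <- function() {
  if (!requireNamespace("reticulate", quietly = TRUE)) {
    stop("Please install the {reticulate} package first.")
  }
  # py_module_available() returns TRUE if the module can be imported
  if (!reticulate::py_module_available("sequenzo")) {
    reticulate::py_install("sequenzo")
  }
  invisible(reticulate::py_module_available("sequenzo"))
}

# ensure_sequenzo()  # uncomment to run the check in your own session
```

Note that the check itself initializes Python for the current R session, so it should be called before any other Python configuration is attempted.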
Now you are ready to import sequenzo and the data for the example application.
# Import Python modules
sequenzo <- import("sequenzo")

# load example data: family biographies from PAIRFAM
family <- readRDS("familybio.rds")
The imported data frame family contains sequence data from 1,866 respondents of the German Family Panel (pairfam). The 264 sequence variables are numbered and start with the prefix state. They provide monthly information on family biographies (a combination of partnership status and parity) from age 18 to 40.
| # | State | Short Label |
|---|---|---|
| 1 | Single, no child | S |
| 2 | Single, child(ren) | Sc |
| 3 | LAT, no child | LAT |
| 4 | LAT, child(ren) | LATc |
| 5 | Cohabiting, no child | COH |
| 6 | Cohabiting, child(ren) | COHc |
| 7 | Married, no child | MAR |
| 8 | Married, 1 child | MARc1 |
| 9 | Married, 2+ children | MARc2+ |
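The number of sequence variables follows from the observation window: 22 years of monthly observations yield 264 variables. The sketch below spells out this arithmetic in base R. It assumes that the variables are named state1 to state264 and that the first month corresponds to the 18th birthday; both are assumptions that should be checked against the data.

```r
# 22 years of monthly observations (age 18 up to age 40)
n_months <- (40 - 18) * 12

# Assumed variable names: state1, state2, ..., state264
state_vars <- paste0("state", seq_len(n_months))

# Approximate age (in years) at the start of each monthly observation,
# assuming month 1 starts at the 18th birthday
age_at_month <- 18 + (seq_len(n_months) - 1) / 12

n_months                      # 264
head(state_vars, 3)           # "state1" "state2" "state3"
age_at_month[c(1, 13, 264)]   # 18, 19, and one month short of 40
```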
# define long and short labels for sequence vars
shortlab.family <- c("S", "Sc",
                     "LAT", "LATc",
                     "COH", "COHc",
                     "MAR", "MARc1", "MARc2+")

longlab.family <- c("Single, no child", "Single, child(ren)",
                    "LAT, no child", "LAT, child(ren)",
                    "Cohabiting, no child", "Cohabiting, child(ren)",
                    "Married, no child", "Married, 1 child", "Married, 2+ children")

# define sequence object in TraMineR
family.seq <- seqdef(data = select(family, starts_with("state")),
                     states = shortlab.family,
                     labels = longlab.family,
                     alphabet = 1:9,
                     id = family$id)
sequenzo using {reticulate}
# get data into recommended format
# see: https://sequenzo.yuqi-liang.tech/en/function-library/sequence-data
seqdata <- family |>
  mutate(across(everything(), as.character)) |>
  rename_with(~ str_remove_all(.x, "state"))

# sequence data to Python
df_py <- r_to_py(seqdata)

# set parameters
time_list <- as.character(1:264)
states <- as.character(1:9)
labels <- longlab.family

# define sequence data in sequenzo
dataset <- sequenzo$SequenceData(
  df_py,
  time = time_list,
  id_col = "id",
  states = states,
  labels = labels
)
[>] SequenceData initialized successfully! Here's a summary:
[>] Number of sequences: 1866
[>] Number of time points: 264
[>] Min/Max sequence length: 264 / 264
[>] States: ['1', '2', '3', '4', '5', '6', '7', '8', '9']
[>] Labels: ['Single, no child', 'Single, child(ren)', 'LAT, no child', 'LAT, child(ren)', 'Cohabiting, no child', 'Cohabiting, child(ren)', 'Married, no child', 'Married, 1 child', 'Married, 2+ children']
[>] Weights: Not provided
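To see what the reshaping step does without access to the pairfam data, it can be reproduced on a toy data frame. This is a base R sketch rather than the {tidyverse} code used above; the data frame `toy` and its values are made up for illustration.

```r
# Toy wide-format data: an id column plus numbered state variables
toy <- data.frame(
  id     = c(101, 102),
  state1 = c(1, 3),
  state2 = c(1, 5),
  state3 = c(7, 5)
)

# Convert every column to character (base R analogue of mutate(across(...)))
toy_chr <- as.data.frame(lapply(toy, as.character), stringsAsFactors = FALSE)

# Drop the "state" prefix so the time columns are named "1", "2", "3"
names(toy_chr) <- sub("^state", "", names(toy_chr))

names(toy_chr)           # "id" "1" "2" "3"
sapply(toy_chr, class)   # every column is "character"
```

After this step, the time column names match a vector such as `as.character(1:3)` and the cell values match the state codes passed as character strings, which mirrors how `time_list` and `states` are defined in the chunk above.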
Now we are set to compute the dissimilarity matrices. For this example we choose Optimal Matching with indel costs of 1 and constant substitution costs of 2.
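To make the cost setting concrete, the following base R sketch computes an Optimal Matching distance for a single pair of short sequences with the textbook dynamic programming recursion. It is meant as an illustration only: `om_distance()` is a hypothetical helper written for this example, and it is not how TraMineR or sequenzo implement the computation internally.

```r
# Minimal OM sketch: edit distance with indel and constant substitution costs
om_distance <- function(a, b, indel = 1, sub_cost = 2) {
  n <- length(a)
  m <- length(b)
  # d[i + 1, j + 1] holds the cheapest cost of aligning the first i states
  # of a with the first j states of b
  d <- matrix(0, nrow = n + 1, ncol = m + 1)
  d[, 1] <- (0:n) * indel
  d[1, ] <- (0:m) * indel
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (a[i] == b[j]) 0 else sub_cost
      d[i + 1, j + 1] <- min(
        d[i, j] + cost,        # match or substitution
        d[i, j + 1] + indel,   # deletion of a[i]
        d[i + 1, j] + indel    # insertion of b[j]
      )
    }
  }
  d[n + 1, m + 1]
}

x <- c("S", "S", "LAT", "COH", "MAR")
y <- c("S", "LAT", "LAT", "COH", "COH")
om_distance(x, y)   # 4: two positions differ, each costing 2
```

With indel costs of 1 and substitution costs of 2, a substitution is exactly as expensive as one deletion plus one insertion, so two sequences of length 264 can be at most 2 × 264 = 528 apart.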
# OM with TraMineR
om.const <- seqdist(family.seq,
                    method = "OM",
                    indel = 1,
                    sm = "CONSTANT",
                    norm = "none")

# OM with sequenzo
om.sequenzo <- sequenzo$get_distance_matrix(
  seqdata = dataset,
  method = "OM",
  sm = "CONSTANT",
  norm = "none",
  full_matrix = TRUE
)
As indicated by the developers, sequenzo’s get_distance_matrix is notably faster than TraMineR’s seqdist. On our test machine (Surface Pro 7+, 11th Gen Intel Core i7-1165G7 @ 2.80GHz, 4 cores; 16 GB RAM), TraMineR requires 5.62 minutes, whereas sequenzo needs only 2.18 minutes (39% of the time) to compute the distances.
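Readers who want to time the two calls on their own machine could wrap them in a small helper. The sketch below uses base R's `system.time()` instead of {tictoc}; `elapsed_minutes()` is a hypothetical helper name, and `Sys.sleep()` is only a placeholder for an actual distance computation.

```r
# Hypothetical helper: elapsed wall-clock time of an expression in minutes
elapsed_minutes <- function(expr) {
  system.time(expr)[["elapsed"]] / 60
}

# Placeholder workload standing in for a distance computation
t_demo <- elapsed_minutes(Sys.sleep(0.2))
t_demo

# Relative runtime from the figures reported in the text
round(2.18 / 5.62, 2)   # 0.39
```

Timings depend heavily on hardware and on the number of sequences, so the ratio on another machine may differ from the one reported here.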
Finally, we confirm that the results from both packages are identical.
# Visual inspection
om.sequenzo[1:5, 1:5]
        111000 1624000 2767000 2931000 3167000
111000       0     498     498     498     498
1624000    498       0     274     210     226
2767000    498     274       0     382     214
2931000    498     210     382       0     340
3167000    498     226     214     340       0
om.const[1:5, 1:5]
        111000 1624000 2767000 2931000 3167000
111000       0     498     498     498     498
1624000    498       0     274     210     226
2767000    498     274       0     382     214
2931000    498     210     382       0     340
3167000    498     226     214     340       0
# Test if distances are the same
all.equal(as.matrix(om.sequenzo), as.matrix(om.const))
[1] TRUE
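A brief note on what this test checks: `all.equal()` compares numeric values within a small tolerance and, by default, also compares attributes such as row and column names. The base R sketch below illustrates both points on made-up toy matrices, not on the actual distance matrices.

```r
# Toy symmetric matrices; m2 equals m1 up to tiny floating-point noise
m1 <- matrix(c(0, 2, 4,
               2, 0, 6,
               4, 6, 0), nrow = 3, byrow = TRUE)
m2 <- m1 + 1e-12

identical(m1, m2)           # FALSE: exact comparison
isTRUE(all.equal(m1, m2))   # TRUE: equal within the default tolerance
max(abs(m1 - m2))           # largest absolute difference

# Same values, but with row and column names attached
m3 <- m1
dimnames(m3) <- list(c("a", "b", "c"), c("a", "b", "c"))

isTRUE(all.equal(m1, m3))                            # FALSE: attributes differ
isTRUE(all.equal(m1, m3, check.attributes = FALSE))  # TRUE: values only
```

Wrapping the call in `isTRUE()` is useful in scripts because `all.equal()` returns a character description rather than `FALSE` when the objects differ.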
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. Source code is available at https://github.com/sa-book/sa-book.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".