SA with Python and sequenzo

Running sequenzo from R using reticulate

After our book was published, Yuqi Liang and her team developed Sequenzo, a package for social sequence analysis. Beyond making sequence analysis more accessible to Python users, this package offers significant advantages due to Python’s powerful computing capabilities: it is considerably faster than available R tools and better suited for larger datasets. Since we are not Python users ourselves, this brief introduction is specifically aimed at R users who want to leverage the power of Sequenzo from within R (as part of an R script) using the {reticulate} package.

Setup

If you have not installed Python and sequenzo yet you have to run the following two {reticulate} functions first:

# use (and install if necessary) pacman package 
if (!require("pacman")) install.packages("pacman")
library(pacman)

# load and install (if necessary) required packages for this course
pacman::p_load(
  kable,      # Table
  reticulate, # R interface to Python
  tictoc,     # for measuring the duration of distance computation
  tidyverse,  # universal toolkit for data wrangling and plotting 
  TraMineR    # The sequence analysis toolkit for R
  )

# Install Python and sequenzo using reticulate (needs to be done only once)
install_python()
py_install("sequenzo")

Now you are ready to import the data for example application and sequenzo.

# Import Python modules
sequenzo <- import("sequenzo")


# load example data: family biographies from PAIRFAM
family <- readRDS("familybio.rds")

Preparing the data

The imported data frame family contains sequence data from 1,866 respondents of the German Family Panel (pairfam). The 264 sequence variables are numbered and start with the prefix state. They provide monthly information on family biographies - a combination of partnership status and parity—from age 18 to 40.

# State Short Label
1 Single, no child S
2 Single, child(ren) Sc
3 LAT, no child LAT
4 LAT, child(ren) LATc
5 Cohabiting, no child COH
6 Cohabiting, child(ren) COHc
7 Married, no child MAR
8 Married, 1 child MARc1
9 Married, 2+ children MARc2+

Step 1: Define a sequence object in R

# define long and short labels for sequence vars
shortlab.family <- c("S", "Sc", 
                     "LAT", "LATc", 
                     "COH", "COHc",
                     "MAR", "MARc1", "MARc2+")

longlab.family <- 
  c("Single, no child", "Single, child(ren)",
    "LAT, no child", "LAT, child(ren)", 
    "Cohabiting, no child", "Cohabiting, child(ren)", 
    "Married, no child", "Married, 1 child", "Married, 2+ children")

# define sequence object in TraMineR
family.seq <- seqdef(data = select(family, starts_with("state")),
                     states = shortlab.family,
                     labels = longlab.family, 
                     alphabet = 1:9,
                     id = family$id)

Step 2: Define a sequence object with sequenzo using {reticulate}

# get data into recommended format 
# see: https://sequenzo.yuqi-liang.tech/en/function-library/sequence-data
seqdata <- family |> 
  mutate(across(everything(), as.character)) |> 
  rename_with(~ str_remove_all(.x, "state"))
  
# sequence data to pyhton 
df_py <- r_to_py(seqdata)

# set parameters
time_list <- as.character(1:264)
states <- as.character(1:9)
labels <- longlab.family

# define sequence data in sequenzo
dataset <- sequenzo$SequenceData(
  df_py,
  time = time_list,
  id_col = "id",
  states = states,
  labels = labels
  )

[>] SequenceData initialized successfully! Here's a summary:
[>] Number of sequences: 1866
[>] Number of time points: 264
[>] Min/Max sequence length: 264 / 264
[>] States: ['1', '2', '3', '4', '5', '6', '7', '8', '9']
[>] Labels: ['Single, no child', 'Single, child(ren)', 'LAT, no child', 'LAT, child(ren)', 'Cohabiting, no child', 'Cohabiting, child(ren)', 'Married, no child', 'Married, 1 child', 'Married, 2+ children']
[>] Weights: Not provided

Compute dissimilarities & compare performance

Now we are set to compute the dissimilarity matrices. For this example we choose Optimal Matching with indel costs of 1 and constant substitution costs of 2.

# OM with TraMineR
om.const<-seqdist(family.seq,
                  method = "OM", 
                  indel = 1,
                  sm = "CONSTANT",
                  norm = "none")

# OM with sequenzo
om.sequenzo <- sequenzo$get_distance_matrix(
  seqdata = dataset, 
  method = "OM",
  sm = "CONSTANT",
  norm = "none",
  full_matrix = TRUE
)

As indicated by the developers, sequenzo’s get_distance_matrix is notably faster than TraMineR’s seqdist. On our test machine (Surface Pro 7+, 11th Gen Intel Core i7-1165G7 @ 2.80GHz, 4 cores; 16 GB RAM), TraMineR requires 5.62 minutes, whereas sequenzo needs only 2.18 minutes (39% of the time) to compute the distances.

Check results

Finally, we confirm that the results from both packages are identical.

# Visual inspection
om.sequenzo[1:5,1:5]
        111000 1624000 2767000 2931000 3167000
111000       0     498     498     498     498
1624000    498       0     274     210     226
2767000    498     274       0     382     214
2931000    498     210     382       0     340
3167000    498     226     214     340       0
om.const[1:5,1:5]
        111000 1624000 2767000 2931000 3167000
111000       0     498     498     498     498
1624000    498       0     274     210     226
2767000    498     274       0     382     214
2931000    498     210     382       0     340
3167000    498     226     214     340       0
# Test if distances are the same
all.equal(as.matrix(om.sequenzo), as.matrix(om.const))
[1] TRUE

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. Source code is available at https://github.com/sa-book/sa-book.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".