… ggplotify sequence data plots
readme.html
and run
2-0_ChapterSetup.R
. This will create
2-0_ChapterSetup.RData
in the sub folder
data/R
. This file contains the data required to produce the
state distribution plot shown below.LoadInstallPackages.R
# assuming you are working within .Rproj environment
library(here)
# install (if necessary) and load other required packages
source(here("source", "load_libraries.R"))
# load environment generated in "2-0_ChapterSetup.R"
load(here("data", "R", "2-0_ChapterSetup.RData"))
Virtually all figures shown in the book were rendererd with
TraMineR::seqplot
which uses base R’s plot
function. While there is nothing wrong with that, many R users tend to
prefer {ggplot2}
for visualizing their
data.
We wrote a little R package named {ggseqplot}
that reproduces sequence plots from {TraMineR}
’s seqplot
using
{ggplot2}
. These plots are produced on
the basis of a sequence object defined with
TraMineR::seqdef
. The package automates the reshaping and
plotting of sequence data required to render the plots with {ggplot2}
.
{ggseqplot}
has been written after this book has been published. You can install and
load it with the following commands:
install.packages("ggseqplot")
library(ggseqplot)
For further information we refer to the {ggseqplot}
website and a blog
post on the website of the Sequence Analysis Association (SAA).
The following section briefly illustrates the package by rendering sequence distribution and index plots.
In the following example we are rendering state distribution plots of
the sequence object partner.child.year.seq
.
We start by comparing seqdplot
and
ggseqdplot
in their most basic specification:
seqdplot(partner.child.year.seq)
ggseqdplot(partner.child.year.seq)
Note that the default labeling of the x-axis values might produce
non-ideal results. The default labeling behavior forces labels to appear
at the first and the last position of the sequence data. The states
in-between are labeled according to base R’s pretty
function.
Of course, the labels can also be adjusted according to your need
using the regular {ggplot2}
functions. If you want to
access the value labels for the x-axis, you can either get them from the
sequence object you are plotting
(attributes(partner.child.year.seq)$names
) or by saving
your ggseqdplot
as an object and then extracting the
required information from the data
column
k
.
In the following example we utilize ggplot2::guide_axis
to provide enough space for printing the complete set of values labels
for the x-axis. Overlapping labels are avoided by printing the labels in
two rows (n.dodge = 2
):
<- ggseqdplot(partner.child.year.seq)
dplot
+ scale_x_discrete(labels = levels(dplot$data$k),
dplot guide = guide_axis(n.dodge = 2))
Alternatively, you could get rid of the long value labels altogether and just print the sequence position as labels on the x-axis.
ggseqdplot(partner.child.year.seq) +
scale_x_discrete()
As ggseqdplot
has been written with the intention to
mimic the behavior of seqdplot
it’s not very surprising
that both plots look very similar. Just like with seqplot
the ggplot
version of the plot allows to turn off weights
and to generate faceted plots by specifying the group
argument. Turn to the documentation for further possibilities to adjust
the outcome of ggseqdplot
:
ggseqdplot(partner.child.year.seq,
group = as_factor(family$sex), # `as_factor` not necessary; only used to get nice labels
weighted = FALSE) +
scale_x_discrete()
Note that it is not possible to obtain the faceted output by
manually specifying facet_wrap(~family$sex)
. That
is because ggseqdplot
internally reshapes the sequence data
into an aggregated format that is suitable for generating the plot. In
our example the resulting dataset includes one row for every combination
of sequence position \(k\) (22
positions in our case) and state of the alphabet \(A\) (9 distinct states). With one group
this yields a data set with \(22*9=198\) rows. The data for the faceted
version shown above has 396 rows with 198 for men and women each. The
grouping variable family$sex
, however, is of length 1866
(one row for each case).
If you want to access the data used for the plot you can save the plot as a list object:
<- ggseqdplot(partner.child.year.seq,
dplot group = as_factor(family$sex),
weighted = FALSE)
$data dplot
# A tibble: 396 × 6
group state k x value grouplab
<fct> <fct> <fct> <fct> <dbl> <glue>
1 Female Single, no child state1 1 0.494 Female (n=1039)
2 Female Single, no child state13 2 0.409 Female (n=1039)
3 Female Single, no child state25 3 0.330 Female (n=1039)
4 Female Single, no child state37 4 0.276 Female (n=1039)
5 Female Single, no child state49 5 0.244 Female (n=1039)
6 Female Single, no child state61 6 0.218 Female (n=1039)
7 Female Single, no child state73 7 0.183 Female (n=1039)
8 Female Single, no child state85 8 0.140 Female (n=1039)
9 Female Single, no child state97 9 0.133 Female (n=1039)
10 Female Single, no child state109 10 0.122 Female (n=1039)
# … with 386 more rows
# ℹ Use `print(n = ...)` to see more rows
Of course, the figures produced with {ggseqplot}
can be adjusted like any other {ggplot2}
figure
library(hrbrthemes)
ggseqdplot(partner.child.year.seq) +
scale_x_discrete(labels = 18:39) +
labs(title = "State distribution plot",
x = "Age") +
guides(fill=guide_legend(title="Alphabet")) +
theme_ipsum_rc() +
theme(plot.title = element_text(size = 34,
margin=margin(0,0,20,0)),
plot.title.position = "plot",
axis.title.x = element_text(size=12),
axis.title.y = element_text(size=12))
We again start with a comparison of the default outputs for
seqIplot
and ggseqiplot
. We plot a sample of
50 sequences and sort the remaining sequences according to the time
spent in marriage (states 7 to 9).
set.seed(1234)
<- sample(1:nrow(partner.child.year.seq),50)
subset
seqIplot(partner.child.year.seq[subset,],
sortv = rowSums(seqistatd(partner.child.year.seq[subset,])[,7:9]))
ggseqiplot(partner.child.year.seq[subset,],
sortv = rowSums(seqistatd(partner.child.year.seq[subset,])[,7:9]))
The following code chunk illustrates how the plot can be fine-tuned
in the well-known {ggplot2}
fashion:
ggseqiplot(partner.child.year.seq[subset,],
sortv = rowSums(seqistatd(partner.child.year.seq[subset,])[,7:9])) +
scale_y_continuous(expand = expansion(add = c(.8, 0))) +
scale_x_discrete(breaks = c(seq(1,21,2),22),
labels = c(seq(18,38,2),39)) +
labs(title = "Sequence index plot",
x = "Age",
fill = "Alphabet",
caption = "Sequences sorted by marriage duration") +
guides(color="none") +
theme_ipsum_rc() +
theme(plot.title = element_text(size = 34,
margin=margin(0,0,20,0)),
plot.title.position = "plot",
plot.caption = element_text(size=12),
axis.title.x = element_text(size=12),
axis.title.y = element_text(size=12))
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. Source code is available at https://github.com/sa-book/sa-book.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".