using StatsBase, DataFrames, CategoricalArrays, RDatasets
using CairoMakie, AlgebraOfGraphics
set_aog_theme!()
6 The AlgebraOfGraphics package
The StatsPlots package has been used to illustrate the standard graphics of exploratory statistics. That package leverages Plots, a Julia interface to multiple plotting backends. The GR backend renders the images seen. There are a few alternatives. The Makie (Danisch and Krumbiegel 2021) plotting system along with the AlgebraOfGraphics (Vertechi et al. n.d.) package makes a very compelling one.
The AlgebraOfGraphics package offers a declarative style to create statistical graphics. An example from the documentation shows the code to do the following: “declare the dataset; declare the analysis; declare the arguments used; declare the grouping and the respective visual attribute; draw the visualization.” This is all done through a series of composable commands, illustrated by example below. The Pumas project has a much more extensive tutorial than is presented here.
We will see that it is very easy to visualize multiple variables through an appropriate choice of graphic or transformation, with further choices of coloring, faceting, or other means to demarcate different factors. The “declarative” style shines here, as the user simply specifies a variable, and the package converts this, as needed, to a color or shape.
We begin by loading the packages. The CairoMakie backend is used here; GLMakie is good for interactive usage at the command line, and WGLMakie is for web-based graphics. All are part of the same Makie plotting ecosystem.
We use the AlgebraOfGraphics color theme, as declared in the last command, set_aog_theme!(). The packages are compute-intensive and can take a while to load.
Following the package tutorial, we load the Palmer penguins data set of Allison Horst. This includes data collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network. The data can be downloaded from the GitHub site, but it is also wrapped into a Julia package:
using PalmerPenguins
penguins = dropmissing(DataFrame(PalmerPenguins.load()))
first(penguins, 3)
Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
---|---|---|---|---|---|---|---|
 | String15 | String15 | Float64 | Float64 | Int64 | Int64 | String7 |
1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
This data set has several correlated numeric variables on bill length, bill depth, flipper length, and body mass; and several categorical variables, such as species, island, and sex. An even more complete data set can be downloaded from the GitHub site.
6.1 Univariate graphical summaries
We run through the basic graphics for univariate statistics. We shall see that the framework makes multi-variate display quite direct, and at times easier than a univariate display.
6.1.1 Boxplot and violin plot
A boxplot (Figure 6.1) for each species is created by composing a series of declarative commands:
p = data(penguins) *
    visual(BoxPlot) *
    mapping(:species, :bill_length_mm => "Bill length (mm)", color=:species);
This illustrates many of the idioms used in AlgebraOfGraphics.
The data(penguins) command sets up the data. Here a data frame is passed, but this can be any Tables-compatible structure, such as a named tuple of vectors, as in data((; x, y)) for some pair of variables x and y.
The mapping call takes values in the data to positions and attributes of the graphic. It uses position to identify the x, y, and (at times) z values for the graphic. The y variable specification above illustrates a mini language nearly identical to the DataFrames mini language. For a box plot, an indicator of the groups goes in the x position, the data values in the y position. The color=:species argument uses a mapping between the levels of the :species variable and a color ramp to give the graphic a distinct color for each species. Omitting this argument produces a monotone graphic with the chosen theme.
The visual(BoxPlot) command declares the visualization or transformation to be used to view the data. The visual function expects a type indicating the plot type to use and optional keyword arguments. In this case, BoxPlot is the type associated with the Makie.boxplot function. At times this type must be qualified, such as with Text, for annotations.
Both the mapping and visual calls can be used to set attributes:
visual is used to set attributes for each element independent of the data. For example, a box plot has the argument orientation, which is not data dependent, so it is adjusted within the visual call.
mapping is used to have attributes depend on values of a variable, as color is used above.
The attributes are those for the underlying plotting function. For visual(BoxPlot), these can be seen at the help page for boxplot, displayed with the command ?boxplot.
The mapping call shows two uses of the mini language for data manipulation. The basic form is source => function => target and works very much like the DataFrames mini language does for select or transform, but unlike those, the function is always applied by row. This makes some transformations, such as \(z\)-scores, impossible within this call; transformations requiring the entire column need to be done within the values passed to data. The abbreviated forms are just source, as used with the color=:species argument; source => function; and source => target, such as :bill_length_mm => "Bill length (mm)" used to rename the variable for labeling purposes. When the source involves more than one column selector, tuples should be used to group them.
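The remark about column-wise transformations can be sketched as follows. This is a minimal illustration, not code from the package tutorial: the table toy and the helper standardize are hypothetical names introduced here, and the \(z\)-scores are computed with DataFrames before anything is handed to data.

```julia
using DataFrames, Statistics

# Hypothetical toy table standing in for the penguins data frame
toy = DataFrame(species = ["a", "a", "a", "b", "b", "b"],
                bill_length_mm = [39.1, 39.5, 40.3, 46.1, 50.0, 48.7])

# Needs the mean and standard deviation of the entire column,
# so it cannot be written as a row-wise function inside mapping
standardize(x) = (x .- mean(x)) ./ std(x)

# transform applies the function to the whole column and adds the result
toy_z = transform(toy, :bill_length_mm => standardize => :bill_length_z)

# toy_z could then be passed on, e.g. data(toy_z) * mapping(:species, :bill_length_z)
```

The new column can then be referenced in mapping like any other; only the computation itself has to happen beforehand.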
A few functions are provided to bypass the usual mapping of the data. (For example, color maps levels of a factor to a color ramp behind the scenes.) Among these are nonnumeric, to pass a numeric variable to an argument expecting a categorical variable, and verbatim, to avoid this mapping. The latter, => verbatim, will be necessary to add when annotating a figure.
The object p can be rendered to the screen with the draw method, resulting in Figure 6.1. Just draw(p) will render the graphic; the following also shows how the figure keyword argument can be used to set attributes using a named tuple, in this case the figure size. Similarly, axis values can be modified in this manner. In the following, we set a title attribute for the axis.
draw(p; figure=(; size=(600, 400)),
     axis=(; title="Bill length"))
The output of draw is used to render to the screen and also to save to a file (as a .png or .svg file). The pattern save("filename.[png|svg]", draw(p)) will save the image to the named file using the given extension to specify the format used.
This is the basic pattern where different choices are combined, or merged, with the * operation. The pieces can be re-purposed. In the following, we make use of this data:
d = data(penguins);
Box plots are very effective for quickly comparing distributions of a numeric variable across the levels of some factor. The calling syntax favors that style, where both an x and y value are specified to mapping. To create a box plot of a single variable, without grouping, the graphic takes a bit more to construct. In the following we create a single-valued x variable to produce the upper left graphic in Figure 6.2:
p1 = d * visual(BoxPlot) *
    mapping(1 => one, :bill_length_mm => "Bill length (mm)");
The mini language is used above in two different ways: with a function to create the single value for x (AlgebraOfGraphics will treat this as a factor, so one isn’t needed, just some single-valued function) and with a target for labeling the y variable. As mentioned, such transformations can also be done within the data frame before it is passed to data, which is necessary for some types of transformations.
To add another layer, in this case a scatter plot, we can add the plotting objects:
p2a = d * visual(BoxPlot) * mapping(:species, :bill_length_mm, color=:species)
p2b = d * visual(Scatter) * mapping(:species, :bill_length_mm)
p2 = p2a + p2b;
The Scatter visual plots pairs of values as points in a Cartesian plane.
Combinations with + add a layer; those with * merge layers. The algebra name also refers to algebraically desirable shortcuts. For example, we repeat d and the mapping for each of p2a and p2b, but these can be used just once by distributing them:
m = mapping(:species, :bill_length_mm => "bill length (mm)", color=:species);
p3 = d * (visual(BoxPlot) + visual(Scatter)) * m;
Both p2 and p3 are shown in the lower row of Figure 6.2. There is just one slight difference: the dots representing the data in p2 are not colored, as the mapping did not instruct that in forming p2b.
Specifying a violin plot requires just a slight modification to the above: we change the BoxPlot visual to Violin. Violin plots have an argument side that allows both sides of the violin to reflect an extra grouping variable. We use the :sex variable in the following, as it has only two levels. With this, each side of the violin plot reflects grouping by the :sex factor; the legend is used to look up which level of the factor is represented.
p4 = d * visual(Violin) * mapping(:species, :bill_length_mm, color=:species, side=:sex);
The visual(Violin) call wraps the function Makie.violin, whose documentation contains additional possible arguments beyond side.
The AlgebraOfGraphics package builds on the Makie package and can use its layout system. Makie’s layout system leverages matrix notation to specify cell position. The draw! method accepts a figure object as a first argument. In Figure 6.2 we lay out 2 rows and 2 columns of figures, as follows:
f = Figure()
draw!(f[1,1], p1)
draw!(f[1,2], p4)
draw!(f[2,1], p2)
draw!(f[2,2], p3)
f
6.1.2 Dot plot
The boxplot does an excellent job of summarizing a data set with a few indicators making it quite useful when there are many data points. A dot plot is useful when there are a limited number of values and advantageous as the graphic shows all the data.
A dot plot (Figure 6.3) can be constructed easily enough by ensuring, in this case, the y variable is non-numeric:
huddle = penguins[sample(1:size(penguins,1), 50),:] # a sample

p1 = data(huddle) * visual(Scatter) *
    mapping(:bill_length_mm=>"Bill length (mm)", :species => nonnumeric);
(In this example, species is categorical, so the extra => nonnumeric is unnecessary.)
Compare the above to a boxplot of the same sampled data:
p2 = data(huddle) * visual(BoxPlot) *
    mapping(:species, :bill_length_mm => "Bill length (mm)"; color=:species);
The boxplot makes it easy to compare medians across the levels of the species factor to gauge graphically if there is a differentiated effect on the response.
The following is an enhanced dot plot which emphasizes a comparison of center by adding a line and sorting so that this line only moves to the right as the eye travels up the levels of the factor. The code is adapted from Alday et al. (2022).
"`dotplot`: show values for each group as dotplot sorted by some center"
function _arrange_dotplot_data(df, value::Symbol, group::Symbol, center=mean;
                               jitter=true)
    transform!(df, value => Array, group => CategoricalArray;
               renamecols=false) # set up types

    sumry = combine(groupby(df, group), value => center => value)
    sort!(sumry, value)
    ordered_levels = string.(sumry[!, group])
    levels!(sumry[!, group], ordered_levels) # relevel, used in plotting
    levels!(df[!, group], ordered_levels)
    jitter && (df = combine(groupby(df, group),
                            value => (x -> x .+ std(x)/100), renamecols=false))

    df, sumry
end

df, sumry = _arrange_dotplot_data(huddle, :bill_length_mm, :species, median)
mm = mapping(:bill_length_mm => "Bill length (mm)", :species)
p3 = data(df) * mm *
    visual(Scatter; marker='○', markersize=12) # use a character for a marker
p3 += data(sumry) * mm * visual(Lines); # add summary line
All these figures appear in Figure 6.3.
6.1.3 Faceting
The package also supports faceting, where different panels share the same scales, allowing easy cross comparison. Faceting is specified through the keyword layout or either (or both) of the row and col keywords. The layout keyword uses levels of the variable name it is passed and arranges the plots over these levels. A col declaration will make columns for each level of the specified variable, whereas a row declaration will create rows for each level of the specified variable. By default both the x and y axes are linked. These linkings can be decoupled when drawing by passing in values to the facet argument, along the lines of: draw(p, facet=(; linkxaxes=:none, linkyaxes=:none)).
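The faceting keywords described above can be sketched with a small example. The data here are hypothetical toy values (the names toy, group, panel, and value are introduced only for illustration); the sketch assumes col is passed through mapping and that the facet options are given to draw as a named tuple, as in the pattern just quoted.

```julia
using AlgebraOfGraphics, CairoMakie

# Hypothetical toy data; any Tables-compatible object works with data()
toy = (; group = repeat(["a", "b"], inner=10),
         panel = repeat(["u", "v"], outer=10),
         value = collect(1.0:20.0))

# One column of panels per level of :panel
p = data(toy) * visual(BoxPlot) * mapping(:group, :value, col=:panel)

# Decouple the y axes of the panels; the x axes stay linked by default
fg = draw(p; facet=(; linkyaxes=:none))
```

Replacing col=:panel with row=:panel or layout=:panel changes only how the panels are arranged, not the layers themselves.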
6.1.4 Histograms
The AlgebraOfGraphics package has certain functions it refers to as transformations of the data. These include histogram, density, frequency, linear, smooth, and expectation; most will be illustrated by example below.
These are used like visual was above, but arguments are passed directly to the transformation.
The histogram function plays the role of visual in this graphic. (The visual function is still useful to apply data-independent attributes.) Here we arrange to color by species:
p1 = d * histogram() * mapping(:bill_length_mm, color=:species);
The histograms overlap. The layout keyword can be used to declare one panel per level. We do this with :sex:
p2 = d * histogram() * mapping(:bill_length_mm, color=:species, layout=:sex);
See Figure 6.4 for the graphics.
6.1.5 Density plot
The histogram function has options for overriding the default bin selection and has several options for scaling the figure through its normalization argument. We use this in the next graphic, which layers a density plot over a scaled histogram using the :pdf scaling. The density transformation is qualified with the module name to prevent a conflict with one in Makie (see footnote 1).
layers = histogram(normalization=:pdf) + AlgebraOfGraphics.density()
p3 = d * layers * mapping(:bill_length_mm, color=:species, layout=:sex);
In this next figure we add in a scatter plot of the data on top of the density plots. For the scatter plot, we use the Scatter visual, for which we create jittered \(y\) values to disambiguate the data; these are added as a column to the data in d1, below:
p4a = d * AlgebraOfGraphics.density() *
    mapping(:bill_length_mm, color=:species)

d1 = data(transform(penguins,
    :bill_length_mm => ByRow(x -> 0.02 * rand()) => :ys))
p4b = d1 * visual(Scatter) * mapping(:bill_length_mm, :ys, color=:species)
p4 = p4a + p4b;
6.1.6 Quantile-normal plots
The QQNorm and QQPlot visuals are used to make quantile-quantile plots; QQNorm expects a mapping to :x (first position) whereas QQPlot expects mappings to :x and :y (the first two positions).
The following will give a visual check if bill length is normally distributed; the graphic indicates slightly shorter tails than expected:
p1 = data(penguins) * visual(QQNorm, qqline=:fit) *
    mapping(:bill_length_mm);
The following will give a visual check if bill length has a similarly shaped distribution as bill depth, in this case with each species highlighted:
p2 = data(penguins) * visual(QQPlot, qqline=:fit) *
    mapping(:bill_length_mm, :bill_depth_mm, color=:species);
Both are shown in Figure 6.5.
6.2 Line plots
A scatter plot shows \(x\) and \(y\) pairs as points; a line plot connects these points. There are numerous ways to draw lines with AlgebraOfGraphics, including: visual(Lines), for connect-the-dots lines; visual(LinesFill), for shading; visual(HLines) and visual(VLines), for horizontal and vertical lines; and visual(Rangebars), to draw vertical or horizontal line segments.
The graph of a function can be drawn using Lines, as in this example, where we add in different range bars to emphasize the role that the two parameters play in this function’s graph:
ϕ(x; μ=0, σ=1) = 1/sqrt(2*pi*σ^2) * exp(-(1/(2σ^2)) * (x - μ)^2)
xs = range(-3, 3, length=251)
ys = ϕ.(xs)
c = data((x=xs, y=ys)) * visual(Lines) * mapping(:x, :y)

c += data(DataFrame(x=0, hi=ϕ(0), lo=0)) * visual(Rangebars) *
    mapping(:x, :hi, :lo)
c += data(DataFrame(xmin=0, xmax=1, y=ϕ(1))) * visual(Rangebars, direction=:x) *
    mapping(:y, :xmin, :xmax)
c += data((x=[1/10, 1/2], y=[0, ϕ(1)], label=["μ", "σ"])) *
    visual(Makie.Text) *
    mapping(:x, :y, text = :label => verbatim)
draw(c)
The Rangebars visual has a direction argument, used above to make a horizontal range bar.
The annotation has two subtleties. The qualification of Makie.Text is needed, as there is a Text type in base Julia. More idiosyncratically, the use of verbatim in mapping is needed to avoid an attempt to map the labels to a glyph, such as a pre-defined marker.
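For reference, the function ϕ plotted in this section is intended to be the density of the normal distribution with mean \(\mu\) and standard deviation \(\sigma\). The standard formula, stated here as a check on the code rather than taken from the package documentation, is:

```latex
\[
\phi(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
\]
```

With the defaults \(\mu = 0\) and \(\sigma = 1\), \(\phi(0) = 1/\sqrt{2\pi} \approx 0.399\) and \(\phi(1) \approx 0.242\). The vertical range bar at \(x = 0\) rises to the first of these heights, and the horizontal range bar sits at the second.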
6.3 Bivariate relationships
Scatterplots with trend lines are easily produced within the AlgebraOfGraphics framework: the Scatter visual creates scatter plots; for trend lines there is the smooth transformation to fit a loess line, and the linear transformation to fit linear models.
This first set of commands shows how to fit a smoother (upper left graphic in Figure 6.6). The smooth function has arguments which pass on to Loess.loess.
layers = visual(Scatter) + smooth()
p1 = d * layers * mapping(:bill_length_mm, :bill_depth_mm);
The linear function draws the fitted regression line and shades an interval automatically (the interval argument). Linear prediction under model assumptions provides a means to identify confidence intervals for the mean response (the average value were the covariates held fixed and the response repeatedly sampled) and for the predicted response for a single observation. The latter are wider, as single observations have more variability than averages of observations. A value of nothing suppresses this aspect.
This next set of commands shows (upper-right graphic of Figure 6.6) one way to add a linear regression line. As the mapping for linear does not include the grouping variable (color), the line is based on all the data:
d1 = d * mapping(:bill_length_mm, :bill_depth_mm)
p2a = d1 * visual(Scatter) * mapping(color=:species)
p2b = d1 * linear()
p2 = p2a + p2b;
With this next specification, by contrast, color is mapped for both the linear transformation and the Scatter visual. This groups the data, and separate lines are fit to each group. We can see (lower-left graphic of Figure 6.6) that whereas the entire data shows a negative correlation, the cohorts are all positively correlated, an example of Simpson’s paradox.
layers = visual(Scatter) + linear()
p3 = d1 * layers * mapping(color=:species);
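The sign reversal behind Simpson’s paradox can be reproduced with a few numbers. The following is only a sketch with hypothetical toy values, not the penguin measurements: two groups are constructed so that the relationship is increasing within each group while the pooled data trend downward.

```julia
using Statistics

# Hypothetical toy values (not the penguin measurements): two groups,
# each with an increasing relationship between x and y
xa, ya = [1.0, 2.0, 3.0], [7.0, 8.0, 9.0]   # group A
xb, yb = [6.0, 7.0, 8.0], [1.0, 2.0, 3.0]   # group B

r_a = cor(xa, ya)                        # correlation within group A
r_b = cor(xb, yb)                        # correlation within group B
r_all = cor(vcat(xa, xb), vcat(ya, yb))  # correlation with the groups pooled

println((r_a, r_b, r_all))
```

The within-group correlations are positive while the pooled correlation is negative, which is the same pattern the grouped and ungrouped regression lines display for the three species.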
Adding layout=:sex shows more clearly (lower-right graphic of Figure 6.6) that each group has a regression line fit; that is, the multiplicative model is fit.
p4 = d1 * layers * mapping(color=:species, layout=:sex);
6.3.1 Corner plot
A corner plot, as produced by the PairPlots package through its pairplot function, is a quick plot to show pair-wise relations amongst multiple numeric values. The graphic uses the lower part of a grid to show paired scatterplots with, by default, contour lines highlighting the relationship. On the diagonal are univariate density plots.
using PairPlots
nms = names(penguins, 3:5)
p = select(penguins, nms .=> replace.(nms, "_mm" => "", "_" => " ")) # adjust names
pairplot(p)
6.3.2 3D scatterplots
A 3-d scatter plot of 3 numeric variables can be readily arranged, with just one unexpected trick:
The mapping object should contain an x, y, and z variable specification with numeric variables.
The draw call should include an axis = (type = Axis3,) argument, specifying that a 3D (Makie) axis should be used in the display.
d = data(penguins)
p = d * mapping(:bill_length_mm => :bl, :bill_depth_mm => :bd, :flipper_length_mm=>:fl; color=:species,
    row=:sex, col=:island)
draw(p, axis=((type=Axis3,)))
6.4 Categorical data
The surveyed species are not equally represented in the data. A bar chart can illustrate this (upper-left graphic of Figure 6.8). The frequency transform does the counting:
p1 = d * frequency() * mapping(:species);
Two categories can be illustrated together; we need dodge set here to avoid overplotting of the bars. In this example, following the AlgebraOfGraphics tutorial, we add in information about the island. This shows (upper-right graphic of Figure 6.8) that two species are found on just 1 island, whereas Adelie is found on all three.
p2 = d * frequency() *
    mapping(:species, color=:island, dodge=:island);
Using stack in place of dodge presents a stacked bar chart (lower-left graphic of Figure 6.8):
p3 = d * frequency() *
    mapping(:species, color=:island, stack=:island);
A third category can be introduced using layout, col, or row (lower-right graphic of Figure 6.8):
p4 = d * frequency() *
    mapping(:species, color=:island, stack=:island) *
    mapping(row=:sex);
6.5 Customizing plots through axis
There are numerous customizations available when drawing a plot. We discuss a small handful of them here. See the PumasAI tutorial and the documentation for more details.
The draw command allows the passing of values to the axis mechanism of Makie. This allows customization of various features such as the title, the ticks, the aspect ratio, and the grids.
Makie plots are themeable. In the above we used set_aog_theme!(). This theme sets a number of defaults for the axis attributes:
Axis = (
xgridvisible=false,
ygridvisible=false,
topspinevisible=false,
rightspinevisible=false,
bottomspinecolor=:darkgray,
leftspinecolor=:darkgray,
xtickcolor=:darkgray,
ytickcolor=:darkgray,
xticklabelfont=lightfont,
yticklabelfont=lightfont,
xlabelfont=mediumfont,
ylabelfont=mediumfont,
titlefont=mediumfont,
)
To override these or pass other attributes on to the rendering, the axis keyword argument accepts a named tuple of values. So, for example, to set the graphic's title, we would use axis=(; title="Some title"); to rotate the tick labels on the x axis of a bar plot, we would use axis=(; xticklabelrotation = pi/2). Of course, these would typically be combined in a single named tuple.
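Combining several axis attributes in one call might look like the following sketch. The toy table and its column names are hypothetical, introduced only for illustration; the two attributes are the ones named in the paragraph above.

```julia
using AlgebraOfGraphics, CairoMakie

# Hypothetical toy counts for a small bar plot
toy = (; kind = ["alpha", "beta", "gamma"], count = [3, 5, 2])

p = data(toy) * visual(BarPlot) * mapping(:kind, :count)

# Both attributes travel in the same named tuple passed to axis
fg = draw(p; axis=(; title="Some title", xticklabelrotation=pi/2))
```

Attributes given this way apply to that one draw call and take precedence over the defaults set by the theme.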
The following lists some useful attributes. A complete list is in the Makie docs for the Axis constructor.
The aspect ratio for a graphic is adjustable through the aspect attribute.
The following labeling attributes can be adjusted: title, subtitle, xlabel, ylabel. These take a string (or an observable) for the value to display. The display of this value can be adjusted; for example, there are titlealign, titlecolor, titlefont, titlesize, and titlevisible attributes. Similar attributes exist for the other labels.
An axis has ticks. These are often numbers. For the ticks on an x axis there are attributes xticks, xtickcolor, xtickformat, xticksize, and xtickwidth. Similarly with y. There are also minor ticks, adjustable with, for example, xminorticks, xminortickcolor, xminorticksize, etc.
For ticks representing categorical values, labels are used. Attributes for tick labels include: xticklabelalign, xticklabelcolor, xticklabelfont, xticklabelrotation, and xticklabelsize.
The displayed grid is adjustable through attributes like xgridcolor, xgridstyle, xgridvisible, xgridwidth, along with “minor” versions.
For 3-dimensional plots, the Axis3 object is used for display. This has similarly named attributes for z values.
1. The Makie density function could be accessed through visual(Density) without module qualification. The density function in AlgebraOfGraphics has a nice transparency feature which makes its use desirable.