6  The AlgebraOfGraphics package

The StatsPlots package has been used to illustrate the standard graphics of exploratory statistics. That package leverages Plots, a Julia interface to multiple plotting backends. The GR one renders the images seen. There are a few alternatives. The Makie (Danisch and Krumbiegel 2021) plotting system along with the AlgebraOfGraphics (Vertechi et al. n.d.) package makes a very compelling one.

The AlgebraOfGraphics packages offers a declarative style to create statistical graphics. An example from the documentation shows the code to do the following “declare the dataset; declare the analysis; declare the arguments used; declare the grouping and the respective visual attribute; draw the visualization.” This is all done through a series of composable commands, illustrated by example below. The Pumas project has a much more extensive tutorial than is presented here.

We will see that it is very easy to visualize multiple variables through an appropriate choice of graphic or transformation, with further choices of coloring, faceting, or other means to demarcate different factors. The “declarative” style shines here, as the user simply specifies a variable, and the package converts this, as needed, to a color or shape .

We begin by loading the packages. The CairoMakie backend is used here, GLMakieis good for interactive usage at the command line, WGLMakie is for web-based graphics, all are part of the same Makie plotting ecosystem.

using StatsBase, DataFrames, CategoricalArrays, RDatasets
using CairoMakie, AlgebraOfGraphics
set_aog_theme!()

We use the color theme of aog, as declared in the last command. The packages are compute-intensive and can take a while to load.

Following the package tutorial, we load the Palmer penguins data set of Allison Horst. This includes data collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network. The data can be downloaded from the GitHub site, but it is also wrapped into a Julia package:

using PalmerPenguins
penguins = dropmissing(DataFrame(PalmerPenguins.load()))
first(penguins, 3)
3×7 DataFrame
Row species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
String15 String15 Float64 Float64 Int64 Int64 String7
1 Adelie Torgersen 39.1 18.7 181 3750 male
2 Adelie Torgersen 39.5 17.4 186 3800 female
3 Adelie Torgersen 40.3 18.0 195 3250 female

This data set has several correlated numeric variables on bill length, bill depth, flipper length, and body mass; and several categorical variables, such as species, island, and sex. An even more complete data set can be downloaded from the GitHub site.

6.1 Univariate graphical summaries

We run through the basic graphics for univariate statistics. We shall see that the framework makes multi-variate display quite direct, and at times easier than a univariate display.

6.1.1 Boxplot and violin plot

A boxplot (Figure 6.1) for each species is created by composing a series of declarative commands:

p = data(penguins) *
    visual(BoxPlot) *
    mapping(:species, :bill_length_mm => "Bill length (mm)", color=:species);

This illustrates many of the idioms used in the AlgebraOfGraphics.

The data(penguins) command sets up the data. Here a data frame is passed, but this can be any Tables compatible structure, such as a struct of arrays such as data((;x, y)) for some pair of variables x and y.

The mapping call takes values in the data to positions and attributes of the graphic. It uses position to identify the x, y, and (at times) z values for the graphic. The y variable specification above illustrates a mini language nearly identical to the DataFrames mini language. For a box plot, an indicator of the groups goes in the x position, the data values in the y position. The color=:species argument uses a mapping between the levels of the :species variable and color ramp to give the graphic a distinct color for each species. Omitting this argument produces a monotone graphic with the chosen theme.

The visual(BoxPlot) command declares the visualization or transformation to be used to view the data. The visual function expects a type indicating the plot type to use and optional keyword arguments. In this case, BoxPlot is the type associated with the Makie.boxplot function. At times this type must be qualified, such as with Text, for annotations.

Both the mapping and visual calls can be used to set attributes:

  • visual is used to set attributes for each element independent of the data. For example, a box plot has the argument orientation which is not data dependent, so is adjusted within the visual call.
  • mapping is used to have attributes depend on values of a variable, like color is used above.

The attributes are those for the underlying plotting function. For visual(BoxPlot), these can be seen at the help page for boxplot, displayed with the command ?boxplot.

The mapping calls shows two uses of the mini language for data manipulation. The basic form is source => function => target and works very much like the DataFrames mini language does for select or transform, but unlike those, the function is always applied by row. This makes some transformations, such as \(z\)-scores not possible within this call – transformations requiring the entire column need to be done within the values passed to data. The abbreviated forms are just source, as used with the color=:species argument; source => function; and source => target, such as :bill_length_mm => "bill length (mm)" used to rename the variable for labeling purposes. When the source involves more than one column selector, tuples should be used to group them.

A few functions are provided to bypass the usual mapping of the data. (For example, color maps levels of a factor to a color ramp behind the scenes.) Among these are nonnumeric to pass a numeric variable to a value expecting a categorical variable and verbatim to avoid this mapping. The latter, => verbatim, will be necessary to add when annotating a figure.

The object p can be rendered to the screen with the draw method resulting in Figure 6.1. Just draw(p) will render the graphic, the following also shows how the figure keyword argument can be used to set attributes using a named tuple, in this case the figure size. Similarly axis values can be modified in this manner. In the following, we set a title attribute for the axis.

draw(p; figure=(;size=(600,400)),
     axis=(; title="Bill length"))
Figure 6.1: Boxplots of bill length for the three penguin species in penguins.
To save a figure to a file

The output of draw is used to render to the screen and also to save to a file (as a .png or .svg file). The pattern save("filename.[png|svg]", draw(p)) will save the image to the named file using the given extension to specify the format used.

This is the basic pattern where different choices are combined, or merged, with the * operation. The pieces can be re-purposed. In the following, we make use of this data:

d = data(penguins);

Box plots are very effective for quickly comparing distributions of a numeric variable across the levels of some factor. The calling syntax preferences that style, where both an x and y value are specified to mapping. To create a box plot of a single variable, without grouping, the graphic takes a bit more to construct. In the following we create a single valued x variable to produce the upper left graphic in Figure 6.2:

p1 = d * visual(BoxPlot) *
    mapping(1 => one, :bill_length_mm => "Bill length (mm)");

The mini language is used above two different ways: with a function to create the single value for x (AlgebraOfGraphics will treat this to a factor, so one isn’t needed, just some single-valued function) and with a target for labeling the y variable. As mentioned, such transformations can also be done within the data frame before it is passed to data, which is necessary for some types of transformations.

To add another layer, in this case a scatter plot, we can add the plotting objects:

p2a = d * visual(BoxPlot) * mapping(:species, :bill_length_mm, color=:species)
p2b = d * visual(Scatter) * mapping(:species, :bill_length_mm)
p2 = p2a + p2b;

The Scatter transformation plots pairs of points in a Cartesian plane.

Combinations with + add a layer; those with * merge layers. The algebra name also refers to algebraically desirable short cuts. For example, we repeat d and the mapping for each p2a and p2b, but these can be used just once by distributing them:

m = mapping(:species, :bill_length_mm => "bill length (mm)", color=:species);
p3 =  d * ( visual(BoxPlot) + visual(Scatter) ) * m;

Both p2 and p3 are shown in the lower row of Figure 6.2. There is just one slight difference, the dots representing the data in p2 are not colored, as the mapping did not instruct that in forming p2b.

Specifying a violin plot requires just a slight modification to the above: we change the BoxPlot visual to Violin. Violin plots have an argument side that allows both sides of the violin to reflect an extra grouping variable. We use the :sex variable in the following, as it has only two levels. With this, each side of the violin plot reflects grouping by the :sex factor, the legend is used to lookup which level of the factor is represented.

p4 = d * visual(Violin) * mapping(:species, :bill_length_mm, color=:species, side=:sex);

The visual(Violin) call wraps the function Makie.violin whose documentation contains additional possible arguments beyond side.

The AlgebraOfGraphics package builds on the Makie package and can use its layout system. Makie’s layout system leverages matrix notation to specify cell position. The draw! method accepts a figure object as a first argument. In Figure 6.2 we layout 2 rows and 2 columns of figures, as follows:

f = Figure()
draw!(f[1,1], p1)
draw!(f[1,2], p4)
draw!(f[2,1], p2)
draw!(f[2,2], p3)
f
Figure 6.2: Figure showing four different graphics displayed. In this case, a single boxplot; a violin plot; a boxplot with scatter; and a similar one with the data and mapping easily reused for each visual.

6.1.2 Dot plot

The boxplot does an excellent job of summarizing a data set with a few indicators making it quite useful when there are many data points. A dot plot is useful when there are a limited number of values and advantageous as the graphic shows all the data.

A dot plot (Figure 6.3) can be constructed easily enough by ensuring, in this case, the y variable is non-numeric:

huddle = penguins[sample(1:size(penguins,1), 50),:] # a sample

p1 = data(huddle) * visual(Scatter) *
    mapping(:bill_length_mm=>"Bill length (mm)", :species => nonnumeric);

(In this example, species is categorical, so the extra => nonnumeric is unnecessary.)

Compare the above to a boxplot of the same sampled data:

p2 = data(huddle) * visual(BoxPlot) *
    mapping(:species, :bill_length_mm => "Bill length (mm)"; color=:species);

The boxplot makes it easy to compare medians across the levels of the species factor to gauge graphically if there is a differentiated effect on the response.

The following is an enhanced dot plot which emphasizes a comparison of center by adding a line and sorting so that this line only moves to the right as the eye travels up the levels of the factor. The code is a modification of some from (Alday et al. 2022).

"`dotplot`: show values for each group as dotplot sorted by some center"
function _arrange_dotplot_data(df, value::Symbol, group::Symbol, center=mean;
                              jitter=true)
    transform!(df, value => Array, group => CategoricalArray;
               renamecols=false) # set up types

    sumry = combine(groupby(df, group), value => center => value)
    sort!(sumry, value)
    ordered_levels = string.(sumry[!, group])
    levels!(sumry[!, group], ordered_levels) # relevel, used in plotting
    levels!(df[!, group], ordered_levels)
    jitter && (df = combine(groupby(df, group),
                            value => (x -> x .+ std(x)/100), renamecols=false))


    df, sumry
end


df, sumry = _arrange_dotplot_data(huddle, :bill_length_mm, :species, median)
mm = mapping(:bill_length_mm => "Bill length (mm)", :species)
p3 = data(df) * mm *
    visual(Scatter; marker='○', markersize=12)  # use a character for a marker
p3 += data(sumry) * mm * visual(Lines);         # add summary line

All these figures appear in Figure 6.3.

Figure 6.3: A basic dot plot, a comparable box plot, and an enhanced dot plot. For small data sets, the dot plot can show comparisons of spread and center quite well; reordering based on the center emphasizes the differentiated effect on the response of the grouping variable.

6.1.3 Faceting

The package also supports faceting where different panels share the same scales allowing easy cross comparison. Faceting is specified through the keyword layout or either (or both) of row and col keywords. The layout keyword uses levels of the variable name it is passed and arranges the plots over these levels. A col declaration will make columns for each level of the specified variable, whereas a row declaration will create rows for each level of the specified variables. By default both the x and y axes are linked. These linkings can be decoupled when drawing by passing in values to the facet argument, along the lines of: draw(p, facet=(; linkxaxes=:none, linkyaxes=:none)).

6.1.4 Histograms

The AlgebraOfGraphics has certain functions it refers to as transformations of the data. These include histogram, density, frequency, linear, smooth, and expectation; most all will be illustrated by example below.

These are used like visual was above, but arguments are passed directly to the transformation.

The histogram function plays the role of visual in this graphic. (The visual function is still useful to apply data-independent attributes.) Here we arrange to color by species:

p1 = d * histogram() * mapping(:bill_length_mm, color=:species);

The histograms overlap. The layout command can be used to declare one panel per level. We do this with :sex:

p2 = d * histogram() * mapping(:bill_length_mm, color=:species, layout=:sex);

See Figure 6.4 for the graphics.

6.1.5 Density plot

The histogram function has options for overriding the default bin selection and has several options for scaling the figure through its normalization argument. We use this in the next graphic which layers a density plot over a scaled histogram using the :pdf scaling. The density transformation is qualified with the module name to prevent a conflict with one in Makie1.

layers = histogram(normalization=:pdf) + AlgebraOfGraphics.density()
p3 = d * layers * mapping(:bill_length_mm, color=:species, layout=:sex);

In this next figure we add in a scatter plot of the data on top of the density plots. For the scatter plot, we use the Scatter visual for which we create jittered \(y\) values to disambiguate the data, these are added as a column to the data in d1, below:

p4a = d *  AlgebraOfGraphics.density() *
    mapping(:bill_length_mm, color=:species)

d1 = data(transform(penguins,
                    :bill_length_mm => ByRow(x -> 0.02 * rand()) => :ys))

p4b = d1 * visual(Scatter) * mapping(:bill_length_mm, :ys, color=:species)
p4 = p4a + p4b;
Figure 6.4: Histogram and Density plots.

6.1.6 Quantile-normal plots

The QQNorm and QQPlot visuals are used to make quantile-quantile plots; QQNorm expects a mapping to :x (first position) whereas QQPlot expects mappings to :x and :y (the first two positions).

The following will give a visual check if bill length is normally distributed, the graphic indicates slightly shorter tails than expected

p1 = data(penguins) * visual(QQNorm, qqline=:fit) *
    mapping(:bill_length_mm);

The following will give a visual check if bill length has a similarly shaped distribution as bill depth, in this case with each species highlighted:

p2 = data(penguins) * visual(QQPlot, qqline=:fit) *
    mapping(:bill_length_mm, :bill_depth_mm, color=:species);

Both are shown in Figure 6.5.

Figure 6.5: Quantile-quantile plots. The left graphic uses a reference normal distribution (through QQNorm), the right one uses QQPlot to compare the distribution of two variables after grouping by species.

6.2 Line plots

A scatter plot shows \(x\) and \(y\) pairs as points, a line plot connects these points. There are numerous ways to draw lines with the AlgebraOfGraphics including: visual(Lines), for connect-the-dots lines; visual(LinesFill), for shading; visual(HLines) and visual(VLines), for horizontal and vertical lines; visual(Rangebars) to draw vertical or horizontal line segments.

The graph of a function can be drawn using Lines, as in this example, where we add in different range bars to emphasize the role that the two parameters play in this function’s graph:

ϕ(x; μ=0, σ=1) = 1/sqrt(2*pi*σ^2) * exp(-(1/(2σ)) * (x - μ)^2)

xs = range(-3, 3, length=251)
ys = ϕ.(xs)
c = data((x=xs, y=ys)) * visual(Lines) * mapping(:x, :y)

c += data(DataFrame(x=0, hi=ϕ(0), lo=0)) * visual(Rangebars) *
    mapping(:x, :hi, :lo)

c += data(DataFrame(xmin=0, xmax=1, y=ϕ(1))) * visual(Rangebars, direction=:x) *
    mapping(:y, :xmin, :xmax)

c += data((x=[1/10, 1/2], y=[0, ϕ(1)], label=["μ", "σ"])) *
    visual(Makie.Text) *
    mapping(:x, :y, text = :label => verbatim)
draw(c)

The Rangebars visual has a direction argument, used above to make a horizontal range bar.

The annotation has two subtleties: the qualification of Makie.Text is needed, as there is a Text type in base Julia. More idiosyncratically, the use of verbatim in mapping is needed to avoid an attempt to map the labels to a glyph, such as a pre-defined marker.

6.3 Bivariate relationships

Scatterplots with trend lines are easily produced within the AlgebraOfGraphics framework: the Scatter visual creates scatter plots; for trend lines there is the smooth transformation to fit a loess line, and the linear transformation to fit linear models.

This first set of commands shows how to fit a smoother (upper left graphic in Figure 6.6). The smooth function has arguments which pass on to Loess.loess.

layers = visual(Scatter) + smooth()
p1 = d * layers * mapping(:bill_length_mm, :bill_depth_mm);

The linear function draws the fitted regression line and shades an interval automatically (the interval argument). Linear prediction under model assumptions provides a means to identify confidence intervals for the mean response (the average value were the covariates held fixed and the response repeatedly samples) and for the predicted response for a single observation. The latter are wider, as single observations have more variability than averages of observations. A value of nothing suppresses this aspect.

This next set of commands shows (upper-right figure of Figure 6.6) one way to add a linear regression line. As the mapping for linear does not include the grouping variable, (color) the line is based on all the data:

d1 = d * mapping(:bill_length_mm, :bill_depth_mm)
p2a = d1 * visual(Scatter) * mapping(color=:species)
p2b = d1 * linear()
p2 = p2a + p2b;

Whereas with this next specification, color is mapped for both the linear transformation and the Scatter visual. This groups the data and separate lines are fit to each. We can see (lower-left figure of Figure 6.6) that whereas the entire data shows a negative correlation, the cohorts are all positively correlated, an example of Simpson’s paradox.

layers = visual(Scatter) + linear()
p3 = d1 * layers *  mapping(color=:species);

Adding layout=:sex shows more clearly (lower-right figure of Figure 6.6) that each group has a regression line fit, that is the multiplicative model is fit.

p4 = d1 * layers *  mapping(color=:species, layout=:sex);
Figure 6.6: Scatter plots of bill depth by bill width produced by varying specifications.

6.3.1 Corner plot

A corner plot, as produced by the PairPlots package through its pairplot function, is a quick plot to show pair-wise relations amongst multiple numeric values. The graphic uses the lower part of a grid to show paired scatterplots with, by default, contour lines highlighting the relationship. On the diagonal are univariate density plots.

using PairPlots
nms = names(penguins, 3:5)
p = select(penguins, nms .=> replace.(nms, "_mm" => "", "_" => " ")) # adjust names
pairplot(p)

6.3.2 3D scatterplots

A 3-d scatter plot of 3 numeric variables can be readily arranged, with just one unexpected trick:

  • The mapping object should contain an x, y, and z variable specification with numeric variables.

  • The draw call should include an axis = (type = Axis3,) call, specifying that a 3D (Makie) axis should be used in the display.

d = data(penguins)
p = d * mapping(:bill_length_mm => :bl,  :bill_depth_mm => :bd,  :flipper_length_mm=>:fl; color=:species,
              row=:sex, col=:island)
draw(p, axis=((type=Axis3,)))
Figure 6.7: 3D scatter plots of bill length, bill depth, and flipper length with faceting by island and sex variables.

6.4 Categorical data

The distribution of the surveyed species is not the same. A bar chart can illustrate (upper-left graphic of Figure 6.8). The frequency transform does the counting:

p1 = d * frequency() * mapping(:species);

Two categories can be illustrated, we need dodge set here to avoid overplotting of the bars. In this example, following the AlgebraOfGraphics tutorial, we add in information about the island. This shows (upper-right graphic of Figure 6.8) that two species are found on just 1 island, whereas Adelie is found on all three.

p2 = d * frequency() *
    mapping(:species, color=:island, dodge=:island);

Using stack in place of dodge presents a stacked bar chart (lower-left graphic of Figure 6.8):

p3 = d * frequency() *
    mapping(:species, color=:island, stack=:island);

A third category can be introduced using layout, col, or row (lower-right graphic of Figure 6.8):

p4 = d * frequency() *
    mapping(:species, color=:island, stack=:island) *
    mapping(row=:sex);
Figure 6.8: Scatter plots of bill depth by bill width produced by varying specifications.

6.5 Customizing plots through axis

There are a numerous customizations available when drawing a plot. We discuss a small handful of them here. See the PumasAI tutorial and the documentation for more details.

The draw command allows the passing of values to the axis mechanism of Makie. This allows customization of various features such as the title, the ticks, the aspect ration, and the grids.

Makie plots are themeable. In the above we used set_aog_theme!(). This theme sets a number of defaults for the axis attributes:

Axis = (
        xgridvisible=false,
        ygridvisible=false,
        topspinevisible=false,
        rightspinevisible=false,
        bottomspinecolor=:darkgray,
        leftspinecolor=:darkgray,
        xtickcolor=:darkgray,
        ytickcolor=:darkgray,
        xticklabelfont=lightfont,
        yticklabelfont=lightfont,
        xlabelfont=mediumfont,
        ylabelfont=mediumfont,
        titlefont=mediumfont,
    )

To override these or pass other attributes on to the rendering, the axis keyword argument accepts a named tuple of values. So, for example, to set the graphics title, we would see axis=(; title="Some title"), to instruct the labels in a barplot on the x axis to be rotated, we would see axis=(; xticklabelrotation = pi/2). Of course these would typically combined, as above.

The following lists some useful attributes. A complete list is in the Makie docs for the Axis constructor.

The aspect ratio for a graphic is adjustable through the aspect attribute.

The following labeling attributes can be adjusted: title, subtitle, xlabel, ylabel. These take a string (or an observable) for the value to display. This value can be adjusted, for example, there are titlealign, titlecolor, titlefont, titlesize, and titlevisible attributes. Similar attributes exist for the other labels.

An axis has ticks. These are often numbers. For the ticks on an x axis there are attributes xticks, xtickcolor, xtickformat, xticksize, and xtickwidth. Similarly with y. There are also minor ticks, adjustable with, for example, xminorticks, xminortickcolor, xminorticksize, etc.

For ticks representing categorical values, labels are used. Attributes for tick labels include: xticklabelalign, xticklabelcolor, xticklabelfont, xticklabelrotation, and xticklabelsize.

The displayed grid is adjustable through attributes like xgridcolor, xgridstyle, xgridvisible, xgridwidth, along with “minor” versions.

For 3 dimension plots, the Axis3 object is used for display. This has similarly named attributes for z values.


  1. The Makie density function could be accessed through visual(Density) without module qualification. The density function in AlgebraOfGraphics has a nice transparency feature which makes its use desirable.↩︎