Using Julia for Introductory Statistics

Author

John Verzani

Published

January 6, 2025

1 Julia introduction

This is a collection of notes for using Julia for introductory statistics.

In case you haven’t heard, Julia is an open-source programming language suitable for many tasks, like scientific programming. It is designed for high performance – Julia programs compile on the fly to efficient native code. Julia has a relatively easy-to-learn syntax for many tasks, certainly no harder to pick up than that of R or Python, the widely used scripting languages for the tasks illustrated herein.

Why these notes on introductory statistics? No compelling reason, save that I had done something similar for R when R was a fledgling S-Plus clone. No more: R is a juggernaut, and it is almost certain Julia will never replace R as the programming language of choice for statistics. Besides, Julia users can already interface with R quite easily through RCall. However, there are some reasons that Julia could be a useful language when learning basic inferential statistics, especially if other real strengths of the Julia ecosystem are needed. So these notes show how Julia can be used for these tasks and, hopefully, show that it works pretty well.

There are some great books published about using Julia (Bezanson et al. 2017) with data science, within which much of this material is covered. For example, (Kamiński 2022) is a quite thorough treatment, (Storopoli et al. 2021) is very well done, and (Nazarathy and Klok 2021) covers topics here (cf. the JuliaCon Workshop). (Lukaszuk 2023) covers many of the topics here from a biologist’s viewpoint. The quarto book Embrace Uncertainty (Alday et al. 2022) covers the more advanced topic of mixed-effects models in Julia. Nothing here couldn’t be found in those resources; these notes are just an introduction.

Contribute

These notes are a work in progress. Feel free to click the “edit this page” button or report an issue.

1.1 Installing and running Julia

Julia can be downloaded from julialang.org. The language is evolving rapidly, and the latest official release is recommended. These notes should work with any version since v"1.6.0", but v"1.9.0" or later is recommended, as there are significant speedups with external packages that make the user experience even better.
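To see which version is running, the built-in constant VERSION can be compared against a version literal. This is only a small sketch; the variable names supported and recommended are made up for the illustration, and the thresholds are the two versions mentioned in the text:

```julia
# VERSION is a built-in constant holding the version of the running Julia
println("Running Julia ", VERSION)

# v"..." creates a VersionNumber, so the comparison is by version, not by string
supported   = VERSION >= v"1.6.0"   # the minimum these notes are expected to work with
recommended = VERSION >= v"1.9.0"   # the version recommended in the text
```

Comparing version numbers rather than strings matters: v"1.10.0" is later than v"1.9.0", though the string "1.10.0" sorts before "1.9.0".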

Once downloaded and installed, Julia provides a command line for interactive usage and a binary to run scripts. It is envisioned that most users will use an alternative interface, though Julia has an excellent REPL for command-line usage.

Some alternatives to the REPL for interacting with Julia are:

IJulia: This is a means to use the Jupyter interactive environment to interact with Julia through notebooks. It is made available by installing the package IJulia (details on package installation follow below). This relies on Julia’s seamless interaction with Python and leverages many technologies developed for that language.
Pluto: The Pluto environment provides a notebook interface for Julia, written in Julia and leveraging many JavaScript technologies for the browser. It has the feature of being reactive, making it well suited for many exploratory tasks and pedagogical demonstrations.
Visual Studio Code: Julia is a supported language for Microsoft’s Visual Studio Code editor, a programmer’s IDE.

These notes use quarto to organize the mix of text, code, and graphics. The quarto publishing system is developed by Posit, the developers of the wildly successful RStudio interface for R. The code snippets are run as blocks (within IJulia) and the output of the last command executed is shown. (If code is copied and pasted into the REPL, each line’s output will be displayed.) The output is displayed below the cell, as here, where we show that Julia can handle basic addition:

2 + 2

4

1.2 Overview of some basics

This section gives a quick orientation for using Julia. See this compiled collection of tutorials for more comprehensive introductions.

As will be seen, Julia uses multiple dispatch (as does R), where different function methods can be called using the same generic name. Different methods are dispatched depending on the type and number of the arguments. The + sign above is actually a call to the + function, which in base Julia has over 200 different methods, as there are many different implementations of addition. For a beginner this is great – fewer new function names to remember.
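As a small sketch of multiple dispatch, the following defines several methods for one generic function. The name kind_of is hypothetical, made up for this illustration; methods is a built-in function that lists the methods of a generic function:

```julia
# `kind_of` is a hypothetical generic function with four methods
kind_of(x::Integer)        = "an integer"
kind_of(x::AbstractFloat)  = "a floating point number"
kind_of(x::AbstractString) = "a string"
kind_of(x, y)              = "two arguments"

kind_of(2)         # the Integer method is selected
kind_of(2.0)       # the AbstractFloat method is selected
kind_of("two")     # the AbstractString method is selected
kind_of(2, 2.0)    # selected by the number of arguments

length(methods(kind_of))   # 4 methods share one generic name
length(methods(+))         # the count varies with the Julia version and loaded packages
```

The caller uses the same name each time; Julia picks the method from the number and types of the arguments.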

Julia is a dynamically typed language, like R and Python, meaning variables can be reassigned to different values and with different types.¹ This dynamism makes interactive usage at the REPL or through a notebook much easier.
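A brief illustration of reassignment, using the built-in typeof function to inspect the type of the current value:

```julia
x = 1
typeof(x)        # an integer type (Int64 on a 64-bit system)

x = "one"        # the same name is rebound to a value of a different type
typeof(x)        # String

x = [1.0, 2.0]   # and again, now to a vector of floating point numbers
typeof(x)        # Vector{Float64}
```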

Julia supports the usual mathematical operations familiar to users of a calculator, such as +, -, *, /, and ^. In addition, there are numerous built-in functions, such as mathematical ones like sqrt or programming-oriented ones like map.

These functions are called with arguments, which may be positional (0, 1, or more positional arguments) or specified by keywords. Multiple dispatch considers the number and types of the positional arguments a function is called with; keyword arguments do not take part in dispatch.
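The difference between positional and keyword arguments can be sketched with two built-in functions, sort and round, and one hypothetical function, scale, which is made up for this illustration. Keyword arguments follow a semicolon in a definition and are passed by name:

```julia
xs = [3, 1, 2]
sort(xs)                    # one positional argument: [1, 2, 3]
sort(xs; rev=true)          # the keyword `rev` reverses the order: [3, 2, 1]

round(3.14159)              # 3.0
round(3.14159; digits=2)    # 3.14

# `scale` is hypothetical: `factor` is an optional positional argument,
# `offset` is a keyword argument with a default value
scale(x, factor=2; offset=0) = factor * x + offset

scale(3)                    # 6
scale(3, 10)                # 30
scale(3, 10; offset=1)      # 31
```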

Interactive help

Interacting with Julia primarily involves variables and functions. Almost all functions have documentation, which can be called up by prefacing the function name with a question mark, as in ?sqrt to see the documentation for sqrt. More than one method may be documented. A call like ?sqrt(9) will limit the help to the method called by sqrt(9) (the method used for an integer argument).
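Related information can also be retrieved with ordinary code rather than the ? prefix. The sketch below uses the built-in methods function and the @which macro, which comes from the InteractiveUtils standard library (loaded automatically at the REPL, but loaded explicitly here so the snippet also runs as a script):

```julia
using InteractiveUtils   # provides @which; already loaded in the REPL

ms = methods(sqrt)       # all methods defined for the generic function sqrt
length(ms)               # the count depends on the Julia version and loaded packages

m = @which sqrt(9)       # the specific method that the call sqrt(9) dispatches to
```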

Values in Julia have types. A particular instance will have a concrete type but abstract types help to organize code bases and participate in dispatch. Values can be assigned to variable names, or bindings. The ability to simply create new user-defined types makes generic programming quite accessible and Julia code very composable.
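A minimal sketch of these ideas follows. The type Reading is hypothetical, made up for this illustration, and for simplicity its + method assumes both readings share the same unit:

```julia
x = 9
typeof(x)                    # a concrete type (Int64 on a 64-bit system)
isconcretetype(typeof(x))    # true
isabstracttype(Integer)      # true: Integer only organizes other types
x isa Integer                # true
x isa Real                   # true
x isa AbstractFloat          # false

# `Reading` is a hypothetical user-defined type
struct Reading
    value::Float64
    unit::String
end

# Adding a method to the generic + function (assumes matching units)
Base.:+(a::Reading, b::Reading) = Reading(a.value + b.value, a.unit)

# The generic sum function now works on a vector of readings
total = sum([Reading(1.0, "cm"), Reading(2.5, "cm")])
total.value                  # 3.5
```

Nothing in sum was written with Reading in mind; defining + for the new type is enough, which is the kind of composability referred to above.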

This simple example, taking the average of several numbers, shows most of this:

xs = [1, 2, 3, 7, 9]
sum(xs) / length(xs)

4.4

The first line assigns to a variable, xs, a value that is a vector of numbers, integers of type Int64 in this case. For this illustration, a vector is a container of different numbers. The second line calls three functions: sum to add the elements in the vector; length to count the number of elements in the vector; and / to divide these two quantities. All of these functions are generic, with different methods for different types of argument(s). The same pattern would work for different container types, such as a tuple:

xs = (1, 2, 3, 7, 9) # tuple
sum(xs) / length(xs)

4.4

The takeaway – we can focus more on what the computations mean, and less on how to program a particular computation.
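The same pattern can be wrapped in a function that then works for any container supporting sum and length. The name average is hypothetical, chosen here to avoid clashing with the mean function of the Statistics package:

```julia
# `average` is a hypothetical helper; it relies only on the generic sum, length, and /
average(xs) = sum(xs) / length(xs)

average([1, 2, 3, 7, 9])    # vector: 4.4
average((1, 2, 3, 7, 9))    # tuple: 4.4
average(1:5)                # range: 3.0
average([1.5, 2.5])         # vector of floating point numbers: 2.0
```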

1.3 Add-on packages

Base Julia provides a very useful programming environment, which can be extended through packages. Some packages, such as Dates, are provided with base Julia; others, such as IJulia mentioned previously, are external add-on packages. Julia has one key package, Pkg, to manage installation. By default, installing a single package will also download all the packages it depends on. On installation, packages are partially compiled. This speeds up the loading of a package when it is used within a session, but can slow down package installation.

Packages need to be installed just once, but must be loaded each session. Loading a package is done by a command like using Statistics, which will load the built-in Statistics package. At the REPL, calling using PKGNAME on an uninstalled package will lead to a prompt to install the package. For other interfaces, packages may need to be installed through the Pkg package, which is itself loaded with using Pkg.
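A sketch of the install-once, load-each-session workflow using Pkg.add. StatsBase is used as the example package; the installation step needs network access and may take a while the first time:

```julia
using Pkg              # the package manager that ships with Julia

Pkg.add("StatsBase")   # install once; dependencies are downloaded as well

using StatsBase        # load in every new session before use
```

Several packages can be installed in one call by passing a vector of names, as in Pkg.add(["CSV", "DataFrames"]).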

When a package is loaded, its exported functions are made available to use directly. Non-exported functions can be accessed by qualifying the function with the name of a module (conventionally the name of the package). For example, we will see the command CSV.read, which calls the read function provided by the CSV module of the CSV package.

Most packages, though not all, are designed to extend generic functions that may be defined elsewhere. When two packages export conflicting names, the conflict can be resolved either by just importing the packages and qualifying all uses, or by qualifying only the uses that conflict.
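Qualified access can be sketched with the built-in Statistics package. Here import brings only the module name into scope, so every use must be qualified, while using with a colon brings in just the names listed:

```julia
import Statistics               # only the module name `Statistics` enters scope

xs = [1, 2, 3, 7, 9]
m = Statistics.mean(xs)         # qualified call: 4.4

using Statistics: median        # bring a single name into scope
md = median(xs)                 # unqualified call: 3.0
```

Qualifying with the module name, as in Statistics.mean, always works, whether or not another loaded package exports a function with the same name.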

These notes will utilize numerous add-on packages including:

StatsBase, to extend the built-in Statistics package;
StatsPlots, for easy-to-make statistical plots, which display on a variety of graphing backends;
AlgebraOfGraphics and CairoMakie, for more advanced statistical graphics;
CSV and DataFrames for working with tabular data;
RDatasets, for some handy datasets;
FreqTables and CategoricalArrays, for some needed functionality;
Distributions, for probability distributions;
HypothesisTests, for the computation of significance tests and confidence intervals; and
GLM, Loess, and RobustModels, for statistical modeling.

Most of these are maintained by the JuliaStats organization, which provides the StatsKit package to load all these with a single command, though we don’t illustrate that.

¹ With the one caveat that generic function names cannot be reassigned as variables, or vice versa.