2 Univariate data

Statistics involves the study of data. A data set can be a single value, or scalar, but more commonly is a collection of values. This chapter covers the basic containers used in Julia to store and manipulate data sets for a single variable in statistics. In addition it shows some well-known summaries of a single variable.

2.1 Data vectors

Julia offers several containers for storing ordered data, such as in a data set $x_1, x_2, \dots, x_n$.

We consider this data on the number of whale beachings in a certain area by year during a decade:

74 122 235 111 292 111 211 133 156 79

As the data in a data set is typically of the same type (integer here, or character, number, …) and may be quite large a vector is a natural choice for storage in Julia.

Vectors are created using square brackets with entries separated by commas. Storing the data in a vector and assigning to a variable, whale, is then done with:

whale = [74, 122, 235, 111, 292, 111, 211, 133, 156, 79]

10-element Vector{Int64}:
  74
 122
 235
 111
 292
 111
 211
 133
 156
  79

Without using commas in the construction, a matrix is created:

whale_matrix = [74 122 235 111 292 111 211 133 156 79]

1×10 Matrix{Int64}:
 74  122  235  111  292  111  211  133  156  79

Space-saving measure

For purposes of printing only, we use show or the trailing |> show (which uses the chaining operation to call the show function) in the following to format how a vector is printed. This is only a vertical-space-saving measure and wouldn’t normally be done when interacting with Julia.¹

show(whale)

[74, 122, 235, 111, 292, 111, 211, 133, 156, 79]

Internally a vector is a $1$-dimensional array, a matrix a $2$-dimensional array. Julia has a general $N$-dimensional array type that these are specializations of. For a matrix, one uses space to separate row elements and semicolons to separate rows. Vectors are useful for storing data; matrices can be used to store data, as will be seen through example, but are widely used to represent mathematical values, as they have an associated algebra that can compactly represent many operations (we will see a data frame is more idiomatic for storing rectangular data); arrays are not used in our discussion.

As seen in the output above, Julia prints the size and element type of a vector. In this case, a vector of integers (Int64). The constructor ([]) will promote values to a common type, which may be Any, a catch all type.

Vectors have a length, found by length, which in this case is the number of data points, often labeled $n$. More generally, arrays have a size:

length(whale), size(whale)

(10, (10,))

The size is returned as a tuple, in this case with just 1 element, as vectors are $1$ dimensional.

The length function is a reduction, returning a summary of a vector. Similarly, sum will add the elements, returning a number. Here we see how to compute the mean:

sum(whale) / length(whale)

152.4

The mean function is not part of base Julia. It can be found in the Statistics package, which comes bundled with Julia, but we prefer to use the add-on StatsBase package:

using StatsBase
mean(whale)

152.4

2.1.1 Missing values

Julia uses missing to represent missing values. For example, data on the cost of a hip replacement at various hospitals was found from their websites and is given by:

10500, 45000, 74100, unavailable, 83500,
86000, 38200, unavailable, 44300, 12500,
55700, 43900, 71900, unavailable, 62000

When the cost could not be identified, it was labeled “unavailable.” We replace the colloquial “unavailable” with the special value missing and enter the values into a vector:

hip_cost = [10500, 45000, 74100, missing, 83500, 86000, 38200, missing,
            44300, 12500, 55700, 43900, 71900, missing, 62000]
hip_cost |> permutedims

1×15 reshape(::Vector{Union{Missing, Int64}}, 1, 15) with eltype Union{Missing, Int64}:
 10500  45000  74100  missing  83500  …  55700  43900  71900  missing  62000

We highlight that the type of element in hip_cost is Union{Missing, Int64}, a type that allows an element to be either an integer or a missing element.

The missing value propagates through computations:

sum(hip_cost)

missing

In particular, all of these combinations with missing yield missing, as they should – if data is not available, combinations based on that data are still not available:

1 + missing, 1 - missing, 1*missing, 1/missing, missing^2, missing == true

(missing, missing, missing, missing, missing, missing)

Also nothing and something

In Julia there is also nothing. The semantics are a bit different, and missing is the designed choice for data. (The nothing value is useful for general programming purposes.) For floating point values, there is also NaN, or not a number. This too is often used as a sentinel value, to indicate missingness, though missing is idiomatic. The something function is used to skip over its arguments until a non-nothing value is found. The coalesce function does a similar thing with missing values.

To work with missing data, it can be removed or skipped over. The skipmissing function provides a wrapper around an object which when iterated over, will skip the missing values. For example:

sum(skipmissing(hip_cost))

The sum function iterates over the values in the container it is passed to reduce them to a value, mean is similar and works with skipmissing in an identical manner:

mean(skipmissing(hip_cost))

52300.0

(Which is helpful, as length, used above to compute a mean, does not compose with skipmissing.)

2.1.2 Names (named tuples)

It may be natural to assign the year of measurement to the values in whale, but vectors, as defined in base Julia, do not allow names as an attribute. (There are external packages that allow this, e.g. NamedArrays, which arise in our discussion of contingency tables.) If names are important, Julia provides a named tuple type, which builds on the basic tuple type.

While vectors are collections of homogeneous values, tuples are collections of heterogeneous values. Tuples are constructed with commas, and typically parentheses to delimit the commas:²

whale = (74, 122, 235, 111, 292, 111, 211, 133, 156, 79)

(74, 122, 235, 111, 292, 111, 211, 133, 156, 79)

One-element tuples are distinguished by using a trailing command and parentheses:

a_lone_whale = (101,)

(101,)

For the many purposes, tuples can be exchanged for vectors, as both are iterable³:

sum(whale), mean(whale), length(whale)

(1524, 152.4, 10)

Unlike vectors, but like numbers, tuples can not be modified after construction. This allows tuples to be quite useful – and performant – for programming purposes.

Tuples can also have names. The basic construction uses “key=value” pairs:

test_scores = (Alice = 87, Bob = 72, Shirley = 99)

(Alice = 87, Bob = 72, Shirley = 99)

The names are not quoted, and are stored internally as a tuple of symbols. Extra effort is necessary to create names with spaces or number. In the following example we show that var"..." is useful to create more complicated symbols⁴:

children = (var"X Æ A-Xii" = 1,
            var"Vivian Jenna Wilson" = 2,
            var"Exa Dark Sideræl" = 3)

(var"X Æ A-Xii" = 1, var"Vivian Jenna Wilson" = 2, var"Exa Dark Sideræl" = 3)

The above also shows that Unicode values are easily used within Julia in strings or as identifiers.

Named tuples can also have their “names” attached via parsing, through a syntax similar to keyword arguments of functions, as identifiers imply names:

a = [1,2]; b = [3,4]
(; a, b)  # same as (a=a, b=b)

(a = [1, 2], b = [3, 4])

This gives several different ways to construct $1$-element named tuples, where something is done to disambiguate the use of enclosing parentheses:

(; a), (; a=a), (; :a => a), (a=a, )

((a = [1, 2],), (a = [1, 2],), (a = [1, 2],), (a = [1, 2],))

Named tuples can have their values accessed by name using getproperty, which has a . for convenient syntax:

test_scores.Alice

The generic function keys will return the names (“keys” are used to look up a value) and values will extract the values.

Named tuples being both named and heterogeneous are a natural container for collecting data on several different variables for a single case, as will be seen in the discussion on tabular data.

Associative arrays

Abstractly, an associative array is a container for storing (key,value) pairs. Julia has the notation key => value for representing a pair. A named tuple is an immutable associative array where the keys are symbols. A dictionary is a more general mutable type with the keys being arbitrary (numbers, strings, symbols, …). There are various implementations of an AbstractDict, the Dict constructor being the most commonly used, though the one provided in the Dictionaries package will be seen. As with a named tuple, the keys are returned by keys, the values by values, and pairs of (key/value)s by pairs.

2.1.3 Indexing

The elements of a vector, like a data set, are indexed. For vectors Julia uses one-based indexing⁵. The whale data, as defined, has $10$ elements, we can get the third, fourth, and sixth with:

whale = [74, 122, 235, 111, 292, 111, 211, 133, 156, 79]
whale[3], whale[4], whale[6]

(235, 111, 111)

The above uses three function calls, and displays as a tuple does. (Since the commas produce a tuple.)

To retrieve these same values in a single call, we can pass a vector of indices:

whale[ [3,4,6] ]

3-element Vector{Int64}:
 235
 111
 111

The end keyword refers to the last index of a collection. Similarly, begin refers to the first index inside the square brackets.

whale[end], whale[begin]

(79, 74)

These values allow simple arithmetic. Here we see how to trim off the first and last through indexing⁶:

whale[begin+1:end-1] |> show

[122, 235, 111, 292, 111, 211, 133, 156]

The colon, :, refers to all indices in a vector; whale[:] will return a copy of the data in whale.

Indexing with a container of values

The range 1:1 specifies the value 1, as does just 1, but for indexing the two are different:

whale[1], whale[1:1]

(74, [74])

The design is indexing by a scalar – like 1 – can drop dimensions (the vector becomes a scalar), whereas indexing by a container – like 1:1 or, say, [1] – does not drop dimensions. The documentation for indexing of an array has: “If all the indices are scalars, then the result, X, is a single element from the array, A. Otherwise, X is an array with the same number of dimensions as the sum of the dimensionalities of all the indices.”

Views

The extraction of values from a vector, as above, necessitates the allocation of memory to store the copy of the values. When the data set is large, or is accessed many times, these allocations may be avoided using a “view” of the data. Views create a lazy reference to the underlying data in the array. The view function takes the object as its first argument, and indices for its remaining arguments⁷:

view(whale, [3,4,6])

3-element view(::Vector{Int64}, [3, 4, 6]) with eltype Int64:
 235
 111
 111

The end and begin keywords do not work with view⁸; the lastindex and firstindex functions do:

view(whale, firstindex(whale)+1:lastindex(whale)-1) |> show

[122, 235, 111, 292, 111, 211, 133, 156]

2.1.4 Assignment

The assignment

whale = [74, 122, 235, 111, 292, 111, 211, 133, 156, 79]
show(whale)

[74, 122, 235, 111, 292, 111, 211, 133, 156, 79]

binds the name whale to the container holding the $10$ numbers.

If we make another assignment, as in

whale_copy = whale
show(whale_copy)

[74, 122, 235, 111, 292, 111, 211, 133, 156, 79]

the container is copied, but – unlike if we had used copy(whale) – the two variables point to the same container. When a vector is passed into a function, the function works with the container, not a copy.

2.1.5 Modification

Data may need to be modified after read in or constructed, such as is done with data cleaning. As such, the values in the data vector may need to be reassigned. This is carried out by using the indexing notation on the left-hand side of an assignment:

whale[1] = 75

The value 75 is returned, as the right-hand side of an assignment is always the returned value. We can see that whale was modified:

whale[1], whale_copy[1]

(75, 75)

We also see that as whale_copy points to the same container, it too was modified.

Mutation

When a vector is passed to a function, if there is no copy made (as opposed to a simple naming), then changes to the vector in the function will effect the original vector as the container used in the body of the function isn’t changed. Functions which modify an argument that is passed to them (usually the first one) are conventionally named with a trailing !, though this has no semantic implication.

Multiple values can be assigned at once. For example, if the data was mis-arranged chronologically, we might have:

whale[ [1,2,3] ] = [235, 74, 122]
show(whale)

[235, 74, 122, 111, 292, 111, 211, 133, 156, 79]

The above modified the original container, so these changes would also be reflected in whale_copy.

Julia does not recycle values to be assigned by default, where recycling involves some rules where the space referenced on the left-hand side of an assignment does not match in size the space needed to store the right-hand side of an assignment. Broadcasted assignment (cf. ?.=) can produce a similar behavior. Broadcasted assignment expands the right-hand side to match the space expected on the left side (when possible) and then does the assignment. For example, if we wanted to use a sentinel value to indicate unknown data for the first 3 values, we might have:

whale[[1,2,3]] .= 999
show(whale)

[999, 999, 999, 111, 292, 111, 211, 133, 156, 79]

Notice, without the dot an error will be thrown

whale[[4,5,6]] = 999  # errors

LoadError: ArgumentError: indexed assignment with a single value to possibly many locations is not supported; perhaps use broadcasting `.=` instead?

The “recycling” above uses the left-hand side to identify the size needed. Broadcasted assignment also works with the entire collection. For example, the following command replaces the current data with the original data:

whale .= [74, 122, 235, 111, 292, 111, 211, 133, 156, 79]
show(whale)

[74, 122, 235, 111, 292, 111, 211, 133, 156, 79]

But whale is not a new object – as it would be without that dot – but rather, these values are placed into the container whale already refers to – which also is the container whale_copy points at:

show(whale_copy)

[74, 122, 235, 111, 292, 111, 211, 133, 156, 79]

This assignment avoids the needed allocation of more memory.

The use of [:] on the left-hand side also does in-place assignment⁹. So unlike whale = [...] which replaces the container whale points at, this command reuses the container:

whale[:] = [74, 122, 235, 111, 292, 111, 211, 133, 156, 79]
show(whale)

[74, 122, 235, 111, 292, 111, 211, 133, 156, 79]

Various indexing patterns.
Index style	Explanation
`x[1]`	The first element of `x`.
`x[:]`	Copy of all elements of `x`.
`x[end]`	The last element of `x`.
`x[first]`	The first element of `x`.
`x[[2,3]]`	The second and third elements of `x`.
`typeof(x)[]`	$0$-length vector of same type as `x`.
`eltype(x)`	Element type of container `x`.
`x[1] .= 5`	Assign a value of 5 to first element of `x`.
`x[[2,3]] .= 4`	Broadcasted assignment to second and third elements
`x[[2,3]] = [4,5]`	Assign values to second and third elements of `x`.
`x[:] = [1, 2, 3]`	In-place assignment. Size and type of right-hand side must match left-hand side

Vector size

A vector has a length which can not be changed during assignment. Attempting to assign to an index beyond the size will result in a BoundsError. To extend the size of a vector we can use the following generic functions:

push!(v, x): extend the vector v pushing x to the last value
pushfirst!(v, x): extend the vector v pushing x to the first value
append!(v, v1): append the entries in v1 to the end of v
insert(v, i, x): insert x into v at index i, shifting as needed
vcat(v, v1): vertically concatenate v and v1. (Unlike append! this returns a new vector, so the type of the output will be recomputed).

The vector can also be shrunk. These generic functions for containers are useful:

pop!(v): remove last element from a vector; return element
popfirst!(v): remove first element from a vector; return element
deleteat!(v, i): remove ithe element from a vector; returns vector
empty!(v): remove all elements from a vector

Again, the trailing ! in a function name is a convention to indicate to the user that the function will mutate its arguments, conventionally its first one. That is, a function like pop!(v) does two things: it returns the last element and also shortens the vector v that is passed in.

Vectors have an element type

The assignment whale = [74, 122, 235, ...] assigns a container of a specific type to the variable whale. The eltype(whale) command will return this element type. Subsequent assignments to this container must contain values that can be promoted to that type, otherwise an error will be thrown.

For example, we can’t assign a fractional number of whales:

whale[1] = 74.5

LoadError: InexactError: Int64(74.5)

An InexactError is thrown because, whale is a vector of integers, and 74.5 can’t be automatically promoted to an integer (as 74.0 could be).

A similar thing happens if we attempt to assign missing to a value, as in whale[1] = missing. In this case, a MethodError is thrown, which comes from the attempt to “convert” missing into the underlying integer type.

The way to fix this is to create a new container that allows a wider type. For Float64 values that can be achieved in the cumbersome manner of converting all the values to the wider type:

whale = convert(Vector{Float64}, whale)
whale[1] = 74.5

74.5

This assigns whale to a new container which accepts floating point values, and then reassigns the first one.

Though cumbersome, this is not typical usage, as the constructor used to create the data set will promote to a common type, so it would only matter when adjusting the initial values.

For the special case of assigning a missing value, the allowmissing function from the DataFrames package¹⁰ creates a vector with a type that allows – as well – missing values¹¹. Again, re-assignment is necessary:

using DataFrames
whale = allowmissing(whale)
whale[1] = missing

missing

Broadcasting

As seen, the functions length and sum are reductions – in this case, returning a single number, a scalar, from a vector of numbers. To compute a sample standard deviation, say, we follow the formula:

\[ s = \sqrt{\frac{\sum_i (x_i - \bar{x})^2 }{n-1}}. \]

To do so we would need to:

Subtract a scalar value, resulting for a reduction, from from each element of the data vector: ($x_i - \bar{x}$).
Apply the squaring function to each element of this difference.
Reduce the resulting data to a number through sum.
Divide a number by another number and take the square root.

Embarking on this punch list with a naive attempt at the first – whale - mean(whale) – will fail.

The subtraction of a scalar value from a vector value is not defined, as Julia is not implicitly vectorized. Rather the user must be explicit. For this, the concept of broadcasting is useful. In this context, broadcasting will expand the scalar to match the size of the vector and then use vector subtraction to find the result. Broadcasting is done simply by adding a “.” (the dot) to the function. For infix operations like - this is before the operator:

whale = [74, 122, 235, 111, 292, 111, 211, 133, 156, 79]
whale .- mean(whale)  |> show

[-78.4, -30.400000000000006, 82.6, -41.400000000000006, 139.6, -41.400000000000006, 58.599999999999994, -19.400000000000006, 3.5999999999999943, -73.4]

We also would need to square these values. This could also be done by broadcasting ^, as in:

(whale .- mean(whale)).^2  |> show

[6146.560000000001, 924.1600000000003, 6822.759999999999, 1713.9600000000005, 19488.16, 1713.9600000000005, 3433.959999999999, 376.36000000000024, 12.959999999999958, 5387.56]

To avoid having to use too many dots, it is typical to define a function for doing a scalar computation and broadcast that¹². An example will wait, but to illustrate the syntax, we can broadcast the sqrt function using a “.” after the function name and before the opening parentheses:

sqrt.(whale)  |> show

[8.602325267042627, 11.045361017187261, 15.329709716755891, 10.535653752852738, 17.08800749063506, 10.535653752852738, 14.52583904633395, 11.532562594670797, 12.489995996796797, 8.888194417315589]

Broadcasting works with functions of multiple arguments and multiple shapes.

The use of multiple shapes allows scalars and vectors to be broadcast over and is used in whale .- mean(whale). With this, the standard deviation could be computed with:

sqrt( sum((whale .- mean(whale)).^2)/(length(whale) - 1) )

71.5078861229849

Vector and row vectors can also be broadcast over, So [1,2] .+ [3,4] does vector addition, [1,2] .+ [3 4] pads out the vector to a matrix, the row vector to a matrix, then adds entry-by-entry:

(v=[1,2], rv=[3 4], va = [1,2] .+ [3,4], ma = [1, 2] .+ [3 4], ck=[1 1; 2 2] + [3 4; 3 4])

(v = [1, 2], rv = [3 4], va = [4, 6], ma = [4 5; 5 6], ck = [4 5; 5 6])

This behavior may not be desirable. Some objects broadcast as scalars, others as containers. But there may be times where broadcasting as a container may be incorrect. To force a value to broadcast like a scalar, the value can be wrapped in Ref. That is mean(whale) is the same as mean.(Ref(whale)) but mean.(whale) would broadcast mean over each element in whale. As mean for a single number is just that number, mean.(whale) would just be the same values.

2.2 Data types

An old taxonomy of levels of measurement include nominal, ordinal, interval, and ratio; where nominal values have no rank; ordinal values have a rank, but no meaning is assigned to the difference between values; interval data has a meaning between differences but $0$ is arbitrary; and ratio has a meaningful zero, such as most numeric data. As data can easily be coded (explicitly or behind the scenes) with numbers, this keeps the different types distinct. However, for use with the computer, in particular Julia here, we see it also more sense to emphasize different aspects of the data related to the underlying type.

Julia has numerous data types. Some are “abstract” types, such as Real or Integer, others are “concrete”, often indicating how the data is stored, such as Float64 or Int64, with the “64” indicating $64$ bits of memory. In Julia, these data types can be used to direct method dispatch. For statistics a few common types are used to represent data.

These are reviewed in the following.

2.2.1 Numeric data types

The Julia parser readily identifies the following values of $1$ and reads each using a different type:

1, 1.0, 1//1, 1 + 0im, big(1), BigFloat(1)

(1, 1.0, 1//1, 1 + 0im, 1, 1.0)

Respectively, these represent integer, floating point, rational, complex, and two types of “big” numbers. Julia uses a promotion machinery when different types are mixed. For example, we have:

x = 1 + 1.0 + 1//1 + 1 + 0im + big(1) + BigFloat(1)

6.0 + 0.0im

The type of x must be both complex and be able to store the underlying numbers, which may be “big” numbers:

typeof(x)

Complex{BigFloat}

The above illustrates that addition of an integer and a floating point yields a floating point, and adding to a complex number returns a complex number, etc.

Type stability

Julia programmers try to make functions “type stable,” if possible, as this generally leads to more performant code, once compiled. A consequence is addition of numbers, like 1 + 1.0 will always be the wider type (here a floating point value), even if in the case of these particular values an integer could be the answer. That is the output type of + here is determined by the input types, not the input values.

For data consisting of counts, integers are typically used. If storage is an issue (e.g., lots of data, but not a lot of different values), different forms of integers which use less data may be used.

For data on measurements, with a continuous nature, floating point values are the natural choice. Floating point can represent most integer values exactly, and fractional and irrational values either exactly or approximately. Rational numbers can represent fractional data exactly, though it should be expected that operations with rational values are less performant than for floating point values.

Complex values in Julia are based on an underlying data type holding the two numbers in a + bi.

2.2.2 Categorical data types

There are different options available for the storage of categorical data.

Character data

The String type in Julia is the default type for holding character data. Strings are created with matching single quotes or – for multiline strings – with matching triple quotes:

s = "The quick brown fox ..."
t = """
Four score and seven years ago
our fathers brought forth...
"""

"Four score and seven years ago\nour fathers brought forth...\n"

The string type in Julia can hold Unicode data, such as non-ASCII letters and even emojis.

Entering Unicode

In the REPL Unicode values may be entered using LaTeX shortcuts, e,g., \alpha[tab] to produce $\alpha$. Many interfaces also allow this.

Double quotes are used to create a string. Triple, double quotes can be used to create multi-line strings.

Single quotes are for Char types. A character represents a Unicode code point. Strings are iterable, and iteration yields back Char values. The collect function iterates over an object and returns the values. Here we see the Chars in a string:

s = "Zoë"
collect(s)

3-element Vector{Char}:
 'Z': ASCII/Unicode U+005A (category Lu: Letter, uppercase)
 'o': ASCII/Unicode U+006F (category Ll: Letter, lowercase)
 'ë': Unicode U+00EB (category Ll: Letter, lowercase)

Strings are also indexable. When indexed by a single value, a Char type is returned, when indexed by a range of a values a string is returned. Be warned, indexing into non-ascii strings may error. Here is an example using a string from Julia’s manual:

j = "jμΛIα"
length(j), j[2], j[2:4]  # 2:4 represent the value 2,3,4

(5, 'μ', "μΛ")

The value returned by j[2] is a character, whereas that returned by j[2:4] is a string¹³. However, attempting to extract j[3] will be an error.

String operations

Strings can be combined many different ways.

The * operation is used to combine strings or chars. The basic usage is straight forward:

out = "a" * "bee" * "see"

"abeesee"

The ^ operation when used with an integer exponent will repeat the argument:¹⁴

out = "dot "^3

"dot dot dot "

More generally, join will combine an iterable of strings or characters. If a delimiter is specified, it will be inserted between values (with an option to indicate the last delimiter differently):

j = "jμΛIα"
join(collect(j), ","), join(collect(j), ", ", ", and ")

("j,μ,Λ,I,α", "j, μ, Λ, I, and α")

Julia makes string interpolation easy. Within a string, a value of a variable can be inserted using a dollar sign to reference the variable. This is more general, as computations can also be inserted. Below parentheses are used to delimit the interpolated command:

x = "Alice"
out = "$x knows that 2 + 2 is equal to $(2+2)"

"Alice knows that 2 + 2 is equal to 4"

(To include a dollar sign in the string, it can be escape with a leading slash, “\”, or a raw string can be used, as in raw"$(2+2) is not evaluated".)

For formatted values, such as needed when printing floating point values, the built-in Printf package provides support. For more performant solutions, an IOBuffer can be useful.¹⁵

A common means to generate strings is from reading in delimited files, such as comma-separated files. These may produce strings with leading or trailing spaces. To strip these off, Julia offers strip, lstrip, and rstrip. The lstrip function strips the left side, rstrip the right side, and strip combines the two. An option to pass in other characters to strip besides spaces is available.

strip("   abc    "),  strip("...abc...", '.')

("abc", "abc")

Discussing data sets for different type of data adds no more complication as vectors in Julia are typed so a data set of character data might simply be stored as a vector of string data, such as:

job_title = ["Data Scientist", "Machine Learning Scientist", "Big Data Engineer"]

3-element Vector{String}:
 "Data Scientist"
 "Machine Learning Scientist"
 "Big Data Engineer"

Symbols

Julia as a language can be used to represent the language’s code as a data structure in the language. Symbols are needed to refer to the name, or identifier, of a variable as opposed to the values in the variable. Symbols, being part of the language, are used for other purposes, such as keys for a named tuple of flags for an argument. The access pattern nt.a has been mentioned; this is a convenience for getfield(nt, :a), the symbol being used as a key. When data frames are introduced – essentially a collection of matched data vectors – symbols will be used to reference the individual variables.

The simple constructor for a symbol is :, as in :some_symbol. (The : constructor makes expressions, but in this use, these are interpreted as symbols.) The Symbol constructor can also be used to create symbols with spaces, e.g., Symbol("some symbol"); the string macro var"..." is a convenience.

The string function (or String constructor) will create a string from a symbol.

Factors

While character data is useful for representing data that is unique to each case (like an individual’s address), there are advantages to using a different representation for data with many anticipated repeated values (like an individual’s state). Factors are a data structure that allow the user to see full labels, but internally, the computer only sees an efficient pool of possible levels or values. That is, a factor is essentially a mapping between a label and a corresponding key, with a possible ordering assigned to the labels.

Factors are not a built-in data type, but are provided by the CategoricalArrays package, which is loaded as other packages are:

using CategoricalArrays

Some main tasks when working with such data are:

reordering factor levels
changing the labels of factor levels
combining several levels into one

We use an example based on coffee sizes at your neighborhood Starbucks. An order consisted of $4$ drinks with these sizes, put into a categorical array:

x = categorical(["Grande", "Venti", "Tall", "Venti"], ordered=true)

4-element CategoricalArray{String,1,UInt32}:
 "Grande"
 "Venti"
 "Tall"
 "Venti"

This appears to be like a character vector, but the type is different. First, let’s peek to see how values are internally stored. The user-visible values are stored with:

x.pool

CategoricalPool{String, UInt32}(["Grande", "Tall", "Venti"]) with ordered levels

Internally, the computer sees:

x.refs

4-element Vector{UInt32}:
 0x00000001
 0x00000003
 0x00000002
 0x00000003

The levels are the labels assigned to the internal values. We can see that the levels are internally kept as 32-bit integers, which in general is more space efficient than storing the labels. The command levelcode.(x) will show the values using the more readable 64-bit integers.

Working with levels is the key difference. First levels, may be unordered (nominal) or ordered (ordinal), the latter indicated by specifying ordered=true to the constructor or by calling ordered!(x, true) for an unordered variable.

The example data has an odd order (call levels(x) to see), it coming from lexical sorting. So Grande is before Venti, despite the latter being 16 oz. and the former 20 oz (for hot drinks, 24oz for cold)¹⁶.

x[1] < x[2]

true

To reorder, we call the levels! function with the desired order:

levels!(x, ["Tall", "Venti", "Grande"])
x[1] > x[2]

true

The levels can be extended through assignment. For example, we might prefer to change the label “Tall” for a new label “Tall, 12oz”. We can readily do this for a single value through assignment:

x[3] = "Tall, 12oz"

"Tall, 12oz"

After this command, there are 4 levels though only 3 used in the vector. To trim out extra levels, the droplevels! function can be used.

In general and for multiple replacements, the replace function is useful and the levels are adjusted accordingly:

replace!(x, "Grande" => "Grande, 20oz", "Venti" => "Venti, 16oz")

4-element CategoricalArray{String,1,UInt32}:
 "Grande, 20oz"
 "Venti, 16oz"
 "Tall, 12oz"
 "Venti, 16oz"

The same command can be used to consolidate labels. For example, suppose we wanted to use just “Really big” and “regular sized” where we lump the two smaller ones together. It could be we just use replace with the right hand side using duplicates. Below we use broadcasting to illustrate how this might be done more easily were there many levels to combine. We do this in two lines, to be less confusing, but it could be written with just one:

replace!(x, (["Tall, 12oz", "Venti, 16oz"] .=> "regular sized")...)
replace!(x, "Grande, 20oz" => "Really big")
x

4-element CategoricalArray{String,1,UInt32}:
 "Really big"
 "regular sized"
 "regular sized"
 "regular sized"

Categorical data, like numeric data, can also be combined with vcat, say. If ordered, as much as possible, an order will attempt to be matched.

2.2.3 Logical data and operators

The Boolean type in Julia has two values: true and false.

Boolean values have their own algebra. The short-circuiting && and || implement “and” and “or:”

false && false, true && false, false && true, true && true

(false, false, false, true)

false || false, true || false, false || true, true || true

(false, true, true, true)

These are called “short circuiting,” as the right-hand side is only evaluated if it need be (as in true || false, the statement is known to be true after the left side of || is evaluated, so the right-hand side is not evaluated). This is used frequently for error messages, as the error is not called when the expression is true. These operations may be broadcasted, so the above might have been illustrated by

[true false] .&& [true, false]  # produces a matrix to accommodate sizes

2×2 BitMatrix:
 1  0
 0  0

The operations have left-to-right associativity, so it is common to see them in a sequence.

Many operations promote Boolean values to 1 and 0. For example, true + false is 1, true * false is 0, and sum([true,false, true]) is 2; internally, the complex number $i$ is internally a pair (false, true) indicating no real part and an imaginary part.

Boolean values are returned by the comparison operators <, <=, ==, ===, >=, and >. These have the expected meaning, save == is a test of equality (not =, which is used for assignment) and === is a test of whether two values are identical. (E.g. for a vector x we have x == copy(x) is true, but x === copy(x) is false, as the latter does not point to the exact same container.) The symbol ! is used for negation.

Boolean values can be used for indexing. Suppose inds is a vector of trues and falses with the same length as a vector x, then x[inds] will return those values from x where inds is true.

To create such Boolean vectors, the comparison operators are typically used combined with broadcasting. For example, the following redefines the whale dataset, then filters out only those values bigger than or equal to $200$:

whale = [74, 122, 235, 111, 292, 111, 211, 133, 156, 79]
whale[ whale .>= 200 ]

3-element Vector{Int64}:
 235
 292
 211

Or, this example – which shows the mathematically natural chaining of comparison operators – filters out only the values in $[100, 125)$:

whale[ 100 .<= whale .< 125 ]

3-element Vector{Int64}:
 122
 111
 111

Querying elements

There are other helper functions to query the elements of a Boolean vector.

any(v) will return true if any of the elements of the Boolean vector v are true.
all(v) will return true if all of the elements of the Boolean vector v are true.
findfirst(v) will return the index of the first true value in the Boolean vector v or return nothing if none is found. A predicate function can be used, as in findfirst(f, w) which effectively calls findfirst on f applied to each element of a vector w.
findlast(v) will return the index of the last true value in the Boolean vector v or return nothing if none is found. A predicate function can be used, as in findlast(f, w) which effectively calls findlast on f applied to each element of a vector w.
findnext(v, i) will find the first true value after index i in the Boolean vector v or return nothing if none is found. A predicate function can be used, as in findnext(f, w, i) which effectively calls findnext on f applied to each element of a vector w.

In general, x in v will check if the element x is in the vector v. The Unicode operator ∈ (\in[tab]) can replace in.

2.2.4 Date and time types

Julia provides the built-in Dates module for working with date and time data. This module need not be installed, but is not loaded by default, so loading or importing it is needed to access the functionality.

using Dates

Constructing a date is done with the Date constructor which expects a year, followed by an optional month and day. Date and time objects have the DateTime constructor. This example uses the vernal equinox, summer solstice, autumnal equinox, and winter solstice in the year 2022 for illustration:

ve, ss, ae, ws = Date(2022, 3, 20), Date(2022, 6, 21), Date(2022, 9, 22), Date(2022, 12, 21)

(Date("2022-03-20"), Date("2022-06-21"), Date("2022-09-22"), Date("2022-12-21"))

Dates can also be parsed from a string. The following uses the default format;¹⁷:

Date.(["2022-03-20", "2022-06-21", "2022-09-22", "2022-12-21"])

4-element Vector{Date}:
 2022-03-20
 2022-06-21
 2022-09-22
 2022-12-21

Date objects have the accessors year, month, and day:

year(ve), month(ve), day(ve)

(2022, 3, 20)

The objects allow for natural operations, such as comparisons and differences:

ve < ss < ae < ws, ae - ve, ve + Day(93)

(true, Day(186), Date("2022-06-21"))

The difference is returned in the number of days. The last command shows the duration of 93 days can be constructed with Day and its value added to a Date object.

There are ways to query how a date falls within the calendar

dayofyear(ve), dayofweek(ve), dayname(ve), dayofweekofmonth(ve)

(79, 7, "Sunday", 3)

The last command indicating that the vernal equinox in 2022 fell on the third Sunday of the month.

2.2.5 Structured data

Structured data may not represent statistical data, but is useful nonetheless, e.g, for specifying the year of the counts in the whale data set.

For a vector of all ones or all zeros, the ones and zeros functions are useful. The command ones(n) will return a vector of n zeros using the default Float64 type. To specify a different type, such as Int64, the two-argument form, ones(T, n), is available. Similarly, zeros is used to create a vector of zeros. The singular one(), zero() (one one(T) and zero(T)) are useful for generic programming. For example, we use the idiom one.(x) to create a vector of all 1s with the length and type of the vector x.

Arithmetic sequences, $a, a+h, a+2h, \dots, b$ can be created with the colon operator a:h:b or a:b when h is 1. This operator returns a recipe for generating the sequence, it is lazy – it does not generate the sequence. The precedence is such that simple arithmetic operations do not need parentheses. That is a+1:b-1 represents the sequence $a+1, a+2, \dots, b-1$. Arithmetic sequences prove useful for indexing into a vector.

The colon operator for floating point values may or may not stop at b. Programming this is harder than it seems. The simple example of 1/10:1/10:3/10 should be $1/10, 2/10, 3/10$, but it turns out that on the computer 1/10 + 2*1/10 is actually just larger than 3/10. See the value of 3/10 - (1/10 + 1/10 + 1/10) to investigate. However, the algorithm of : does produce the result with $3$ values here.

1/10:1/10:3/10 |> collect

3-element Vector{Float64}:
 0.1
 0.2
 0.3

The range function creates sequences. The common usage is range(start, stop, length). That is a:h:b specifies a step size, whereas the positional arguments of range specify the number of values between the starting and stopping values. Keyword arguments allow other combinations of start, stop, step, and length. The range returns a similar expression as the colon operator, it does not realize the entire range of values.

Scalar multiplication, scalar division, and addition of like-sized ranges are defined, as they return an arithmetic sequence.

For example, if whale holds beaching numbers for the years 2010 through 2019, we can get the odd years though the following:

oddyrs = 2011:2:2019
whale[oddyrs .- 2010 .+ 1] # 1-based offset is why we add 1

5-element Vector{Int64}:
 122
 111
 111
 133
  79

The fill function creates an array of a specific size, filling it with a value. For example, to create a vector of all 1s of length 10, fill(1, 10) will do so. A tuple is used for the second positional argument to construct higher-dimensional filled arrays.

For more complicated patterns, the repeat function can prove useful. For an array, it creates a new array comprised of repeats of the given array. For a vector, the only use here, two repeating patterns are often desired: repeating the whole vector several times; repeating each entry of a vector several times then combining.

v = [1, 2, 3]
repeat(v, 3) |> show

[1, 2, 3, 1, 2, 3, 1, 2, 3]

In the above, the entire vector is repeated three times. The 3 is passed to the outer argument. To repeat the 1s then the 2s and then the 3s the inner argument is used:

repeat(v, inner=3) |> show

[1, 1, 1, 2, 2, 2, 3, 3, 3]

2.3 Functions

Julia has a simple syntax for user-defined functions.

For simple functions, the syntax borrowed from common mathematics is used. For example, here we define a function to find the mad defined by the median of the transformed data $|x_i - M_x|$.

MAD(x) = median(abs.(x .- median(x)))

MAD (generic function with 1 method)

The function is named MAD (to distinguish it from the already defined mad function in StatsBase) and, as written, accepts a vector and returns a summary number:

MAD(whale)

38.5

For functions which are not one liners, a pair of function-end keywords will define a block. For example, the following computes the fifth standardized moment (skewness and kurtosis related to the 3rd and 4th). The first line is one way to document a function in Julia.

"Compute 5th standardized momemt: m_5 / m_2^(5/2)"
function fifth_sm(x)
    xbar, n = mean(x), length(x)
    m5 = sum((xi - xbar)^5 for xi in x) / n
    m2 = sum((xi - xbar)^2 for xi in x) / n
    m5 / m2^(5/2)
end
fifth_sm(whale)

3.6325544657722215

(The above uses a generator, created by the use of for and in to loop over the different values of x rather than broadcasting.)

The repetition above in m5 and m2 could be avoided if we made a function to compute the sample moments about the mean which accepted both the data and a value for the exponent:

sample_moment(x, n=2) = sum((xi - x)^n for xi in x) / length(x)
fifth_sm(x) = sample_moment(x, 5) / sample_moment(x)^(5/2)

fifth_sm (generic function with 1 method)

The variable n is in the second position and has a default value of 2 which is employed in the denominator of the above, where sample_moment is called with just a single argument.

There can be many positional arguments, only the last ones can have default values specified.

Functions can have a variable number of arguments. Here is a way to find the proportions of a set of numbers that is not stored in a container, but rather is passed to the function separated by commas:

proportion(xs...) = collect(xs) / sum(xs)

proportion (generic function with 1 method)

The splat syntax ... indicates a variable number of arguments in a function definition, and can be used to expand a list of arguments when used inside a function call. The use of collect, above, is needed above to generate a vector, as xs is passed to the body of the function as a tuple and tuples do not have division defined for them.¹⁸

The mad function from StatsBase has signature mad(x; center=median(x), normalize=true). This shows the use of keyword arguments. These have a default value, which, as illustrated, can depend on the data passed in. To call a function without the default, the keyword is typed, as in:

mad(whale; center=mean(whale))

74.13011092528009

(We use a semicolon to separate positional arguments from keyword arguments, as that is needed to define a keyword, but commas can be used to call a keyword argument. What is important is that any keywords come last when a function is called.)

Functions, as defined above, are methods of a generic function. That is, there can be more than one method for a given name. (There are over 200 methods for the generic function named + in base Julia – and packages can extend this even more.) To direct or dispatch a call to the appropriate method, Julia considers the number and types of its positional arguments. That is, like +, functions can be defined differently for integers and floating point values.

We might like our MAD function to be more graceful than to throw a MethodError if a vector of strings is passed to it. A vector of strings has type Vector{String} so we could make a method just for that type:¹⁹ `Julia makes adding methods easy, but the types that are used to extend the function shouldn’t be owned by other packages, as this is considered type-piracy. (Failing to do so may prompt a request for a letter of marque.)

MAD(x::Vector{String}) = print("Sorry, MAD is not defined for vectors of strings")
MAD(["one", "three", "four"])

Sorry, MAD is not defined for vectors of strings

(There are many different types that one might wish to exclude and there are many tricks to efficiently code for this. It is common to define a default method which errors and then a special case for the types that can be worked with. Additionally, as types can be concrete, as the above, or abstract, it is possible to parameterize the type used for dispatch so that subtypes can be identified. For example, some operations return a SubString not a String. A vector of SubString will not match Vector{String} as even though both hold string data. The subtleties of parameterization are necessary to understand to write many special cases, but won’t be necessary in these notes.)

A higher order function is one that takes one or more functions as an argument. As example is the calling style of findfirst(predicate, x) where predicate is a function which returns a boolean value. Higher-order functions are widely used with data wrangling. For such uses it is convenient to be able to define an anonymous function. These do not create methods for a generic function, as they have no name (though anonymous functions can be assigned a name).

Anonymous functions are easily defined through the pattern: argument(s) -> body, where body, and, when multi-line, can be contained within begin/end blocks.

For example, to find the index of the first year that there were 200 or more whale beachings we have:

findfirst(x -> x >= 200, whale)

For the tasks of logical comparisons, there are partially-applied versions of the operators that are essentially defined like anonymous functions (cf. Base.Fix2). In particular, >=(200) can be used for the anonymous function x -> x >= 200. To illustrate, here we see the index of first year where there were 200 or more beachings and the index of the last year, using ! to negate a call (even though >=(200) would be identical):

findfirst(>=(200), whale), findlast(!<(200), whale)

(3, 7)

Another example would be to filter those values less than 100. The filter(predicate, v) function does this:

filter(<(100), whale)

2-element Vector{Int64}:
 74
 79

Broadcasting alternatives; iteration

As a quick illustration of a few other concepts, we consider alternatives to broadcasting that can prove useful. Consider the simplest case of broadcasting a function f over a single collection x.

For example here is broadcasting:

x = [1, 4, 9]
f(x) = sqrt(x)
f.(x)

3-element Vector{Float64}:
 1.0
 2.0
 3.0

This use of broadcasting is also called the mapping of f over x. The map(f, x) function is defined:

map(f, x)

3-element Vector{Float64}:
 1.0
 2.0
 3.0

Either broadcasting or map do the following: for each element of x apply the function f. This can be represented with the chaining operator, |>, adjusted to broadcast the values of x:

x .|> f

3-element Vector{Float64}:
 1.0
 2.0
 3.0

The iteration over x can be made explicit with a for loop. The essential syntax would include:

out = Any[]
for xi in x
   push!(out, f(xi))
end
out

3-element Vector{Any}:
 1.0
 2.0
 3.0

For loops require extra effort for accumulation, which, in the above, required the selection of a container. We used Any[] to create a zero-length container that can hold any object and push values onto this. If would be preferable to have a concrete type, which in this case is just Float64[], but, in general, that requires some peeking into the output of f. This is to point out some background work performed by map and broadcasting that simplify other common tasks.

The for xi in x part of the for loop assigns xi to each iterate of x. There are some iterations that return more than one value at once, and it is common to see these destructured in the syntax. For example, enumerate is a helper function which takes an iterable object, like a vector, and iterates over both the index and the value:

for (i, xi) in enumerate(x)
  print("element $i is $xi. ")
end

element 1 is 1. element 2 is 4. element 3 is 9.

A comprehension is a good alternative to a for loop when accumulation is required and the computation for each iterate is simple to express. Comprehensions have the syntax [ex for x in xs] and additional syntax for multiple iterations. The expression ex can use the variable x; xs is some iterable, such as a vector. For example, we might have this alternative to f.(x)

[xi - mean(x) for xi in x]

3-element Vector{Float64}:
 -3.666666666666667
 -0.666666666666667
  4.333333333333333

Comprehensions use generators, which can also be used for many other functions, such as sum, which was previously illustrated. This form has the advantage of not needing to allocate temporary space to compute. For example sum([xi for xi in x]) would have to find space for the vector created by the comprehension, but the similar – and easier to type – sum(xi for xi in x) would not.

In this example, we sum the squared differences, passing an optional function to sum:

f(x) = x^2
sum(f, xi - mean(whale) for xi in whale)

46020.399999999994

Various convenient iterators in `Julia`.
Iterator	Description
`eachindex`	iterate over each index
`values`	iterate over value in a container
`enumerate`	iterate over index and value in a container
`keys`	iterate over keys for a container
`pairs`	iterate over (key,value) pairs in a container
`zip`	iterate over multiple iterators; the value is a tuple with an element from each
`eachcol`	for tabular data, iterate over the columns
`eachrow`	for tabular data, iterate over the rows

2.4 Numeric summaries

The StatsBase package provides numerous functions to compute numeric summaries of a univariate data set.

For measures of center we have the mean (average), median (middle value), and mode (most common value):

mean(whale), median(whale), mode(whale), mean(trim(whale, prop=0.2))

(152.4, 127.5, 111, 140.66666666666666)

The trimmed mean is computed by composing mean with the trim function which returns a generator producing the trimmed values (with proportion to trim specified to prop).

For measures of spread we have the standard deviation, the median absolute deviation, and the inter-quartile range ($Q_3 - Q_1$):

std(whale), mad(whale), iqr(whale)

(71.5078861229849, 57.08018541246567, 86.25)

The quantile(x, p) function returns measures of position. Keywords alpha and beta can be used to adjust the algorithm employed. The 0 quantile is the minimum, the 1 quantile is the maximum, and 0.5 the median, or middle value. The $p$th quqntile is an interpolated value that aspires to split the data with $p\cdot 100$% of the values to the left and $(1-p)\cdot 100$% of the values to the right. The p specified to quantile may be an iterable, useful to computing more than one at a time. Here we see that the type of p is used to compute the output type:

quantile(whale, (0, 1//2, 1.0))

(74, 255//2, 292.0)

The quantile function is used internally by StatsBase: the iqr is defined by the difference of quantile(x, [1/4, 3/4]); the summarystats function returns a summary of a data set, with the 5-number summary of the quantiles (p=0:1/4:1), the mean, the length, and the number of missing data values.

summarystats(whale)

Summary Stats:
Length:         10
Missing Count:  0
Mean:           152.400000
Std. Deviation: 71.507886
Minimum:        74.000000
1st Quartile:   111.000000
Median:         127.500000
3rd Quartile:   197.250000
Maximum:        292.000000

A measure of position gives a sense of how large a value is relative to the data set. Knowing a value is the 20th percentile, say, says it is larger than 20 percent of the data and smaller than 80 percent. The percent of data less or equal a value can be computed by sum(data .<= value) or sum(x <= value for x in data).

The $z$-score is a different measure of position. It measures differences from the mean on a scale of standard deviations. The $z$ score is computed by (value - mean(data))/std(data) or for all values at once with:

zs = (whale .- mean(whale)) / std(whale)

10-element Vector{Float64}:
 -1.0963825705204249
 -0.4251279355079199
  1.155117351084019
 -0.5789571226982856
  1.9522322301613686
 -0.5789571226982856
  0.8194900335777664
 -0.2712987483175542
  0.05034409762593779
 -1.0264602127066222

A rule of thumb, based on a certain bell-shaped distribution, is that values larger than $3$ in absolute value are unusual, none of which are seen in this data.

The $z$-score is the typical form of standardization and is also supported directly through either of the following:

zs1 = zscore(whale)
zs2 = standardize(ZScoreTransform, Float64.(whale))

zs == zs1 == zs2

true

Why so many measures?

There are two main measures: those based on differences and those based on position.

The mean is defined by

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i. \]

A property of the mean is when the data is centered by the mean, i.e. take the transformation $y_i = x_i - \bar{x}$, then the mean of the $y_i$ is just $0$. Put another way, the mean is the point where the differences to the left and right of the mean even out when added.

The standard deviation, as a sense of scale, has the property that if the data is centered and then scaled by the standard deviation, the center will be $0$ and the scale will be $1$. That is, the $z$-scores have mean $0$ (as they data is centered) and standard deviation $1$ (as the data is scaled) – the $z$ scores speak to the “shape” or distribution of values of a data set.

These measures are sensitive – or not resistant – to one or more outlying values. For example, the average wealth of people in a bar changes dramatically if someone like a pre-crash Elon Musk walks in. The standard deviation is similar. So measures based on position are better when data is skewed, especially if heavily skewed. For these, the median and IQR are not impacted greatly by one large value. That is they are resistant to outliers. The extreme values (the minimum and maximum) are less so. As such, the range of the data is a poor measure of spread, the range of the middle quartiles (the IQR) a much more resistant measure of spread.

Various measures of center, spread, and position in a univariate data set.
Measure	Type	Description
`mean`	center	Average value
`median`	center	Middle value
`mean ∘ trim`	center	Trimmed mean
`var`	spread	“Average” squared distance from mean
`std`	spread	The square of the variance
`mad`	spread	The median absolute deviation from center
`IQR`	spread	Range of middle half of data
`extrema`	spread	Smallest and largest values
`quantile`	position	Value where data would be split for give proportion
`summarystats`	position	The quartiles and extrema
`zscore`	position	Standardizing transformation

2.5 Shape

The five-number summary gives some insight into the shape of a distribution, in that it can show skewness. The skewness function numerically quantifies this:

skewness(whale)

0.7785957149991207

Perfectly symmetric data has $0$ skew.

The kurtosis function numerically summarizes if a distribution has long or short tails compared to a benchmark distribution, the normal:

kurtosis(whale)

-0.5833928499406711

These numeric summaries are helpful, but to best see the shape of a distribution graphical summaries are suggested.

A stem or stem-and-leaf plot is a useful summary of a data set that can be computed easily by hand for modest-sized data. In short, numbers are coded in terms of a stem and a leaf. The stems are not repeated, so the data can be more compactly displayed. The data is organized so that the extrema, the median, and the shape of the data can be quickly gleaned.

A stem-and-leaf plot function is not part of the StatsBase package.²⁰ This one, below, is modified from a RosettaCode submission. It is simplified by assuming non-negative data. Essentially, it finds the right place to split a number into two values, a stem and a leaf. Then it displays each.

using Printf
function stemleaf(a::Array{T,1}, leafsize=1) where {T<:Real}
    ls = 10^floor(Int, log10(leafsize))
    aa = floor.(Int, sort(a)/ls)
    out = divrem.(aa, 10)
    stem, leaf = first.(out), last.(out)
    io = IOBuffer()
    for i in stem[1]:stem[end]
        i != 0 || println(io, "")
        print(io, (@sprintf "%10d | " i))
        println(io, join(map(string, leaf[stem .== i]), ""))
    end
    println(io, " Leaf Unit = " * string(convert(T, ls)))
    String(take!(io))
end

stemleaf (generic function with 2 methods)

For the whale data, which is spread out over a wide range, we use a leaf size of $10$:

stemleaf(whale, 10) |> print


         0 | 77
         1 | 11235
         2 | 139
 Leaf Unit = 10

We can identify 0|7 which is 0*100 + 70 or $70$ as the “smallest” value. The actual smallest value is $74$, but the data is rounded for this graphic so that just two digits (the leaf always is represented by one, the stem may use more than one) can represent all the values. Similarly $290$ is identified as the largest values (which is actually $292$).

For graphical summaries of data, we will introduce the StatsPlots package, which adds to the Plots package recipes for many statistical graphics. In later chapters we will use a different plotting package, but for now, the StatsPlots package provides easy to use and understand methods to produce the desired graphics.

using StatsPlots

For a univariate data set, graphics are used to help gauge the center, spread, and shape of a distribution of values. Different graphics have their advantages. In the following, we illustrate the dot plot, the box plot, the histogram, the density plot, and the quantile-normal plot. There are basic recipes for each graphic that require little more than specifying which data to use, though we do adjust a few default values through keyword arguments, such as legend.

A dot plot shows the data on a number line, traditionally horizontally, and with circles (dots) representing each data point. For data with ties, like the whale data set, there are many techniques to disamgibuate the data in the graphic: the tied data may be stacked, using the $y$ direction; the data may be jittered in the $x$ direction, adding a small random to each value; the data may be jittered in the $y$ direction. In StatsPlots, with the dotplot recipe, values are jittered when plotted vertically or when plotted with just the $x$ values. Here we use the dotplot(x, y) method with x representing the data to display. For this, we manually jitter the x values and we make a vector of values, y all with the same value, in this case 1 as the one function makes this convenient:

jitter(x, scale=1/4) = x + randn(length(x)) * scale

p1 = dotplot(jitter(whale), one.(whale); legend=false);

In Figure 2.1 the graph appears on the left side.

A dot plot shows all the data from which the reader can gauge the center, spread, and shape easily enough. For a small number of data points, such as with the whale data it is a very good graphic for assessing the distribution of values. However, for a large number of values, the graphic can get too crowded, and alternatives are useful.

The box plot or box-and-whisker plot of Tukey is one such excellent alternative.

The boxplot summarizes many aspects of a data set with basically just the five-number summary presented as a box with whiskers:

The center is shown by the middle line of the box set at the median
The spread is seen through the IQR, the length of the box, which is set at $Q_1$ and $Q_3$.
The shape (skewed or symmetric) can be gauged by how different the different quartiles are: the first quartile is from the smallest illustrated value to the box; the second the bottom half of the box; the third the top half of the box; and the fourth the top of the box to the largest value.
Outliers are illustrated as dots and identified by a rule: any value more the $1.5$ IQRs from $Q_3$ or less than $1.5$ IQRs from $Q_1$ are not included in the whisker, but instead are illustrated with a dot. The whisker_range argument can be set to adjust that multiplier $1.5$.
The range is seen through the whiskers or identified outliers.

Box plots are excellent graphics for comparing several distributions, and usually appear vertically oriented in such usage, the default orientation. Here, with a single data set, we show how to use the orientation argument to specify a horizontal layout. For this boxplot, we also add in a dot plot with each dot colored by its respective quartile.

p2 = boxplot(whale; orientation=:horizontal, legend=false)

Q1, Q2,Q3 = quantile(whale, (1/4, 1/2, 3/4))
quantile_color(x) = x < Q1 ? "red" : x < Q2 ? "yellow" : x < Q3 ? "blue" : "pink"
whale′ = jitter(sort(whale))
dotplot!(p2, whale′, 1/2 .+ zero.(whale′), color=quantile_color.(whale′));

The right-hand graphic in Figure 2.1 shows the boxplot. A careful comparison shows the interpolation used to identify the quartiles. This graphic has no identified outliers. A slight skew is seen, as the right half of the figure is longer than the left half, though this may be an artifact of sampling, and not a fact about the underlying population.

Figure 2.1: The whale data summarized with a dot plot on the left and a box plot and accompanying dot plot on the right.

A histogram is a another graphic well suited for displaying large data sets. It presents an illustration of what values are seen and also how prevalent that size value is. It does this by selecting a number of “bins” over the range of the data and counting the number of values in that bin, similar to what happens in a stem-and-leaf plot. This number is then represented using a bar with the property that that area of the bar is proportional to the frequency of data being in that bin. This proportion may just represent the count (if all the bins have the same length), or may be scaled so the total area is $1$ (as these graphics suggest the shape of the underlying probability density function, which has total area $1$). The histogram function in StatsPlots offers still more means to scale the bars; adjustments specified through the normalize argument.

p1 = histogram(whale; bins = 5, legend=false)

dotplot!(p1, jitter(whale), 1/5 .+ zero.(whale), markersize=10);

This graphic is illustrated in Figure 2.2. To make it, we adjusted the default number of bins through the bins argument, and added a dotplot with somewhat larger dots, to illustrate the data being counted. The height of each bar should reflect the number of data points within the bin associated to the bar.

Histograms are great for seeing the center of the data (the balancing point of the graphic), the spread, and the shape. However, when the are many bins, the graphic gets busy. The key information being conveyed is the general shape of the top of each bar, not the vertical lines producing the bar. The stephist function avoids these lines, though the density plot is a more widely used alternative.

A density plot also illustrates the distribution of values in a data set along with how they cluster, however, it smooths out the binning process. Density plots are scaled to have area $1$, as they will be seen to suggest the shape of a probability density function. In Figure 2.2 we create a density plot overlaid on a histogram, which is normalized to have area $1$:

p2 = stephist(whale; bins=5, normalize=:pdf, alpha=0.75, legend=false)

density!(p2, whale; linewidth=5);

Figure 2.2: The whale data summarized with a histogram on the left and a density plot on the right.

A quantile-normal plot is a graphic to assess if a data set comes from a normal distribution, a reference bell-shaped distribution. The graphic is a scatter plot of a quantile of the data set with a corresponding quantile of the reference distribution.

If the data set has a roughly bell-shaped distribution with “typical” tails, then the dots will mostly fall on a straight line;
if the data set is skewed one edge will deviate from the line;
if the data set is symmetric, but has long tails these will show a deviations in both edges of the line.

The qqnorm function produces Figure 2.3:

qqnorm(whale; legend=false)

Figure 2.3: A quantile-normal plot of the whale data

For some cases, where show takes too much horizontal space, permutedims is used for a similar purpose, as it may elide some values.↩︎
The parentheses for constructing tuples with 2 or more elements are technically optional, a fact that is useful for bundling different outputs into one↩︎
The output of the following command is displayed as a tuple, as the commas used create the tuple, even if not enclosed in parentheses.↩︎
The var implementation is via a string macro, and is documented through @var_str.↩︎
In Julia there are options to have offsets for indexing, e.g. OffsetArrays that allow for other access patterns, such as would be useful for zero-based indexing.↩︎
This example uses the colon operator to produce a range of values between it two arguments with increment of $1$.↩︎
Arrays can be $n$ dimensional, so there may be $1$ or more index to specify↩︎
Though : does, as it is a function, not a keywoard.↩︎
The : copies on the right-hand side, but does in-place assignment on the left-hand side↩︎
The DataFrames package is almost always utilized, as it provides a fundamental type to work with tabular data, the data frame. With data frames, the allowmissing! function is used, as well, for this task.↩︎
The allowmissing function creates a union type with Missing in addition to the original data type. It some printouts, a ? is appended to the type name to indicate this addition.↩︎
Also the @. macro can automatically broadcast all operations.↩︎
The colon operator a:b returns an iterator to produce “a, a+1, …, b”.↩︎
The expression s^i just calls repeat(s, i), for these types.↩︎
These are illustrated in the stemplot function defined later.↩︎
There are also Demi (3oz), short (8oz), and Trenta (31oz).↩︎
Different date formats are possible, see ?DateFormat for details.↩︎
Tuples and vectors are at times interchangeable, but tuples do not have the associated algebraic operations defined for them, as they need not be homogeneous. With data frames, they are treated quite differently.↩︎
To extend the mad function from StatsBase requires the extra step of importing that function or qualifying it with its module, as in StatsBase.mad(...) = ....↩︎
Stem-and-leaf plots are very useful for organizing data by hand, but better graphics can be used to illustrate data and its distribution on the computer.↩︎