3 Tabular data
Some data sets are naturally stored in a tabular format, like a spreadsheet. This chapter looks at such data.
In the following, we will utilize these basic packages, among others introduced along the way:
using StatsBase, StatsPlots
3.1 Data frames
The basic framework for exploratory data analysis is a data set with one or more observations for \(n\) different cases. For each case (or observation) a variable will be a length-\(n\) data set, with values measuring the same type of quantity, or perhaps missing. The variables can be stored as vectors. The collection of variables is traditionally arranged in a rectangular manner, with each row representing the measurements for a given case and each column representing the measured values for each variable. That is, the columns are homogeneous (as they measure the same thing), but the rows need not be (as different variables are measuring different quantities).
Matrices contain rectangular data, but are homogeneous. As such, a new structure is needed to store tabular data in general. A data frame is a concept borrowed from S-PLUS, where data frames were introduced in 1991. The R language has data frames as a basic built-in data container. In Julia the structure is implemented in an external package, DataFrames.
using DataFrames
The DataFrames package dates back to 2012, at least, and must be one of Julia’s oldest packages. Julia was on version 0.1 then. It is also under continual development and has a comprehensive set of documentation and tutorials. Bouchet-Valat and Kamiński (2023) has an excellent overview, including comparisons to implementations in other languages and context for certain design decisions. This introduction attempts to illustrate some basic usage; consult the provided documentation for additional details and applications.
3.1.1 Data frame construction
There are different ways to construct a data frame.
Consider the task of the Wirecutter in trying to select the best carry-on travel bag. After compiling a list of possible models by scouring travel blogs, etc., they select some criteria (capacity, compartment design, aesthetics, comfort, …) and compile data, similar to what one person collected in a spreadsheet. Here we create a much simplified spreadsheet for 3 listed bags with measurements of volume, price, laptop compatibility, loading style, and a last-checked date – as this market changes constantly.
product v p l loads checked
Goruck GR2 40 395 Y front panel 2022-09
Minaal 3.0 35 349 Y front panel 2022-09
Genius 25 228 Y clamshell 2022-10
We see that product is a character, volume and price numeric, laptop compatibility a Boolean-like value, load style one of a few levels, and the last checked date a year-month date.
We create vectors to hold each. We load the CategoricalArrays and Dates packages for a few of the variables:
using CategoricalArrays, Dates
product = ["Goruck GR2", "Minaal 3.0", "Genius"]
volume = [40, 35, 25]
price = [395, 349, 228]
laptop_compatability = categorical(["Y","Y","Y"])
loading_style = categorical(["front panel", "front panel", "clamshell"])
date_checked = Date.(2022, [9,9,10])
3-element Vector{Date}:
2022-09-01
2022-09-01
2022-10-01
With this, we use the DataFrame constructor to combine these into one data set.
d = DataFrame(product = product, volume = volume, price = price,
              var"laptop compatability" = laptop_compatability,
              var"loading style" = loading_style, var"date checked" = date_checked)
Row | product | volume | price | laptop compatability | loading style | date checked |
---|---|---|---|---|---|---|
String | Int64 | Int64 | Cat… | Cat… | Date | |
1 | Goruck GR2 | 40 | 395 | Y | front panel | 2022-09-01 |
2 | Minaal 3.0 | 35 | 349 | Y | front panel | 2022-09-01 |
3 | Genius | 25 | 228 | Y | clamshell | 2022-10-01 |
The call to the constructor is like that of a named tuple and, again, var"..." is employed for names with spaces.
As called above, the constructor binds vectors of equal length into a new multi-variate container.
The describe
function will summarize a data frame using different summaries for different types:
describe(d)
Row | variable | mean | min | median | max | nmissing | eltype |
---|---|---|---|---|---|---|---|
Symbol | Union… | Any | Any | Any | Int64 | DataType | |
1 | product | Genius | Minaal 3.0 | 0 | String | ||
2 | volume | 33.3333 | 25 | 35.0 | 40 | 0 | Int64 |
3 | price | 324.0 | 228 | 349.0 | 395 | 0 | Int64 |
4 | laptop compatability | Y | Y | 0 | CategoricalValue{String, UInt32} | ||
5 | loading style | clamshell | front panel | 0 | CategoricalValue{String, UInt32} | ||
6 | date checked | 2022-09-01 | 2022-09-01 | 2022-10-01 | 0 | Date |
In the above construction, we repeated the names of the variables to the constructor. A style, also akin to the construction of a named tuple, could have been used, where variables after a semicolon are implicitly assumed to have the desired name:
d = DataFrame(; product, volume, price,
              var"laptop compatability" = laptop_compatability,
              var"loading style" = loading_style, var"date checked" = date_checked)
Row | product | volume | price | laptop compatability | loading style | date checked |
---|---|---|---|---|---|---|
String | Int64 | Int64 | Cat… | Cat… | Date | |
1 | Goruck GR2 | 40 | 395 | Y | front panel | 2022-09-01 |
2 | Minaal 3.0 | 35 | 349 | Y | front panel | 2022-09-01 |
3 | Genius | 25 | 228 | Y | clamshell | 2022-10-01 |
(In fact, with the rename! function, or other conveniences, it can be more direct to define the data frame, then rename the columns, avoiding the var above.)
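To illustrate the parenthetical remark, here is a minimal sketch of the construct-then-rename approach; the frame `d2` and its two columns are hypothetical, not from the example above:

```julia
using DataFrames

# build with legal identifier names first ...
d2 = DataFrame(product = ["Goruck GR2", "Minaal 3.0", "Genius"],
               laptop_compatability = ["Y", "Y", "Y"])
# ... then rename the column to the desired display name with a space
rename!(d2, :laptop_compatability => "laptop compatability")
names(d2)  # ["product", "laptop compatability"]
```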
We see that the DataFrame
type displays values as we might expect with the column header showing the name of the variable and the storage type. The row numbers are added and may be used for reference to the data values.
There is a convention for attaching new columns to a data frame that allows another alternate construction:
d = DataFrame() # empty data frame
d.product = product
d.volume = volume
d.price = price
d."laptop compatability" = laptop_compatability
d."loading style" = loading_style
d."date checked" = date_checked
d
Row | product | volume | price | laptop compatability | loading style | date checked |
---|---|---|---|---|---|---|
String | Int64 | Int64 | Cat… | Cat… | Date | |
1 | Goruck GR2 | 40 | 395 | Y | front panel | 2022-09-01 |
2 | Minaal 3.0 | 35 | 349 | Y | front panel | 2022-09-01 |
3 | Genius | 25 | 228 | Y | clamshell | 2022-10-01 |
This uses one of a few different ways to refer to a column in a data frame.
As another alternative, the insertcols! function can insert multiple columns; for example, the last one in the example could have been written insertcols!(d, "date checked" => date_checked).
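A short sketch of insertcols! adding more than one column at once; the frame `d2` here is a made-up example, not the chapter's `d`:

```julia
using DataFrames, Dates

d2 = DataFrame(product = ["Goruck GR2", "Minaal 3.0", "Genius"])
# several name => values pairs can be given in one call; with no position
# specified, the new columns are appended at the end
insertcols!(d2, "volume" => [40, 35, 25],
                "date checked" => Date.(2022, [9, 9, 10]))
names(d2)  # ["product", "volume", "date checked"]
```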
Before exploring how we might query the data frame, we note that an alternate means to describe the data set is by row, or case. For each row, we might use a named tuple with the names indicating the variable and the values the measurement. The DataFrame constructor would accept this; here we shorten the names and only do one row (for now):
d = DataFrame([
    (b = "Goruck GR2", v = 40, p = 395, lap = "Y", load = "front panel", d = Date("2022-09-01"))
])
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Int64 | String | String | Date | |
1 | Goruck GR2 | 40 | 395 | Y | front panel | 2022-09-01 |
Here, the variable types are different, as it takes more effort to make one categorical. The same way we defined a column allows us to redefine that column (replacing the container, not re-using it for storage). For example, we could do:
d.lap = categorical(d.lap)
d.load = categorical(d.load)
1-element CategoricalArray{String,1,UInt32}:
"front panel"
and then the two variables would be categorical.
There are other ways to define row by row:
- A data frame with empty variables can be constructed and then values as tuples may be push!ed onto the data frame:
push!(d, ("Minaal 3.0", 35, 349, "Y", "front panel", Date("2022-09-01")))
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Int64 | Cat… | Cat… | Date | |
1 | Goruck GR2 | 40 | 395 | Y | front panel | 2022-09-01 |
2 | Minaal 3.0 | 35 | 349 | Y | front panel | 2022-09-01 |
A named tuple can also be used.
- Similarly, rows specified by dictionaries may be pushed onto a data frame
push!(d, Dict(:b => "Genius", :v => 25, :p => 228, :lap => "Y",
:load => "clamshell", :d => Date("2022-10-01")))
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Int64 | Cat… | Cat… | Date | |
1 | Goruck GR2 | 40 | 395 | Y | front panel | 2022-09-01 |
2 | Minaal 3.0 | 35 | 349 | Y | front panel | 2022-09-01 |
3 | Genius | 25 | 228 | Y | clamshell | 2022-10-01 |
(A dictionary is a key => value
container like a named tuple, but keys may be arbitrary Julia
objects – not always symbols – so we explicitly use symbols in the above command.)
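The named-tuple variant of push! mentioned above can be sketched as follows; the small frame `d2` is illustrative only:

```julia
using DataFrames

# an empty frame with typed, empty columns
d2 = DataFrame(b = String[], v = Int[], p = Int[])
# a named tuple with matching names may be pushed, just like a plain tuple
push!(d2, (b = "Genius", v = 25, p = 228))
```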
Tables interface
In the Julia ecosystem, the Tables package provides a lightweight means to describe tabular data, like a data frame; TableOperations provides some basic means to query the data. The different ways in which a data frame can be defined reflect the different ways Tables objects can be created.
As tables are viewed as rows of homogeneous types, a vector of named tuples is a natural description; tables are also naturally viewed as a collection of named columns, each column a vector. As such, both of these specify a table:
- As a struct of arrays. This is used by calling DataFrame(a=[1,2], b=[2,1]), say.
- As an array of structs. This is used by calling DataFrame([(a=1,b=2), (a=2,b=1)]).
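The two views can be recovered from a data frame with the Tables API; a brief sketch (the frame `df` is the same toy data as above):

```julia
using DataFrames, Tables

df = DataFrame(a = [1, 2], b = [2, 1])
# column-oriented view: a named tuple of vectors ("struct of arrays")
ct = Tables.columntable(df)  # (a = [1, 2], b = [2, 1])
# row-oriented view: a vector of named tuples ("array of structs")
rt = Tables.rowtable(df)     # [(a = 1, b = 2), (a = 2, b = 1)]
```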
RDatasets
Data frames are often read in from external sources, and not typed in. The RDatasets package has collected data sets from many different R packages and bundled them into data frames for Julia. We will use some of these data sets in the examples. To load a data set we need to know its R package and R name:
using RDatasets
cars = dataset("MASS", "Cars93") # package, data set name
first(cars, 2)                   # first 2 rows only
Row | Manufacturer | Model | Type | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | AirBags | DriveTrain | Cylinders | EngineSize | Horsepower | RPM | RevPerMile | ManTransAvail | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | RearSeatRoom | LuggageRoom | Weight | Origin | Make |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cat… | String | Cat… | Float64 | Float64 | Float64 | Int32 | Int32 | Cat… | Cat… | Cat… | Float64 | Int32 | Int32 | Int32 | Cat… | Float64 | Int32 | Int32 | Int32 | Int32 | Int32 | Float64? | Int32? | Int32 | Cat… | String | |
1 | Acura | Integra | Small | 12.9 | 15.9 | 18.8 | 25 | 31 | None | Front | 4 | 1.8 | 140 | 6300 | 2890 | Yes | 13.2 | 5 | 177 | 102 | 68 | 37 | 26.5 | 11 | 2705 | non-USA | Acura Integra |
2 | Acura | Legend | Midsize | 29.2 | 33.9 | 38.7 | 18 | 25 | Driver & Passenger | Front | 6 | 3.2 | 200 | 5500 | 2335 | Yes | 18.0 | 5 | 195 | 115 | 71 | 38 | 30.0 | 15 | 3560 | non-USA | Acura Legend |
CSV.read
For interacting with spreadsheet data, when it is stored like a data frame, it may be convenient to export the data in some format and then read that into a Julia session. A common format is comma-separated values, or “csv” data. The CSV package reads in delimited text data (the source) using the Tables interface and can be instructed to write to a data frame (the sink). The CSV.read function is used below. Here we write to some file to show that we can read it back:
using CSV
f = tempname() * ".csv"
CSV.write(f, d) # write simple data frame to file `f`
d1 = CSV.read(f, DataFrame)
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String15 | Int64 | Int64 | String1 | String15 | Date | |
1 | Goruck GR2 | 40 | 395 | Y | front panel | 2022-09-01 |
2 | Minaal 3.0 | 35 | 349 | Y | front panel | 2022-09-01 |
3 | Genius | 25 | 228 | Y | clamshell | 2022-10-01 |
Comma-separated-value files are simple text files, with no metadata to indicate what type a value may be. Some guesswork is necessary. We observe that the two categorical variables were read back in as strings. As seen above, that can be addressed through column conversion.
The filename may be more general. For example, it could be download(url) for some URL that points to a csv file, to read data from the internet.
The methods read
and write
are qualified in the above usage with the CSV
module. In the Julia
ecosystem, the FileIO
package provides a common framework for reading and writing files; it uses the verbs load
and save
. This can also be used with DataFrames
, though it works through the CSVFiles
package – and not CSV
, as illustrated above. The read command would look like DataFrame(load(fname))
and the write command like save(fname, df)
. Here fname
would have a “.csv” extension so that the type of file could be detected.
Command | Description
---|---
CSV.read(file_name, DataFrame) | Read csv file from file with given name
CSV.read(IOBuffer(string), DataFrame) | Read csv file from a string using an IOBuffer
CSV.read(download(url), DataFrame) | Read csv file from a url
open(file_name, "w") do io; CSV.write(io, df); end | Write data frame df to csv file
DataFrame(load(file_name)) | Read csv file from file with given name using CSVFiles
save(file_name, df) | Write data frame df to a csv file using CSVFiles
TableScraper
The TableScraper
package can scrape tables from an HTML file in a manner that can be passed to DataFrame
for conversion to a data frame. The command DataFrame.(scrape_tables(url))
may just work to create the tables, but in practice, only well-formed tables can be scraped successfully; tables used for formatting purposes may fail. TableScraper
is built on Cascadia
and Gumbo
, as is the Harbest
package, which provides more access to the underlying data than TableScraper
.
3.1.2 Column names
The variable names of a data set are returned by the names
function, as strings:
names(d)
6-element Vector{String}:
"b"
"v"
"p"
"lap"
"load"
"d"
This function has a second argument which can be used to narrow or filter the names that are returned. For example, the following is a convenience for matching columns whose element type is a subtype of Real
:
names(d, Real)
2-element Vector{String}:
"v"
"p"
The rename! function allows the names to be changed in-place (without returning a new data frame). There are many different calling patterns, but:
- To replace one name, the form rename!(df, :v => :nv) is used, where :v is the old name and :nv the new one. The form follows the generic replace! function. The pair notation can be broadcast (.=>) to rename more than one column, as with rename!(df, [:v1, :v2] .=> [:nv1, :nv2]).
- To replace all names, the form rename!(df, [...]) is used, where the vector is of length matching the number of columns in df.
3.1.3 Indexing and assignment
The values in a data frame can be referenced programmatically by a row number and column number, both 1-based. For example, the 2nd row and 3rd column of d can be seen to be 349 by observation:
d
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Int64 | Cat… | Cat… | Date | |
1 | Goruck GR2 | 40 | 395 | Y | front panel | 2022-09-01 |
2 | Minaal 3.0 | 35 | 349 | Y | front panel | 2022-09-01 |
3 | Genius | 25 | 228 | Y | clamshell | 2022-10-01 |
but it can be referenced through indexing (row number first, column number second):
d[2, 3]
349
A new value can also be assigned with the same indexing reference. Suppose, that bag went on sale. This new price could be adjusted, as follows:
d[2, 3] = 309 # a sale price
309
Column selectors
The 3rd column has a name, p
in this case. We can use a column selector to avoid having to know which column number a variable is. For this a string or a symbol may be used, as in:
d[2, :p], d[2, "p"]
(309, 309)
More generally, it is possible to use more than 1 column in column selection. For example, to get both the volume and price, for the second bag we have:
d[2, [:v, :p]]
Row | v | p |
---|---|---|
Int64 | Int64 | |
2 | 35 | 309 |
As well, either [2, 3]
or ["v", "p"]
could have been used.
For data frames with many columns, a regular expression can be used. The column selector r"^l" would match all columns whose name begins with l (lap and load in this case). (Regular expressions can be very complicated, but here we only assume that r"name" will match “name” somewhere in the string; r"^name" and r"name$" will match “name” at the beginning and ending of a string.) Using a regular expression will return a data frame row (when a row index is specified) – not a value – as it is possible to return 0, 1, or more columns in the selection.
Column selectors, as seen, may be column number(s), column names as strings, column names as symbols, or regular expressions.
In addition:
- Boolean vectors of a length matching the number of columns (d[2, [false, true, true, false, false, false]] would return the 2nd and 3rd columns).
- The value All(), which selects all columns.
- The value Between(a, b), where a and b are column selectors (such as Between(:v, 3)).
- Not(...), to select those columns not specified in ....
- Cols(a, b, ...), which combines different selectors, such as Cols(:v, 3).
Slices
The All() column selector has a shortcut inherited from the indexing of arrays in base Julia, this being a colon, :. When used in the column position it refers to all the values in the row:
d[2, :]
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Int64 | Cat… | Cat… | Date | |
2 | Minaal 3.0 | 35 | 309 | Y | front panel | 2022-09-01 |
When used in the row position, it refers to all rows:
d[:, 3]
3-element Vector{Int64}:
395
309
228
In the above use, a vector is returned, not a data frame, as only one column was selected. But which vector? Is it a copy of the one that d
holds, or is it the same container? In base Julia
the answer depends on what side of an equals sign the use is on (meaning, if used in assignment the answer is different, as mentioned later). In this usage, a copy is returned.
To get the underlying column (not a copy), the selector ! is used instead:
d[!, 3]
3-element Vector{Int64}:
395
309
228
The notation d.p also refers to the 3rd variable. In this case, d.p is a view of the underlying data (using !) and not a copy.
The notation d[:, :] will refer to all rows and all columns of d, and in this case will be a copy of the data frame.
Assignment
Single values can be assigned, as previously illustrated. These values are assigned to the underlying vector in the data frame, so as with vectors, the assigned value will be promoted to the column type, if necessary. Assigning a value that can’t be promoted will throw an error, as it would were such an assignment tried for a vector. To do such an assignment, the underlying vector must be changed to a wider type.
The row selectors : and ! have different semantics when accessing the underlying data, the first returning a copy, the second a view. When used with assignment, though, : refers to the underlying vector and does in-place assignment, whereas ! (and the dot access) will replace the underlying container with the assigned container.
To illustrate, suppose the prices are rounded down, but actually contain an extra 99 cents. We can reassign price as follows:
d[!, :p] = price .+ 0.99
3-element Vector{Float64}:
395.99
349.99
228.99
as even though price .+ 0.99 is no longer an integer vector, the ! row indicator will instruct d to point to this new vector. Had d[:, :p] = ... been used above, an error would have been thrown, as the underlying container is integer, but the new prices require floating point values to represent them.
Broadcast assignment (using .=) will be similar. The right-hand side is expanded, and then assigned. Attempts to assign the wrong value type using : to indicate the rows will error.
Missing values
Trying to assign a value of missing to a data frame requires the underlying vector to allow missing values. As with vectors, the allowmissing function is useful. For example, to give the 2nd bag a price of missing, we could do:
allowmissing!(d, :p)
d[2, :p] = missing
missing
By using the mutating form (with the trailing !), the data frame d is mutated in the allowmissing! call. This could be observed by looking at the underlying column type.
The new type is a union of the old type and Missing, so we could reassign the old value:
d[2, :p] = 349.99
d
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Float64? | Cat… | Cat… | Date | |
1 | Goruck GR2 | 40 | 395.99 | Y | front panel | 2022-09-01 |
2 | Minaal 3.0 | 35 | 349.99 | Y | front panel | 2022-09-01 |
3 | Genius | 25 | 228.99 | Y | clamshell | 2022-10-01 |
Tables.jl interface
A table, like a data frame, can be expected to loop, or iterate, over columns, say to find which columns are numeric. Or, it can be expected to iterate over rows, say to find which values are within a certain range. What should be the case with a data frame? As there are reasons to prefer either, neither is specified. Rather, there is a difference based on the calling function. Data frames respect the Tables interface, which has methods to iterate over rows or columns. When iterating over columns, the individual vectors are given; when iterating over rows, a DataFrameRow object is returned. Calling copy on these row objects will return a named tuple.
The basic iterators are eachrow
and eachcol
. For example, to see the column type for a data frame, we can broadcast eltype
over the variables returned by eachcol
:
eltype.(eachcol(d)) |> permutedims
1×6 Matrix{Type}:
String Int64 Union{Missing, Float64} … Date
To get the data frame as a vector of named tuples, broadcasting NamedTuple over eachrow can be used: NamedTuple.(eachrow(d)); Base.copy is an alias. Tables.rowtable(d) will produce this vector of named tuples, as well.
3.1.4 Sorting
A data frame can be sorted by specifying a permutation of the indices, such as is returned by sortperm. For example, d[sortperm(d.p), :] will sort by price in increasing order. The keyword argument rev can be passed true to reverse the default sort order.
The generic sort function has a method for data frames that offers more convenient functionality. The sort! method does in-place sorting. The simplest call is to pass in a column to sort by, as in
sort(d, :p)
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Float64? | Cat… | Cat… | Date | |
1 | Genius | 25 | 228.99 | Y | clamshell | 2022-10-01 |
2 | Minaal 3.0 | 35 | 349.99 | Y | front panel | 2022-09-01 |
3 | Goruck GR2 | 40 | 395.99 | Y | front panel | 2022-09-01 |
This sorts by the price (:p
) variable in increasing order. To sort by decreasing order a keyword argument rev=true
can be specified:
sort(d, "v"; rev=true)
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Float64? | Cat… | Cat… | Date | |
1 | Goruck GR2 | 40 | 395.99 | Y | front panel | 2022-09-01 |
2 | Minaal 3.0 | 35 | 349.99 | Y | front panel | 2022-09-01 |
3 | Genius | 25 | 228.99 | Y | clamshell | 2022-10-01 |
There can be one or more column selectors. For this, the order function can be used to reverse just one of the columns, as in the following, which first sorts the dates in reverse order (newest first) and then, within a date, uses lowest price first to sort:
sort(d, [order("d", rev=true), :p])
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Float64? | Cat… | Cat… | Date | |
1 | Genius | 25 | 228.99 | Y | clamshell | 2022-10-01 |
2 | Minaal 3.0 | 35 | 349.99 | Y | front panel | 2022-09-01 |
3 | Goruck GR2 | 40 | 395.99 | Y | front panel | 2022-09-01 |
3.1.5 Filtering rows
Filtering, or subsetting, is a means to select a specified subset of rows.
Let’s consider the cars
data set, previously defined by loading dataset("MASS", "Cars93")
. This data set consists of 27 measurements on 93 cars from the 1990s. The data set comes from R’s MASS
package, and is documented there.
Two ready ways to filter are by using specific indices, or by using Boolean indices generated by a logical comparison.
For the cars data set, the latter can be used to extract the Volkswagen models. To generate the indices we compare the :Manufacturer variable with the string Volkswagen to return a vector of length 93 that indicates if the manufacturer is VW. This is done with broadcasting: cars.Manufacturer .== "Volkswagen". We use the easy dot access to variables in cars; alternatively, cars[:, :Manufacturer] (or, more efficiently, !, as is used by the dot access) could be used were :Manufacturer coming from some variable. With this row selector, we have:
cars[cars.Manufacturer .== "Volkswagen", :]
Row | Manufacturer | Model | Type | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | AirBags | DriveTrain | Cylinders | EngineSize | Horsepower | RPM | RevPerMile | ManTransAvail | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | RearSeatRoom | LuggageRoom | Weight | Origin | Make |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cat… | String | Cat… | Float64 | Float64 | Float64 | Int32 | Int32 | Cat… | Cat… | Cat… | Float64 | Int32 | Int32 | Int32 | Cat… | Float64 | Int32 | Int32 | Int32 | Int32 | Int32 | Float64? | Int32? | Int32 | Cat… | String | |
1 | Volkswagen | Fox | Small | 8.7 | 9.1 | 9.5 | 25 | 33 | None | Front | 4 | 1.8 | 81 | 5500 | 2550 | Yes | 12.4 | 4 | 163 | 93 | 63 | 34 | 26.0 | 10 | 2240 | non-USA | Volkswagen Fox |
2 | Volkswagen | Eurovan | Van | 16.6 | 19.7 | 22.7 | 17 | 21 | None | Front | 5 | 2.5 | 109 | 4500 | 2915 | Yes | 21.1 | 7 | 187 | 115 | 72 | 38 | 34.0 | missing | 3960 | non-USA | Volkswagen Eurovan |
3 | Volkswagen | Passat | Compact | 17.6 | 20.0 | 22.4 | 21 | 30 | None | Front | 4 | 2.0 | 134 | 5800 | 2685 | Yes | 18.5 | 5 | 180 | 103 | 67 | 35 | 31.5 | 14 | 2985 | non-USA | Volkswagen Passat |
4 | Volkswagen | Corrado | Sporty | 22.9 | 23.3 | 23.7 | 18 | 25 | None | Front | 6 | 2.8 | 178 | 5800 | 2385 | Yes | 18.5 | 4 | 159 | 97 | 66 | 36 | 26.0 | 15 | 2810 | non-USA | Volkswagen Corrado |
This approach lends itself to the description “find all rows matching some value” then “extract the identified rows” – written as two steps to emphasize there are two passes through the data. Another mental model would be to loop over the rows and keep those that match the query. This is done generically by the filter function for collections in Julia, or by the subset function of DataFrames.
The filter(predicate, collection) function is used to identify just the values in the collection for which the predicate function returns true. When a data frame is used with filter, the iteration is over the rows, so the wrapping eachrow iterator is not needed. We need a predicate function to replace the .== above; one follows. It doesn’t need .==, as r is a data frame row and access produces a value, not a vector:
filter(r -> r.Manufacturer == "Volkswagen", cars)
Row | Manufacturer | Model | Type | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | AirBags | DriveTrain | Cylinders | EngineSize | Horsepower | RPM | RevPerMile | ManTransAvail | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | RearSeatRoom | LuggageRoom | Weight | Origin | Make |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cat… | String | Cat… | Float64 | Float64 | Float64 | Int32 | Int32 | Cat… | Cat… | Cat… | Float64 | Int32 | Int32 | Int32 | Cat… | Float64 | Int32 | Int32 | Int32 | Int32 | Int32 | Float64? | Int32? | Int32 | Cat… | String | |
1 | Volkswagen | Fox | Small | 8.7 | 9.1 | 9.5 | 25 | 33 | None | Front | 4 | 1.8 | 81 | 5500 | 2550 | Yes | 12.4 | 4 | 163 | 93 | 63 | 34 | 26.0 | 10 | 2240 | non-USA | Volkswagen Fox |
2 | Volkswagen | Eurovan | Van | 16.6 | 19.7 | 22.7 | 17 | 21 | None | Front | 5 | 2.5 | 109 | 4500 | 2915 | Yes | 21.1 | 7 | 187 | 115 | 72 | 38 | 34.0 | missing | 3960 | non-USA | Volkswagen Eurovan |
3 | Volkswagen | Passat | Compact | 17.6 | 20.0 | 22.4 | 21 | 30 | None | Front | 4 | 2.0 | 134 | 5800 | 2685 | Yes | 18.5 | 5 | 180 | 103 | 67 | 35 | 31.5 | 14 | 2985 | non-USA | Volkswagen Passat |
4 | Volkswagen | Corrado | Sporty | 22.9 | 23.3 | 23.7 | 18 | 25 | None | Front | 6 | 2.8 | 178 | 5800 | 2385 | Yes | 18.5 | 4 | 159 | 97 | 66 | 36 | 26.0 | 15 | 2810 | non-USA | Volkswagen Corrado |
There are some conveniences that have been developed to make the predicate functions even easier to write, which will be introduced incrementally:
The basic binary comparisons, like ==
, are defined by functions essentially of the form (a,b) -> ==(a,b)
. There are partially applied versions, essentially a -> ==(a,b)
, with b
applied, formed by ==(b)
. So the syntax ==("Volkswagen")
is a predicate taking a single argument which is then compared for equality to the string "Volkswagen"
.
However, r
is a data frame row, what we need to pass to ==("Volkswagen")
is the value of the :Manufacturer
column. For this another convenience is available.
The pair
syntax, =>
of the form column selector(s) => predicate function
will pass just the value(s) in the column(s) to the predicate function. Combined, the above can be simplified to:
filter(:Manufacturer => ==("Volkswagen"), cars)
Row | Manufacturer | Model | Type | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | AirBags | DriveTrain | Cylinders | EngineSize | Horsepower | RPM | RevPerMile | ManTransAvail | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | RearSeatRoom | LuggageRoom | Weight | Origin | Make |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cat… | String | Cat… | Float64 | Float64 | Float64 | Int32 | Int32 | Cat… | Cat… | Cat… | Float64 | Int32 | Int32 | Int32 | Cat… | Float64 | Int32 | Int32 | Int32 | Int32 | Int32 | Float64? | Int32? | Int32 | Cat… | String | |
1 | Volkswagen | Fox | Small | 8.7 | 9.1 | 9.5 | 25 | 33 | None | Front | 4 | 1.8 | 81 | 5500 | 2550 | Yes | 12.4 | 4 | 163 | 93 | 63 | 34 | 26.0 | 10 | 2240 | non-USA | Volkswagen Fox |
2 | Volkswagen | Eurovan | Van | 16.6 | 19.7 | 22.7 | 17 | 21 | None | Front | 5 | 2.5 | 109 | 4500 | 2915 | Yes | 21.1 | 7 | 187 | 115 | 72 | 38 | 34.0 | missing | 3960 | non-USA | Volkswagen Eurovan |
3 | Volkswagen | Passat | Compact | 17.6 | 20.0 | 22.4 | 21 | 30 | None | Front | 4 | 2.0 | 134 | 5800 | 2685 | Yes | 18.5 | 5 | 180 | 103 | 67 | 35 | 31.5 | 14 | 2985 | non-USA | Volkswagen Passat |
4 | Volkswagen | Corrado | Sporty | 22.9 | 23.3 | 23.7 | 18 | 25 | None | Front | 6 | 2.8 | 178 | 5800 | 2385 | Yes | 18.5 | 4 | 159 | 97 | 66 | 36 | 26.0 | 15 | 2810 | non-USA | Volkswagen Corrado |
It doesn’t save much for this simple example, but this specification style is the basis of the DataFrames mini language, which provides a standardized and performant means of wrangling a data frame.
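The subset function mentioned earlier uses the same pairing; a minimal sketch on made-up data (note the predicate receives whole columns, so ByRow adapts a scalar predicate):

```julia
using DataFrames

df = DataFrame(maker = ["VW", "VW", "Honda"], mpg = [25, 17, 30])
# subset pairs a column selector with a predicate, like filter, but the
# predicate is applied to whole columns; ByRow wraps a per-value test
subset(df, :maker => ByRow(==("VW")))
```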
Filtering may be done on one or more values. There are different approaches.
For example, to first filter by the manufacturer, then by city mileage, we might write a more complicated predicate function using &&
to combine Boolean values:
pred(r) = r.Manufacturer == "Volkswagen" && r.MPGCity >= 20
filter(pred, cars)
Row | Manufacturer | Model | Type | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | AirBags | DriveTrain | Cylinders | EngineSize | Horsepower | RPM | RevPerMile | ManTransAvail | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | RearSeatRoom | LuggageRoom | Weight | Origin | Make |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cat… | String | Cat… | Float64 | Float64 | Float64 | Int32 | Int32 | Cat… | Cat… | Cat… | Float64 | Int32 | Int32 | Int32 | Cat… | Float64 | Int32 | Int32 | Int32 | Int32 | Int32 | Float64? | Int32? | Int32 | Cat… | String | |
1 | Volkswagen | Fox | Small | 8.7 | 9.1 | 9.5 | 25 | 33 | None | Front | 4 | 1.8 | 81 | 5500 | 2550 | Yes | 12.4 | 4 | 163 | 93 | 63 | 34 | 26.0 | 10 | 2240 | non-USA | Volkswagen Fox |
2 | Volkswagen | Passat | Compact | 17.6 | 20.0 | 22.4 | 21 | 30 | None | Front | 4 | 2.0 | 134 | 5800 | 2685 | Yes | 18.5 | 5 | 180 | 103 | 67 | 35 | 31.5 | 14 | 2985 | non-USA | Volkswagen Passat |
The mini language can also be used. On the left hand side of (the first) =>
a vector of column selectors may be indicated, in which the predicate function (on the right hand side of the first =>
) will get multiple arguments that can be used to implement the logic:
pred(m, mpg) = m == "Volkswagen" && mpg >= 20
filter([:Manufacturer, :MPGCity] => pred, cars)
Row | Manufacturer | Model | Type | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | AirBags | DriveTrain | Cylinders | EngineSize | Horsepower | RPM | RevPerMile | ManTransAvail | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | RearSeatRoom | LuggageRoom | Weight | Origin | Make |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cat… | String | Cat… | Float64 | Float64 | Float64 | Int32 | Int32 | Cat… | Cat… | Cat… | Float64 | Int32 | Int32 | Int32 | Cat… | Float64 | Int32 | Int32 | Int32 | Int32 | Int32 | Float64? | Int32? | Int32 | Cat… | String | |
1 | Volkswagen | Fox | Small | 8.7 | 9.1 | 9.5 | 25 | 33 | None | Front | 4 | 1.8 | 81 | 5500 | 2550 | Yes | 12.4 | 4 | 163 | 93 | 63 | 34 | 26.0 | 10 | 2240 | non-USA | Volkswagen Fox |
2 | Volkswagen | Passat | Compact | 17.6 | 20.0 | 22.4 | 21 | 30 | None | Front | 4 | 2.0 | 134 | 5800 | 2685 | Yes | 18.5 | 5 | 180 | 103 | 67 | 35 | 31.5 | 14 | 2985 | non-USA | Volkswagen Passat |
Finally, we could use filter
twice:
cars1 = filter(:Manufacturer => ==("Volkswagen"), cars)
cars2 = filter(:MPGCity => >=(20), cars1)
Row | Manufacturer | Model | Type | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | AirBags | DriveTrain | Cylinders | EngineSize | Horsepower | RPM | RevPerMile | ManTransAvail | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | RearSeatRoom | LuggageRoom | Weight | Origin | Make |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cat… | String | Cat… | Float64 | Float64 | Float64 | Int32 | Int32 | Cat… | Cat… | Cat… | Float64 | Int32 | Int32 | Int32 | Cat… | Float64 | Int32 | Int32 | Int32 | Int32 | Int32 | Float64? | Int32? | Int32 | Cat… | String | |
1 | Volkswagen | Fox | Small | 8.7 | 9.1 | 9.5 | 25 | 33 | None | Front | 4 | 1.8 | 81 | 5500 | 2550 | Yes | 12.4 | 4 | 163 | 93 | 63 | 34 | 26.0 | 10 | 2240 | non-USA | Volkswagen Fox |
2 | Volkswagen | Passat | Compact | 17.6 | 20.0 | 22.4 | 21 | 30 | None | Front | 4 | 2.0 | 134 | 5800 | 2685 | Yes | 18.5 | 5 | 180 | 103 | 67 | 35 | 31.5 | 14 | 2985 | non-USA | Volkswagen Passat |
The above required the introduction of an intermediate data frame to store the result of the first filter
call to pass to the second. This threading through of the modified data is quite common in processing pipelines. The first two approaches with complicated predicate functions can grow unwieldy, so staged modification is common. To support that, the chaining or piping operation (|>
) is often used:
filter(:Manufacturer => ==("Volkswagen"), cars) |>
    d -> filter(:MPGCity => >=(20), d)
Row | Manufacturer | Model | Type | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | AirBags | DriveTrain | Cylinders | EngineSize | Horsepower | RPM | RevPerMile | ManTransAvail | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | RearSeatRoom | LuggageRoom | Weight | Origin | Make |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cat… | String | Cat… | Float64 | Float64 | Float64 | Int32 | Int32 | Cat… | Cat… | Cat… | Float64 | Int32 | Int32 | Int32 | Cat… | Float64 | Int32 | Int32 | Int32 | Int32 | Int32 | Float64? | Int32? | Int32 | Cat… | String | |
1 | Volkswagen | Fox | Small | 8.7 | 9.1 | 9.5 | 25 | 33 | None | Front | 4 | 1.8 | 81 | 5500 | 2550 | Yes | 12.4 | 4 | 163 | 93 | 63 | 34 | 26.0 | 10 | 2240 | non-USA | Volkswagen Fox |
2 | Volkswagen | Passat | Compact | 17.6 | 20.0 | 22.4 | 21 | 30 | None | Front | 4 | 2.0 | 134 | 5800 | 2685 | Yes | 18.5 | 5 | 180 | 103 | 67 | 35 | 31.5 | 14 | 2985 | non-USA | Volkswagen Passat |
The right-hand side of |>
expects a function for application, so an anonymous function is created. There are macros available in user-contributed packages that shorten this pattern by using a placeholder.
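The mechanics of |> can be seen with plain vectors, independent of data frames. This is a minimal sketch (the values here are made up for illustration); an anonymous function names the intermediate value of a staged computation:

```julia
# Base's |> passes the left-hand value to the function on its right.
xs = [25, 17, 21, 33]

bigs = filter(>=(20), xs) |>        # keep values at least 20 ...
       v -> filter(<(30), v)        # ... then, of those, keep values below 30
```

Each stage receives the previous stage's result, so no intermediate variable needs to be introduced.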
A popular one is @chain
, in the Chain
package, with its basic usage based on two simple principles:
- each new line is implicitly a stage of the pipe
- a
_
receives the value from the previous stage; if no _
is given, the value is passed as the first argument
For example,
using Chain
@chain cars begin
filter(:Manufacturer => ==("Volkswagen"), _)
filter(:MPGCity => >=(20), _)
end
Row | Manufacturer | Model | Type | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | AirBags | DriveTrain | Cylinders | EngineSize | Horsepower | RPM | RevPerMile | ManTransAvail | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | RearSeatRoom | LuggageRoom | Weight | Origin | Make |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cat… | String | Cat… | Float64 | Float64 | Float64 | Int32 | Int32 | Cat… | Cat… | Cat… | Float64 | Int32 | Int32 | Int32 | Cat… | Float64 | Int32 | Int32 | Int32 | Int32 | Int32 | Float64? | Int32? | Int32 | Cat… | String | |
1 | Volkswagen | Fox | Small | 8.7 | 9.1 | 9.5 | 25 | 33 | None | Front | 4 | 1.8 | 81 | 5500 | 2550 | Yes | 12.4 | 4 | 163 | 93 | 63 | 34 | 26.0 | 10 | 2240 | non-USA | Volkswagen Fox |
2 | Volkswagen | Passat | Compact | 17.6 | 20.0 | 22.4 | 21 | 30 | None | Front | 4 | 2.0 | 134 | 5800 | 2685 | Yes | 18.5 | 5 | 180 | 103 | 67 | 35 | 31.5 | 14 | 2985 | non-USA | Volkswagen Passat |
Subset
The calling style filter(predicate, collection)
is consistently used with several higher-order functions, for example, map
and reduce
and is supported by Julia
’s do
syntax.
However, with data frames, the convention is to have action functions expect the data frame in the first position.
The subset
function also returns a filtered copy of the data frame with rows matching certain selection criteria. Its usage is somewhat similar to filter
with the order of its arguments reversed, but the transformation must return a Boolean vector with one value per row, so broadcasting or the ByRow
function may be necessary.
For example, these produce the same data frame:
d1 = subset(cars, :Manufacturer => m -> m .== "Volkswagen")      # use .== not just ==
d2 = subset(cars, :Manufacturer => ByRow(==("Volkswagen")))      # use `ByRow` to ensure the predicate is applied to each row
Row | Manufacturer | Model | Type | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | AirBags | DriveTrain | Cylinders | EngineSize | Horsepower | RPM | RevPerMile | ManTransAvail | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | RearSeatRoom | LuggageRoom | Weight | Origin | Make |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cat… | String | Cat… | Float64 | Float64 | Float64 | Int32 | Int32 | Cat… | Cat… | Cat… | Float64 | Int32 | Int32 | Int32 | Cat… | Float64 | Int32 | Int32 | Int32 | Int32 | Int32 | Float64? | Int32? | Int32 | Cat… | String | |
1 | Volkswagen | Fox | Small | 8.7 | 9.1 | 9.5 | 25 | 33 | None | Front | 4 | 1.8 | 81 | 5500 | 2550 | Yes | 12.4 | 4 | 163 | 93 | 63 | 34 | 26.0 | 10 | 2240 | non-USA | Volkswagen Fox |
2 | Volkswagen | Eurovan | Van | 16.6 | 19.7 | 22.7 | 17 | 21 | None | Front | 5 | 2.5 | 109 | 4500 | 2915 | Yes | 21.1 | 7 | 187 | 115 | 72 | 38 | 34.0 | missing | 3960 | non-USA | Volkswagen Eurovan |
3 | Volkswagen | Passat | Compact | 17.6 | 20.0 | 22.4 | 21 | 30 | None | Front | 4 | 2.0 | 134 | 5800 | 2685 | Yes | 18.5 | 5 | 180 | 103 | 67 | 35 | 31.5 | 14 | 2985 | non-USA | Volkswagen Passat |
4 | Volkswagen | Corrado | Sporty | 22.9 | 23.3 | 23.7 | 18 | 25 | None | Front | 6 | 2.8 | 178 | 5800 | 2385 | Yes | 18.5 | 4 | 159 | 97 | 66 | 36 | 26.0 | 15 | 2810 | non-USA | Volkswagen Corrado |
Unlike filter
, subset
allows multiple arguments to be applied:
subset(cars,
:Manufacturer => ByRow(==("Volkswagen")),
:MPGCity => ByRow(>=(20)))
Row | Manufacturer | Model | Type | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | AirBags | DriveTrain | Cylinders | EngineSize | Horsepower | RPM | RevPerMile | ManTransAvail | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | RearSeatRoom | LuggageRoom | Weight | Origin | Make |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cat… | String | Cat… | Float64 | Float64 | Float64 | Int32 | Int32 | Cat… | Cat… | Cat… | Float64 | Int32 | Int32 | Int32 | Cat… | Float64 | Int32 | Int32 | Int32 | Int32 | Int32 | Float64? | Int32? | Int32 | Cat… | String | |
1 | Volkswagen | Fox | Small | 8.7 | 9.1 | 9.5 | 25 | 33 | None | Front | 4 | 1.8 | 81 | 5500 | 2550 | Yes | 12.4 | 4 | 163 | 93 | 63 | 34 | 26.0 | 10 | 2240 | non-USA | Volkswagen Fox |
2 | Volkswagen | Passat | Compact | 17.6 | 20.0 | 22.4 | 21 | 30 | None | Front | 4 | 2.0 | 134 | 5800 | 2685 | Yes | 18.5 | 5 | 180 | 103 | 67 | 35 | 31.5 | 14 | 2985 | non-USA | Volkswagen Passat |
The mini language uses a default of identity
for the function, so if the columns are Boolean, they can be used to subset the data:
dd = DataFrame(a = [true, false], b = [true, missing])
subset(dd, :a) # just first row
Row | a | b |
---|---|---|
Bool | Bool? | |
1 | true | true |
For data with missing values, like in the :b
column of dd
, the comparison operators implement 3-valued logic, meaning the response can be true
, false
, or missing
. Using subset(dd, :b)
above will error. The keyword skipmissing
can be passed true
to have these skipped over. The default is false.
subset(dd, :b; skipmissing=true)
Row | a | b |
---|---|---|
Bool | Bool? | |
1 | true | true |
The dropmissing
function filters out each row that has a missing
value.
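The three-valued logic behind this behavior can be seen without data frames; this sketch uses only Base functionality:

```julia
b = [true, missing, false]

missing >= 1             # comparisons involving missing yield missing, not true/false
coalesce.(b, false)      # one option: replace missing with false explicitly
collect(skipmissing(b))  # another option: drop the missing entries entirely
```

The skipmissing=true keyword to subset corresponds to the second approach: rows whose predicate evaluates to missing are treated as not selected.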
3.1.6 Selecting columns; transforming data
Filtering of a data frame can be viewed as accessing the data in df
with a boolean set of row indices, as in df[inds, :]
. (A "view" of the data, avoiding a copy, can be had with view(df, inds, :) or the @view macro.) Filtering returns a data frame with a possibly reduced number of rows, and the same number of columns.
The select
function is related in that it can be viewed as passing a Boolean vector of column indices to select specific columns or variables, as in df[:, inds]
. The syntax of select
also extends this allowing transformations of the data and reassignment, as could be achieved through this pattern: df.new_variable = f(other_variables_in_df)
.
The select
function will return a data frame with the same number of rows (cases) as the original data frame. The variables returned are selected using column selectors. There is support for the DataFrames
mini language.
First, we can select one or more columns by separating column selectors with commas. In the following, using the backpack data in the variable d
, we select the first column by column number, the second using a symbol, the third by a string, and the fourth and fifth using a regular expression:2
select(d, 1, :v, "p", r"^l")
Row | b | v | p | lap | load |
---|---|---|---|---|---|
String | Int64 | Float64? | Cat… | Cat… | |
1 | Goruck GR2 | 40 | 395.99 | Y | front panel |
2 | Minaal 3.0 | 35 | 349.99 | Y | front panel |
3 | Genius | 25 | 228.99 | Y | clamshell |
One use of select
is to rearrange the column order, say by moving a column to the front. The :
or All()
column selector will add the rest. That is this command would move “d
” to the front of the other columns: select(d, :d, All())
.
Selection also allows an easy renaming of column names, using the pairs notation column name => new column name
:
select(d, :v => :volume, "p" => "price")
Row | volume | price |
---|---|---|
Int64 | Float64? | |
1 | 40 | 395.99 |
2 | 35 | 349.99 |
3 | 25 | 228.99 |
This usage is a supported abbreviation of source_column => function => target_column_name
, where the function is identity
. Here the function is not necessarily a predicate function, as it is with subset
, but rather a function which returns a vector of the same length as the number of rows in the data frame (as select
always returns the same number of rows as the data frame it is passed).
The function receives column vectors and should return a column vector. For a scalar function, it must be applied to each entry of the column vector, that is, each row. Broadcasting is one manner to do this. As used previously, the DataFrames
package provides ByRow
to also apply a function to each row of the column vector(s).
For example to compute a ratio of price to volume, we might have:
select(d, [:p, :v] => ByRow((p,v) -> p/v) => "price to volume")
Row | price to volume |
---|---|
Float64 | |
1 | 9.89975 |
2 | 9.99971 |
3 | 9.1596 |
The above would be equivalent to select(d, [:p, :v] => ((p,v) -> p./v) => "price to volume")
(broadcasted p ./ v
and parentheses), but not select(d, [:p, :v] => (p,v) -> p./v => "price to volume")
, as the precedence rules for =>
do not parse the expressions identically. That is, the use of parentheses around an anonymous function is needed.
The specification of a target column name is not necessary, as one will be generated. In this particular example, it would be :p_v_function
. For named functions, the automatic namings are a bit more informative, as the method name replaces the “function
” part. The renamecols
keyword argument can be set to false
to drop the addition of a function name, which in this case would leave :p_v
as the name. For a single-column selector it would not rename the column, rather replace it.
The above reduced the number of columns to just that specified. The :
selector will select all columns (as it does with indexing) and return them:
select(d, [:p, :v] => ByRow((p,v) -> p/v) => "price to volume", :) # return all columns
Row | price to volume | b | v | p | lap | load | d |
---|---|---|---|---|---|---|---|
Float64 | String | Int64 | Float64? | Cat… | Cat… | Date | |
1 | 9.89975 | Goruck GR2 | 40 | 395.99 | Y | front panel | 2022-09-01 |
2 | 9.99971 | Minaal 3.0 | 35 | 349.99 | Y | front panel | 2022-09-01 |
3 | 9.1596 | Genius | 25 | 228.99 | Y | clamshell | 2022-10-01 |
As seen in this last example, the source column selectors can select more than one column. The function receives multiple arguments. It is possible that there be multiple target columns. For example, we could compute both price-to-volume and volume-to-price quantities. A function to do so would be
pv_and_vp(p,v) = [p/v, v/p]
pv_and_vp (generic function with 1 method)
When we use this, we see the two values are stored in one column.
select(d, [:p, :v] => ByRow(pv_and_vp))
Row | p_v_pv_and_vp |
---|---|
Array… | |
1 | [9.89975, 0.101013] |
2 | [9.99971, 0.100003] |
3 | [9.1596, 0.109175] |
To split these up, we can give two names:
select(d, [:p, :v] => ByRow(pv_and_vp) => [:pv, :vp])
Row | pv | vp |
---|---|---|
Float64 | Float64 | |
1 | 9.89975 | 0.101013 |
2 | 9.99971 | 0.100003 |
3 | 9.1596 | 0.109175 |
AsTable
The AsTable
function has two meanings in the mini language. If on the target side, it assumes the output of the function is a vector of containers (in this example the vectors [p/v, v/p]
) that should be expanded into multiple columns. It uses the keys (when the containers are named tuples) or, as in this case, generates them, to create multiple columns in the destination:
select(d, [:p, :v] => ByRow(pv_and_vp) => AsTable)
Row | x1 | x2 |
---|---|---|
Float64 | Float64 | |
1 | 9.89975 | 0.101013 |
2 | 9.99971 | 0.100003 |
3 | 9.1596 | 0.109175 |
Had we used a named tuple in our function, pv_and_vp(p, v) = (pv = p/v, vp = v/p)
, then the column names would come from the tuple’s names and need not be generated.
When AsTable
is used on the source columns, as in AsTable([:p,:v])
then the function receives a named tuple with column vector entries. (For this, pv_and_vp
would be written pv_and_vp(r) = [r.p/r.v, r.v/r.p]
, with r
representing a row in a two-column table with names :p
and :v
.)
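A named tuple can stand in for such a row to check the rewritten function directly. This is a sketch; the row below is made up for illustration and simply mimics what ByRow with AsTable would pass:

```julia
# With AsTable on the source side (and ByRow), the function sees one
# named tuple per row, with fields named after the selected columns.
pv_and_vp(r) = [r.p / r.v, r.v / r.p]

row = (p = 400.0, v = 40)   # a hypothetical row with fields :p and :v
pv_and_vp(row)
</imports>

```

Testing a row function this way, outside of select, is often the quickest path to debugging a mini-language transformation.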
Transform
Extending the columns in the data frame by select
is common enough that the function transform
is supplied, which always keeps the columns of the original data frame, though they can also be modified through the mini language. The use of transform is equivalent to select(df, :, args...)
.
The mini language is well documented in the documentation string of transform
and the blog post of Bogumił Kamiński, "DataFrames.jl minilanguage explained."
3.1.7 Combine
The combine
function creates a new data frame whose columns are the result of transformations applied to the source data frame. The name will be clearer after we discuss grouping or splitting of a data set.
A typical usage of combine is to apply some reduction to the data, for example finding an average.
In this example, we use the mean
function from the StatsBase
package and apply that to the :MPGCity
measurements in the cars
data:
combine(cars, :MPGCity => mean)
Row | MPGCity_mean |
---|---|
Float64 | |
1 | 22.3656 |
(A second pair can be used to specify a name for the target, as in :MPGCity => mean => :AvgMPGCity
)
There are several numeric variables in this data set; we may wish to find the mean of more than one at a time. For this, we can broadcast the columns via .=>
, as in this example:
combine(cars, [:MPGCity, :MPGHighway] .=> mean)
Row | MPGCity_mean | MPGHighway_mean |
---|---|---|
Float64 | Float64 | |
1 | 22.3656 | 29.086 |
(That is an alternate to combine(cars, :MPGCity => mean, :MPGHighway => mean)
.)
Broadcasting =>
lends itself to succinctly applying a function to multiple columns.
For example, to apply mean
to all the numeric values, we might select them first (using the filtering feature of names
), then pass to the mini language, as follows:
nms = names(cars, Real)
combine(cars, nms .=> mean; renamecols=false)
Row | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | EngineSize | Horsepower | RPM | RevPerMile | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | Weight |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 17.1258 | 19.5097 | 21.8989 | 22.3656 | 29.086 | 2.66774 | 143.828 | 5280.65 | 2332.2 | 16.6645 | 5.08602 | 183.204 | 103.946 | 69.3763 | 38.957 | 3072.9 |
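Broadcasting .=> is ordinary Julia broadcasting, nothing specific to data frames: it produces a vector of pairs, which combine accepts as a single argument. A quick check in plain Julia (sum here is just a stand-in function):

```julia
# `.=>` pairs each element of the vector with the same right-hand value
cols  = [:MPGCity, :MPGHighway]
specs = cols .=> sum       # a Vector of Pair{Symbol, typeof(sum)}
```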
More than one transformation at a time is possible, as already seen in the mini language. With combine
the number of rows is determined by the output of the transformation. In this example, the mean
reduces to a number, but unique
reduces :Origin
to the number of levels. The result of mean
is repeated; it is not the mean for each group (a task we demonstrate shortly):
combine(cars, :MPGCity => mean, :Origin => unique)
Row | MPGCity_mean | Origin_unique |
---|---|---|
Float64 | String | |
1 | 22.3656 | non-USA |
2 | 22.3656 | USA |
3.1.8 Flatten
As mentioned, the repeat
function is used to repeat values in a vector, or other iterable. The simplest usage looks like this:
repeat([1,2,3], 2) |> show
[1, 2, 3, 1, 2, 3]
A common use of repetition is to take two variables, say x
and y
, perhaps of unequal length and create a data frame with two columns: one to store the values and one to indicate if a value came from x
or y
. This could be achieved like
x = [1,2,3]
y = [4, 5]

DataFrame(variable = vcat(repeat([:x], length(x)), repeat([:y], length(y))),
          values   = vcat(x, y))
Row | variable | values |
---|---|---|
Symbol | Int64 | |
1 | x | 1 |
2 | x | 2 |
3 | x | 3 |
4 | y | 4 |
5 | y | 5 |
The flatten
function offers an alternative. Consider the data frame:
x = [1,2,3]
y = [4, 5]
dxy = DataFrame(variable=[:x,:y], value=[x, y])
Row | variable | value |
---|---|---|
Symbol | Array… | |
1 | x | [1, 2, 3] |
2 | y | [4, 5] |
The flatten
function will iterate over the indicated columns and repeat the other variables:
flatten(dxy, :value)
Row | variable | value |
---|---|---|
Symbol | Int64 | |
1 | x | 1 |
2 | x | 2 |
3 | x | 3 |
4 | y | 4 |
5 | y | 5 |
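What flatten does to the indicated column can be sketched in Base: each value in the other columns is repeated to the length of its container, and the containers themselves are concatenated:

```julia
variable = [:x, :y]
value    = [[1, 2, 3], [4, 5]]

ks = reduce(vcat, fill.(variable, length.(value)))   # repeat each key to match its container
vs = reduce(vcat, value)                             # concatenate the containers
```

The two resulting vectors have equal length and line up row by row, just as in the flattened data frame.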
3.1.9 SplitApplyCombine
The split-apply-combine strategy for wrangling data frames was popularized in the influential paper "The Split-Apply-Combine Strategy for Data Analysis" (Wickham 2011) by H. Wickham, which introduces this description: "where you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together."
Grouping data
The groupby
function for data frames does the breaking up, the DataFrames
mini language can operate on each piece, and the combine
function can put things together again.
Grouping takes one or more categorical variables and creates a "grouped" data frame for each different set of combined levels. That is, there is one sub data frame for each observed combination of the levels. The groupby
function splits the data frame into pieces of type GroupedDataFrame
. The basic signature is groupby(df, cols; sort=nothing, skipmissing=false)
. The cols
are one or more column selectors. The value of sort
defaults to nothing
leaving the order of the groups in the result undefined, but the grouping uses the fastest identified algorithm. The skipmissing
argument can instruct the skipping of rows which have missing values in the selected columns.
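The splitting step can be mimicked in Base with a dictionary keyed by the grouping value. This is only a sketch; the rows below are made-up named tuples echoing the backpack data:

```julia
rows = [(load = "front panel", v = 40),
        (load = "front panel", v = 35),
        (load = "clamshell",   v = 25)]

# collect the rows into one vector per distinct key
groups = Dict{String, Vector{eltype(rows)}}()
for r in rows
    push!(get!(groups, r.load, eltype(rows)[]), r)
end
```

A GroupedDataFrame plays the same role, but stores row indices into the parent data frame rather than copies of the rows.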
For example, grouping the baggage data by loading style, returns two sub data frames.
gdf = groupby(d, :load)
GroupedDataFrame with 2 groups based on key: load
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Float64? | Cat… | Cat… | Date | |
1 | Goruck GR2 | 40 | 395.99 | Y | front panel | 2022-09-01 |
2 | Minaal 3.0 | 35 | 349.99 | Y | front panel | 2022-09-01 |
⋮
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Float64? | Cat… | Cat… | Date | |
1 | Genius | 25 | 228.99 | Y | clamshell | 2022-10-01 |
Grouping by more than one variable is achieved by passing in a selector with multiple variables. The following also creates two sub data frames, but only because the observed combinations of load and date coincide with the levels of load alone:
gdf = groupby(d, [:load, :d])
GroupedDataFrame with 2 groups based on keys: load, d
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Float64? | Cat… | Cat… | Date | |
1 | Goruck GR2 | 40 | 395.99 | Y | front panel | 2022-09-01 |
2 | Minaal 3.0 | 35 | 349.99 | Y | front panel | 2022-09-01 |
⋮
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Float64? | Cat… | Cat… | Date | |
1 | Genius | 25 | 228.99 | Y | clamshell | 2022-10-01 |
Indices may be used to get each sub data frame. Moreover, the keys
method returns keys, like indices, to efficiently look up the different sub data frames in gdf
. The levels for the keys are found in the property .load
, as with keys(gdf)[1].load
. Grouped data frames may be iterated over; the pairs
iterator iterates over both keys and values.
The select
and subset
functions work over GroupedDataFrame
objects; here we see the returned values are combined:
subset(gdf, :p => ByRow(<=(350)))
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Float64? | Cat… | Cat… | Date | |
1 | Minaal 3.0 | 35 | 349.99 | Y | front panel | 2022-09-01 |
2 | Genius | 25 | 228.99 | Y | clamshell | 2022-10-01 |
(The command [subset(g, :p => ByRow(<=(350))) for g in gdf]
avoids the combine
step.)
Group and combine
We can see that combine
also works over GroupedDataFrame
objects. This example shows how to find the mean city mileage of groups formed by the value of :Origin
in the cars
data set:
combine(groupby(cars, :Origin), :MPGCity => mean)
Row | Origin | MPGCity_mean |
---|---|---|
Cat… | Float64 | |
1 | USA | 20.9583 |
2 | non-USA | 23.8667 |
Typically, the function applied to a column is a reduction, which results in aggregated data.
By default the keys used to group (:Origin
) are kept by combine
. The keepkeys
argument for combine
when applied to grouped data controls this feature.
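The whole split-apply-combine round trip for this sort of aggregation can be sketched in Base (with the standard-library Statistics package for mean); the vectors below are made up for illustration:

```julia
using Statistics

origin = ["USA", "non-USA", "USA", "non-USA"]
mpg    = [20, 25, 22, 23]

# split on the key, apply mean to each piece, combine into key => value pairs
avgs = Dict(k => mean(mpg[origin .== k]) for k in unique(origin))
```

combine over a grouped data frame does the same thing, but returns a data frame with the key column(s) kept alongside the aggregated values.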
More than one column can be used to group the data. This example first groups by the number of cylinders, then groups by origin. For each of the 9 resulting groups, the group mean for city mileage is taken:
combine(groupby(cars, [:Cylinders, :Origin]), :MPGCity => mean)
Row | Cylinders | Origin | MPGCity_mean |
---|---|---|---|
Cat… | Cat… | Float64 | |
1 | 3 | non-USA | 39.3333 |
2 | 4 | USA | 24.4091 |
3 | 4 | non-USA | 25.2222 |
4 | 5 | non-USA | 18.5 |
5 | 6 | USA | 18.35 |
6 | 6 | non-USA | 18.5455 |
7 | 8 | USA | 17.0 |
8 | 8 | non-USA | 17.0 |
9 | rotary | non-USA | 17.0 |
The @chain
macro may make this more readable:
@chain cars begin
    groupby([:Cylinders, :Origin])
    combine(:MPGCity => mean)
    describe
end
Row | variable | mean | min | median | max | nmissing | eltype |
---|---|---|---|---|---|---|---|
Symbol | Union… | Any | Union… | Any | Int64 | DataType | |
1 | Cylinders | 3 | rotary | 0 | CategoricalValue{String, UInt8} | ||
2 | Origin | USA | non-USA | 0 | CategoricalValue{String, UInt8} | ||
3 | MPGCity_mean | 21.7067 | 17.0 | 18.5 | 39.3333 | 0 | Float64 |
Example 3.1 (Lego sets) For an example, we download a data set from the internet describing all the sets of Legos sold during some time period. This data set is part of the data sets provided by openintro.org
, an organization trying to make interesting educational products that are free and transparent.
csv_data = download("https://www.openintro.org/data/csv/lego_population.csv")
legos = CSV.read(csv_data, DataFrame)
first(legos, 2)
Row | item_number | set_name | theme | pieces | price | amazon_price | year | ages | pages | minifigures | packaging | weight | unique_pieces | size |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Int64 | String | String31 | String7 | String31 | String31 | Int64 | String15 | String7 | String3 | String31 | String31 | String7 | String7 | |
1 | 41916 | Extra Dots - Series 2 | DOTS | 109 | 3.99 | 3.44 | 2020 | Ages_6+ | NA | NA | Foil pack | NA | 6 | Small |
2 | 41908 | Extra Dots - Series 1 | DOTS | 109 | 3.99 | 3.99 | 2020 | Ages_6+ | NA | NA | Foil pack | NA | 6 | Small |
Let’s explore this data.
The size of the data set is printed as part of the display, though suppressed above, as only the first 2 rows were requested to be shown; it can also be found programmatically with size(legos).
size(legos)
(1304, 14)
Each row is a product; there are a lot of different ones.
The types identified by CSV.read
are not perfect. The data uses NA
for not-available. We read again using the argument missingstring="NA"
:
legos = CSV.read(csv_data, DataFrame; missingstring="NA")
first(legos, 2) # show first 2 rows
Row | item_number | set_name | theme | pieces | price | amazon_price | year | ages | pages | minifigures | packaging | weight | unique_pieces | size |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Int64 | String | String31? | Int64? | Float64? | Float64? | Int64 | String15 | Int64? | Int64? | String31? | String31? | Int64? | String7? | |
1 | 41916 | Extra Dots - Series 2 | DOTS | 109 | 3.99 | 3.44 | 2020 | Ages_6+ | missing | missing | Foil pack | missing | 6 | Small |
2 | 41908 | Extra Dots - Series 1 | DOTS | 109 | 3.99 | 3.99 | 2020 | Ages_6+ | missing | missing | Foil pack | missing | 6 | Small |
The :theme
and :packaging
variables are categorical, so we make them so:
legos.theme = categorical(legos.theme)
legos.packaging = categorical(legos.packaging);
first(legos, 2)
Row | item_number | set_name | theme | pieces | price | amazon_price | year | ages | pages | minifigures | packaging | weight | unique_pieces | size |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Int64 | String | Cat…? | Int64? | Float64? | Float64? | Int64 | String15 | Int64? | Int64? | Cat…? | String31? | Int64? | String7? | |
1 | 41916 | Extra Dots - Series 2 | DOTS | 109 | 3.99 | 3.44 | 2020 | Ages_6+ | missing | missing | Foil pack | missing | 6 | Small |
2 | 41908 | Extra Dots - Series 1 | DOTS | 109 | 3.99 | 3.99 | 2020 | Ages_6+ | missing | missing | Foil pack | missing | 6 | Small |
We see data on the number of pieces and the age range. Is there some relationship?
The age variable should be an ordered factor, ordered by the youngest age intended for use. The :ages
variable has values following the pattern Ages_start…
, where the starting age is followed either by +
(to indicate that age or up) or by a dash and an ending age (to indicate a range). We use this to identify the youngest intended age.
# alternative to
# `(m = match(r"Ages_(\d+)", x); m === nothing ? missing : parse(Int, m.captures[1]))`
function pick_first_age(x)
    ismissing(x) && return missing
    nos = collect(string(0:9...))
    out = ""
    for xi in x[6:end] # drop Ages_
        !(xi in nos) && break
        out = out * xi
    end
    out == "" && return missing
    parse(Int, out)
end
transform!(legos, :ages => ByRow(pick_first_age) => :youngest_age)
legos.youngest_age = categorical(legos.youngest_age, ordered=true)
first(legos[:, r"age"], 2)
Row | ages | pages | youngest_age |
---|---|---|---|
String15 | Int64? | Cat…? | |
1 | Ages_6+ | missing | 6 |
2 | Ages_6+ | missing | 6 |
With that ordering, an expected pattern becomes clear – kits for older users have on average more pieces – though there are unexpected exceptions:
@chain legos begin
    groupby(:youngest_age)
    combine([:pieces, :unique_pieces] .=> mean∘skipmissing
                                      .=> [:avg_pieces, :avg_unique_pieces])
end
Row | youngest_age | avg_pieces | avg_unique_pieces |
---|---|---|---|
Cat…? | Float64 | Float64 | |
1 | missing | 127.592 | 40.5604 |
2 | 1 | 37.2083 | 26.2174 |
3 | 2 | 55.617 | 31.2979 |
4 | 3 | 140.0 | 41.0 |
5 | 4 | 194.74 | 87.2857 |
6 | 5 | 147.712 | 85.1557 |
7 | 6 | 191.079 | 87.1289 |
8 | 7 | 265.556 | 108.816 |
9 | 8 | 491.193 | 173.801 |
10 | 9 | 859.182 | 252.773 |
11 | 10 | 417.52 | 109.768 |
12 | 11 | 2357.86 | 214.143 |
13 | 12 | 1307.29 | 256.4 |
14 | 14 | 1617.25 | 349.5 |
15 | 16 | 2454.79 | 409.107 |
16 | 18 | 2073.84 | 214.294 |
(We composed the two functions skipmissing
with mean
using the \circ[tab]
operator and use .=>
in both places to avoid defining the transformation function twice.)
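Composition with ∘ is plain Julia; the composed function strips the missing values first, then averages what remains:

```julia
using Statistics   # standard-library package providing `mean`

avg = mean ∘ skipmissing    # equivalent to x -> mean(skipmissing(x))
avg([1, missing, 3])
```

Composing small, named functions this way keeps mini-language specifications short and reusable.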
Lego was the largest toy company in the world by 2021. By 2015 it had produced over 600 billion parts.
Lego is a mature company with an origin dating to 1934. The number of products per year should be fairly stable, though it is easy enough to check. The data set has a :year
variable, so we would only need to group the data by that, then count the number of cases per each group. The nrow
function will count the cases. This function is special cased in the DataFrames
’ mini language and can appear by itself:
combine(groupby(legos, :year), nrow) # use `nrow => :n`, or some such, to label differently
Row | year | nrow |
---|---|---|
Int64 | Int64 | |
1 | 2018 | 442 |
2 | 2019 | 423 |
3 | 2020 | 439 |
We can take other summaries by year, for example, here we find the average price by year and size:
@chain legos begin
    groupby([:year, :size])
    combine(nrow => :n,
            :price => mean∘skipmissing => :avg_price)
end
Row | year | size | n | avg_price |
---|---|---|---|---|
Int64 | String7? | Int64 | Float64 | |
1 | 2018 | Small | 324 | 45.7792 |
2 | 2018 | Large | 21 | 28.59 |
3 | 2018 | missing | 97 | 25.8135 |
4 | 2019 | Small | 322 | 50.4663 |
5 | 2019 | Large | 14 | 29.6329 |
6 | 2019 | missing | 87 | 13.8536 |
7 | 2020 | Small | 335 | 49.8104 |
8 | 2020 | Large | 18 | 33.5456 |
9 | 2020 | missing | 86 | 21.6959 |
The list of youngest ages is a bit long. We wish to collapse it into the ranges [1, 6), [6, 12), and [12, 18]. The data has been stored as a categorical array. The cut
function can group numeric data easily, but to group categorical data, the cut points must be promoted to CategoricalValue
s. We have the following, where Ref
is used to stop the broadcasting over that value:
vals = CategoricalValue.([1, 6, 12, 18], Ref(legos.youngest_age))
transform!(legos,
           :youngest_age => (x -> cut(x, vals; extend=true))
                         => :youngest_age_range);
We now check whether the number of pieces has changed dramatically over the years. We break up the data by the age range, as we expect variation in that value, so it is best to dis-aggregate over that factor:
@chain legos begin
groupby([:youngest_age_range, :year])
combine(nrow => :n,
:pieces => mean∘skipmissing => :avg_pieces)
end
Row | youngest_age_range | year | n | avg_pieces |
---|---|---|---|---|
Cat…? | Int64 | Int64 | Float64 | |
1 | missing | 2018 | 33 | 165.379 |
2 | missing | 2019 | 34 | 95.7778 |
3 | missing | 2020 | 44 | 111.8 |
4 | [1, 6) | 2018 | 98 | 131.561 |
5 | [1, 6) | 2019 | 92 | 122.533 |
6 | [1, 6) | 2020 | 92 | 153.413 |
7 | [6, 12) | 2018 | 292 | 341.866 |
8 | [6, 12) | 2019 | 275 | 389.945 |
9 | [6, 12) | 2020 | 276 | 365.887 |
10 | [12, 18] | 2018 | 19 | 1956.21 |
11 | [12, 18] | 2019 | 22 | 1938.95 |
12 | [12, 18] | 2020 | 27 | 2111.3 |
No real pattern shows up.
There are many different themes. The command unique(legos.theme)
shows 41. Which are the most popular? To see, the data is grouped by the theme, each group is counted, missing themes are dropped, then the data frame is sorted and the top 5 rows are shown:
@chain legos begin
groupby(:theme)
combine(nrow => :n)
dropmissing
sort(:n; rev=true)
first(5)
end
Row | theme | n |
---|---|---|
Cat… | Int64 | |
1 | Star Wars™ | 119 |
2 | Friends | 103 |
3 | City | 101 |
4 | NINJAGO® | 78 |
5 | DUPLO® | 53 |
The SplitApplyCombine package
There is support for the split-apply-combine paradigm outside of the DataFrames
package in the package SplitApplyCombine
. This package is much lighter weight than DataFrames
This support targets other representations of tabular data, such as a vector of named tuples:
using SplitApplyCombine
tbl = [
    (b = "Goruck GR2", v = 40, p = 395, lap = "Y", load = "front panel", d = Date("2022-09-01")),
    (b = "Minaal 3.0", v = 35, p = 349, lap = "Y", load = "front panel", d = Date("2022-09-01")),
    (b = "Genius ", v = 25, p = 228, lap = "Y", load = "clamshell", d = Date("2022-10-01"))
]
3-element Vector{@NamedTuple{b::String, v::Int64, p::Int64, lap::String, load::String, d::Date}}:
(b = "Goruck GR2", v = 40, p = 395, lap = "Y", load = "front panel", d = Date("2022-09-01"))
(b = "Minaal 3.0", v = 35, p = 349, lap = "Y", load = "front panel", d = Date("2022-09-01"))
(b = "Genius ", v = 25, p = 228, lap = "Y", load = "clamshell", d = Date("2022-10-01"))
Many of the main verbs are generic functions: filter
, map
, reduce
; group
does grouping. There are some others. A few examples follow.
For indexing this structure, we can’t index by `[row, col]`; rather, we use `[row][col]` notation: the first bracket extracts the row, the second accesses the entries in the column. For example, here we index a selected row by position, by name, and then with multiple names:
tbl[2][2], tbl[2].v, tbl[2][ [:v, :p] ]
(35, 35, (v = 35, p = 349))
The invert
function reverses the access pattern, allowing in this case, [col][row]
access, with:
itbl = invert(tbl)
itbl[3][1], itbl[:p][2]
(395, 349)
In some ways, the invert
function flips the viewpoint from an array of structs to a struct of arrays.
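The flip that `invert` performs can be sketched by hand in base Julia (the names `aos` and `soa` are ours, and the fields are abbreviated):

```julia
# An "array of structs": one named tuple per row.
aos = [(b = "GR2", v = 40), (b = "3.0", v = 35)]

# A "struct of arrays": one vector per field, built by hand.
soa = (; (k => [r[k] for r in aos] for k in keys(first(aos)))...)

soa.v  # column-first access: [40, 35]
```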
To filter out rows, we have the same calling style as with data frames: filter(pred, tbl)
. With the outer container of tbl
being an array, no special method is needed; iteration is over each row. For example,
filter(r -> r.v >= 35, tbl) |> DataFrame
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Int64 | String | String | Date | |
1 | Goruck GR2 | 40 | 395 | Y | front panel | 2022-09-01 |
2 | Minaal 3.0 | 35 | 349 | Y | front panel | 2022-09-01 |
(The call to DataFrame
is to take advantage of the prettier printing. DataFrame
can consume a vector of named tuples for its input.)
The DataPipes
package composes a bit more nicely with this approach, but we continue using Chain
for examples3, like this one of nested calls to filter
:
@chain tbl begin
    filter(r -> r.v >= 35, _)
    filter(r -> r.p >= 350, _)
    DataFrame
end
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Int64 | String | String | Date | |
1 | Goruck GR2 | 40 | 395 | Y | front panel | 2022-09-01 |
Selecting columns and adding a transformation can be achieved through map
, which iterates over each named tuple in this example:
map(r -> (b=r.b, v=r.v, p=r.p, pv = round(r.p/r.v; digits=2)), tbl) |>
DataFrame
Row | b | v | p | pv |
---|---|---|---|---|
String | Int64 | Int64 | Float64 | |
1 | Goruck GR2 | 40 | 395 | 9.88 |
2 | Minaal 3.0 | 35 | 349 | 9.97 |
3 | Genius | 25 | 228 | 9.12 |
The group
function can group by values of a function call. This example from the package’s Readme is instructive:
nms = ["Andrew Smith", "John Smith", "Alice Baker",
       "Robert Baker", "Jane Smith", "Jason Bourne"]
group(last, split.(nms))
3-element Dictionaries.Dictionary{SubString{String}, Vector{Vector{SubString{String}}}}
"Smith" │ Vector{SubString{String}}[["Andrew", "Smith"], ["John", "Smith"], […
"Baker" │ Vector{SubString{String}}[["Alice", "Baker"], ["Robert", "Baker"]]
"Bourne" │ Vector{SubString{String}}[["Jason", "Bourne"]]
The names are split on spaces, and the keys of the group are determined by the last
function, which here gets the last piece after splitting (aka the “last” name, but by coincidence, not intelligence).
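The key-extraction step is plain `Base` behavior, seen in isolation:

```julia
nm = split("Jane Smith")  # the SubString pieces: ["Jane", "Smith"]

# `last` and `first` pick off the ends of the split vector.
last(nm), first(nm)       # ("Smith", "Jane")
```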
For numeric operations, this is quite convenient. Here we group
by the remainder upon division by 3
:
group(x -> rem(x,3), 1:10)
3-element Dictionaries.Dictionary{Int64, Vector{Int64}}
1 │ [1, 4, 7, 10]
2 │ [2, 5, 8]
0 │ [3, 6, 9]
The grouping basically creates a dictionary and adds to the key determined by the function.
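That dictionary-building step can be written out directly; a minimal sketch in base Julia (`group_by` is our name, chosen to avoid clashing with `SplitApplyCombine.group`):

```julia
# A bare-bones grouping: key each value by f(x) and push it onto
# the vector stored under that key, creating the vector on demand.
function group_by(f, xs)
    d = Dict{Any,Vector{eltype(xs)}}()
    for x in xs
        push!(get!(d, f(x), eltype(xs)[]), x)
    end
    d
end

group_by(x -> rem(x, 3), 1:10)  # 1 => [1,4,7,10], 2 => [2,5,8], 0 => [3,6,9]
```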
To apply a function to the values of a base Julia
dictionary is not directly supported by map
, however the dictionary used by group
(from the Dictionaries
packages) does allow this, so we could, for example, add up the values in each group with:
map(sum, group(x -> rem(x,3), 1:10))
3-element Dictionaries.Dictionary{Int64, Int64}
1 │ 22
2 │ 15
0 │ 18
The group
function can do this in one call, by placing the function to apply in between the by
function and the table. Here we apply first
to pick off the first name. That is, ["Andrew", "Smith"]
becomes just "Andrew"
, the first element of that vector.
gps = group(last, first, split.(nms))
3-element Dictionaries.Dictionary{SubString{String}, Vector{SubString{String}}}
"Smith" │ SubString{String}["Andrew", "John", "Jane"]
"Baker" │ SubString{String}["Alice", "Robert"]
"Bourne" │ SubString{String}["Jason"]
Example 3.2 (A tally function) The Tally
package provides a quick way to tally up numbers. Here we do a simplified version.
To tally, we have three steps: group by each value, count the number in each group, and sort by key. This is implemented with this composition of functions:
tally(v; by=identity) = SplitApplyCombine.sortkeys(map(length, group(by, v)))
tally (generic function with 1 method)
To see, we have:
v = rand(1:5, 50)
d = tally(v)
5-element Dictionaries.Dictionary{Int64, Int64}
1 │ 9
2 │ 9
3 │ 11
4 │ 9
5 │ 12
Using map
, we can provide an alternate, oft used view for tallying:
tallies = "\u007C"^4
ftally = "┼┼┼┼ "
function prison_count(x)
    d, r = divrem(x, 5)
    ftally^d * tallies[1:r]
end
map(prison_count, d)
5-element Dictionaries.Dictionary{Int64, Any}
1 │ "┼┼┼┼ ||||"
2 │ "┼┼┼┼ ||||"
3 │ "┼┼┼┼ ┼┼┼┼ |"
4 │ "┼┼┼┼ ||||"
5 │ "┼┼┼┼ ┼┼┼┼ ||"
Example 3.3 (A simple stem and leaf diagram) The grouping and sorting made quite accessible by SplitApplyCombine could also be used to produce a simple stem-and-leaf type diagram. Here we don’t bother displaying stems with no associated leaves, which does change the interpretation of spread in the data. They could easily be added in, but doing so requires some manipulation of the display and takes us a bit afield:
x = rand(0:100, 32)   # 32 test scores
stem_unit = 10        # stem * 10 is value of stem
stem_unit @chain x begin
round.(Int, 10*_/stem_unit)
group(x -> x ÷ 10, x->rem(x,10), _)
SplitApplyCombine.sortkeys(_)
SplitApplyCombine.map(sort, _)
map(x -> parse(Int,join(x)), _)
end
10-element Dictionaries.Dictionary{Int64, Int64}
0 │ 1139
1 │ 78
2 │ 37
3 │ 478
4 │ 1489
5 │ 468
6 │ 233
7 │ 99
8 │ 55
9 │ 2478
Let’s return to the cars
data and use group
to identify the average mileage per type. For this task, we group by :Type
and then apply a function to extract the mileage from a row. The copy.(eachrow(cars))
command creates a vector of named tuples.
cs = copy.(eachrow(cars))  # Can also call `NamedTuple` for `copy`
gps = group(r -> r.Type, r -> r.MPGCity, cs)
6-element Dictionaries.Dictionary{CategoricalValue{String, UInt8}, Vector{Int32}}
CategoricalValue{String, UInt8} "Smal… │ Int32[25, 29, 23, 29, 31, 23, 46, 42,…
CategoricalValue{String, UInt8} "Mids… │ Int32[18, 19, 22, 22, 19, 16, 21, 21,…
CategoricalValue{String, UInt8} "Comp… │ Int32[20, 25, 25, 23, 22, 22, 24, 26,…
CategoricalValue{String, UInt8} "Larg… │ Int32[19, 16, 16, 17, 20, 20, 20, 18,…
CategoricalValue{String, UInt8} "Spor… │ Int32[19, 17, 18, 22, 24, 30, 24, 26,…
CategoricalValue{String, UInt8} "Van" │ Int32[18, 15, 17, 15, 18, 17, 18, 18,…
We now can map mean
over each group:
ms = map(mean, gps)
6-element Dictionaries.Dictionary{CategoricalValue{String, UInt8}, Float64}
CategoricalValue{String, UInt8} "Small" │ 29.857142857142858
CategoricalValue{String, UInt8} "Midsize" │ 19.545454545454547
CategoricalValue{String, UInt8} "Compact" │ 22.6875
CategoricalValue{String, UInt8} "Large" │ 18.363636363636363
CategoricalValue{String, UInt8} "Sporty" │ 21.785714285714285
CategoricalValue{String, UInt8} "Van" │ 17.0
This requires two passes through the grouped data, the first to get just the mileage values, the next to call mean
. The following DataFrames
syntax hides this:
combine(groupby(cars, :Type), :MPGHighway => mean)
Row | Type | MPGHighway_mean |
---|---|---|
Cat… | Float64 | |
1 | Compact | 29.875 |
2 | Large | 26.7273 |
3 | Midsize | 26.7273 |
4 | Small | 35.4762 |
5 | Sporty | 28.7857 |
6 | Van | 21.8889 |
Reductions, like sum and count, which can be written easily using the higher-order function reduce
, have support to combine the grouping and reduction. The groupreduce
function takes the additional arguments of reduce
(an operation and an optional init
value). For example, to get the group sum, we could call:
groupreduce(r -> r.Type, r -> r.MPGCity, +, cs)
6-element Dictionaries.Dictionary{CategoricalValue{String, UInt8}, Int32}
CategoricalValue{String, UInt8} "Small" │ 627
CategoricalValue{String, UInt8} "Midsize" │ 430
CategoricalValue{String, UInt8} "Compact" │ 363
CategoricalValue{String, UInt8} "Large" │ 202
CategoricalValue{String, UInt8} "Sporty" │ 305
CategoricalValue{String, UInt8} "Van" │ 153
The group sum and group count have shortcuts, so the mean could have been computed as:
groupsum(r -> r.Type, r -> r.MPGCity, cs) ./ groupcount(r -> r.Type, cs)
6-element Dictionaries.Dictionary{CategoricalValue{String, UInt8}, Float64}
CategoricalValue{String, UInt8} "Small" │ 29.857142857142858
CategoricalValue{String, UInt8} "Midsize" │ 19.545454545454547
CategoricalValue{String, UInt8} "Compact" │ 22.6875
CategoricalValue{String, UInt8} "Large" │ 18.363636363636363
CategoricalValue{String, UInt8} "Sporty" │ 21.785714285714285
CategoricalValue{String, UInt8} "Van" │ 17.0
(Unlike for a Dict
instance, the dictionaries above allow such division.)
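With a base `Dict`, the same key-wise division could instead be done with `mergewith`; a sketch using two of the group sums and counts from above:

```julia
# Group sums and counts for two of the car types (values from the text).
sums   = Dict("Small" => 627, "Van" => 153)
counts = Dict("Small" => 21,  "Van" => 9)

# `mergewith(op, d1, d2)` combines values sharing a key with `op`.
means = mergewith(/, sums, counts)  # "Small" => 29.857…, "Van" => 17.0
```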
The “combine” part of the paradigm takes the pieces and reassembles. The data frames example returns a data frame, whereas with SplitApplyCombine.jl
we have a dictionary. A dictionary does not map directly to a data frame. While both dictionaries and named tuples are associative arrays, we wouldn’t necessarily want to map our data to a row. Rather, here we use the pairs
iterator to iterate over the keys and values to produce an array of named tuples:
[(; Type, MPG) for (Type, MPG) in pairs(ms)] |> DataFrame
Row | Type | MPG |
---|---|---|
Cat… | Float64 | |
1 | Small | 29.8571 |
2 | Midsize | 19.5455 |
3 | Compact | 22.6875 |
4 | Large | 18.3636 |
5 | Sporty | 21.7857 |
6 | Van | 17.0 |
3.1.10 Aggregating data
Somewhat inverse to flattening data is to aggregate data, by which we mean combining multiple columns into a single one, or combining multiple rows into a single one. The select
and combine
methods can do these tasks.
Consider the three price variables in the cars
data. We can find an average of the three price numbers easily enough. This helper function accepts its values as separate arguments, not in a container:
_mean(xs...) = mean(xs)
_mean (generic function with 1 method)
We can then use select
to aggregate the three price variables. In the following, these are selected with a regular expression:
cars_price = select(cars, 1:3, r"Price" => ByRow(_mean) => :price)
first(cars_price, 3)
Row | Manufacturer | Model | Type | price |
---|---|---|---|---|
Cat… | String | Cat… | Float64 | |
1 | Acura | Integra | Small | 15.8667 |
2 | Acura | Legend | Midsize | 33.9333 |
3 | Audi | 90 | Compact | 29.1 |
To aggregate over rows, the groupby
construct can be used to specify the groupings to aggregate over. In the following, this is the car type. The Chain
style is used to make the steps more explicit:
@chain cars_price begin
groupby(:Type)
combine(:price => mean => :avg_price)
end
Row | Type | avg_price |
---|---|---|
Cat… | Float64 | |
1 | Compact | 18.2104 |
2 | Large | 24.303 |
3 | Midsize | 27.2152 |
4 | Small | 10.1667 |
5 | Sporty | 19.4024 |
6 | Van | 19.1111 |
As mean
is a reduction, the result has one row for each unique set of levels in the column selector specified to groupby
. Again, that these columns appear in the output is due to the chosen default value of `keepkeys`.
This similar example shows grouping by two variables and then aggregating miles per gallon over each:
@chain cars begin
groupby([:Type, :DriveTrain])
combine(:MPGHighway => mean => :avg_MPG)
sort(:avg_MPG, rev=true)
end
Row | Type | DriveTrain | avg_MPG |
---|---|---|---|
Cat… | Cat… | Float64 | |
1 | Small | Front | 35.6842 |
2 | Small | 4WD | 33.5 |
3 | Sporty | Front | 30.5714 |
4 | Compact | Front | 30.0769 |
5 | Compact | 4WD | 30.0 |
6 | Compact | Rear | 28.5 |
7 | Large | Front | 27.2857 |
8 | Midsize | Front | 27.1765 |
9 | Sporty | 4WD | 27.0 |
10 | Sporty | Rear | 27.0 |
11 | Large | Rear | 25.75 |
12 | Midsize | Rear | 25.2 |
13 | Van | Front | 22.5 |
14 | Van | 4WD | 21.4 |
3.1.11 Tidy data; stack, unstack
The split-apply-combine paradigm suggests that storing data so that it can be split readily is preferable to other ways of storage. The notion of “tidy data” (Wickham 2014), as presented in R’s tidyr
package and described in this vignette, holds that data sets should follow:
- Every column is a variable.
- Every row is an observation.
- Every cell is a single value.
These terms are defined by
A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Values are organised in two ways. Every value belongs to a variable and an observation. A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes.
In addition to flatten
, the DataFrames
package provides the functions stack
and unstack
for tidying data.
The stack
and unstack
commands are used to move from “wide” format, where related values are stored in multiple columns with column names representing some level, to “long” format, where related values are stored in one column, and their categorical levels in another column.
For example, the following data frame is in wide format, as the attributes are in columns `:A` and `:B`; the values are in columns `:C` and `:D`.
wide = DataFrame(A=["A","B"], B=["a","b"], C=["c1","c2"], D=["d1","d2"])
Row | A | B | C | D |
---|---|---|---|---|
String | String | String | String | |
1 | A | a | c1 | d1 |
2 | B | b | c2 | d2 |
With the stack
function, the values are placed into a single column (by default :value
) and the variable names placed into a single column (by default :variable
):
long = stack(wide, [:C, :D])
Row | A | B | variable | value |
---|---|---|---|---|
String | String | String | String | |
1 | A | a | C | c1 |
2 | B | b | C | c2 |
3 | A | a | D | d1 |
4 | B | b | D | d2 |
The values in the :variable
column come from the data frame names.
With unstack
the data can be spread back out into “wide” format. We pass in the two columns indicating the variable and the values to unstack
:
unstack(long, :variable, :value)
Row | A | B | C | D |
---|---|---|---|---|
String | String | String? | String? | |
1 | A | a | c1 | d1 |
2 | B | b | c2 | d2 |
This figure may help visualize the process.
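The wide-to-long mechanics can also be sketched by hand in base Julia with named tuples (an illustration of the idea, not how `stack` is implemented):

```julia
# The `wide` table as a vector of named tuples.
wide_rows = [(A = "A", B = "a", C = "c1", D = "d1"),
             (A = "B", B = "b", C = "c2", D = "d2")]

# One long row per (measured column, wide row) pair: the column name
# becomes the :variable entry, its cell becomes the :value entry.
long_rows = [(A = r.A, B = r.B, variable = String(v), value = r[v])
             for v in (:C, :D) for r in wide_rows]
```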
We illustrate this further with an abbreviated version of an example in the tidyr
vignette.
Consider this data from the Global Historical Climatology Network for one weather station (MX17004) in Mexico for five months in 2010. The full data has 31 days per month, this abbreviated data works with just the first 8:
weather = """
"id","year","month","element","d1","d2","d3","d4","d5","d6","d7","d8"
"MX17004","2010","1","tmax","---","---","---","---","---","---","---","---"
"MX17004","2010","1","tmin","---","---","---","---","---","---","---","---"
"MX17004","2010","2","tmax","---","27.3","24.1","---","---","---","---","---"
"MX17004","2010","2","tmin","---","14.4","14.4","---","---","---","---","---"
"MX17004","2010","3","tmax","---","---","---","---","32.1","---","---","---"
"MX17004","2010","3","tmin","---","---","---","---","14.2","---","---","---"
"MX17004","2010","4","tmax","---","---","---","---","---","---","---","---"
"MX17004","2010","4","tmin","---","---","---","---","---","---","---","---"
"MX17004","2010","5","tmax","---","---","---","---","---","---","---","---"
"MX17004","2010","5","tmin","---","---","---","---","---","---","---","---"
"""
w = CSV.read(IOBuffer(weather), DataFrame; missingstring="---")
first(w, 4)
Row | id | year | month | element | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
String7 | Int64 | Int64 | String7 | Missing | Float64? | Float64? | Missing | Float64? | Missing | Missing | Missing | |
1 | MX17004 | 2010 | 1 | tmax | missing | missing | missing | missing | missing | missing | missing | missing |
2 | MX17004 | 2010 | 1 | tmin | missing | missing | missing | missing | missing | missing | missing | missing |
3 | MX17004 | 2010 | 2 | tmax | missing | 27.3 | 24.1 | missing | missing | missing | missing | missing |
4 | MX17004 | 2010 | 2 | tmin | missing | 14.4 | 14.4 | missing | missing | missing | missing | missing |
This example is said to have variables stored in both rows and columns. The first step to tidying the data is to stack days 1 through 8 into a column. We use the Between
column selector to select the columns for d1
through d8
(an alternative to, say, r"^d"
):
w1 = @chain w begin
    stack(Between(:d1, :d8); variable_name = :day)
filter(:value => !ismissing, _)
end
Row | id | year | month | element | day | value |
---|---|---|---|---|---|---|
String7 | Int64 | Int64 | String7 | String | Float64? | |
1 | MX17004 | 2010 | 2 | tmax | d2 | 27.3 |
2 | MX17004 | 2010 | 2 | tmin | d2 | 14.4 |
3 | MX17004 | 2010 | 2 | tmax | d3 | 24.1 |
4 | MX17004 | 2010 | 2 | tmin | d3 | 14.4 |
5 | MX17004 | 2010 | 3 | tmax | d5 | 32.1 |
6 | MX17004 | 2010 | 3 | tmin | d5 | 14.2 |
As in the example being followed, missing values are skipped, leaving the reader to infer the lack of a measurement on a given day.
The day is coded with a leading “d” and as a string, not an integer. We can strip this prefix out using a regular expression or, as here, string indexing, and then parse the result to an integer:
transform!(w1, :day => ByRow(x -> parse(Int, x[2:end])) => :day)
Row | id | year | month | element | day | value |
---|---|---|---|---|---|---|
String7 | Int64 | Int64 | String7 | Int64 | Float64? | |
1 | MX17004 | 2010 | 2 | tmax | 2 | 27.3 |
2 | MX17004 | 2010 | 2 | tmin | 2 | 14.4 |
3 | MX17004 | 2010 | 2 | tmax | 3 | 24.1 |
4 | MX17004 | 2010 | 2 | tmin | 3 | 14.4 |
5 | MX17004 | 2010 | 3 | tmax | 5 | 32.1 |
6 | MX17004 | 2010 | 3 | tmin | 5 | 14.2 |
Finally, the element
column does not hold a variable; rather, it stores the names of two variables (tmin
and tmax
). For this, the data is “unstacked,” splitting the longer variable value
in two:
w2 = unstack(w1, :element, :value)
Row | id | year | month | day | tmax | tmin |
---|---|---|---|---|---|---|
String7 | Int64 | Int64 | Int64 | Float64? | Float64? | |
1 | MX17004 | 2010 | 2 | 2 | 27.3 | 14.4 |
2 | MX17004 | 2010 | 2 | 3 | 24.1 | 14.4 |
3 | MX17004 | 2010 | 3 | 5 | 32.1 | 14.2 |
As described in the vignette, now each variable is in one column, and each row describes one day.
3.1.12 Joins
Consider the three following assessments of some class. The students have somewhat spotty attendance:
t1 = """
first,last,points
Alice, Smith, 100
Bob, Jones, 200
"""
t2 = """
first,last,score
Carol, Gates, 125
Chad, Gad, 200
Bob, Jones, 225
"""
t3 = """
first,last,points
Bob, Jones, 300
Carol, Gates, 150
Erin, Dan, 100
"""
a1, a2, a3 = [CSV.read(IOBuffer(txt), DataFrame) for txt ∈ (t1, t2, t3)];
We have several ways to combine these three data frames.
We can concatenate them using vcat
. Before doing so, we need to align the column names, as the second data set used a different name:
select!(a2, 1:2, :score => :points);
The vcat
method for data frames takes a named argument source
allowing a variable to be appended indicating the source of the data frame. The following will indicate with a 1
, 2
, or 3
what data frame contributed what row:
aₛ = vcat(a1, a2, a3; source=:assignment)
describe(aₛ)
Row | variable | mean | min | median | max | nmissing | eltype |
---|---|---|---|---|---|---|---|
Symbol | Union… | Any | Union… | Any | Int64 | DataType | |
1 | first | Alice | Erin | 0 | String7 | ||
2 | last | Dan | Smith | 0 | String7 | ||
3 | points | 175.0 | 100 | 175.0 | 300 | 0 | Int64 |
4 | assignment | 2.125 | 1 | 2.0 | 3 | 0 | Int64 |
Similarly, append!
could be used in combination with reduce
(as in reduce(append!, (a1, a2, a3))
).
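A caution with that idiom: `append!` mutates its first argument, so the reduction folds everything into the leftmost collection. A base-Julia sketch, using an `init` copy to leave the originals alone (the vectors here stand in for the data frames):

```julia
v1, v2, v3 = [1, 2], [3], [4, 5]

# Fold v2 and v3 onto a copy of v1; without `copy`, v1 itself grows.
all_rows = reduce(append!, (v2, v3); init = copy(v1))  # [1, 2, 3, 4, 5]
```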
A more common alternative to the above is to join the data frames using one of several join commands. Joins allow multiple tables to be merged, or mashed, into one. We briefly illustrate the implementations in DataFrames
of outer, left, right, inner, semi, and anti joins, leaving the cross join and further details to the documentation.
We return to the original data, with the differently named columns:
a1, a2, a3 = [CSV.read(IOBuffer(txt), DataFrame) for txt ∈ (t1, t2, t3)];
The first join we introduce is outerjoin
. It combines two data frames with rows for keys found in any of the data frames passed in.
outerjoin(a1, a2, on=["first", "last"])
Row | first | last | points | score |
---|---|---|---|---|
String7 | String7 | Int64? | Int64? | |
1 | Bob | Jones | 200 | 225 |
2 | Alice | Smith | 100 | missing |
3 | Carol | Gates | missing | 125 |
4 | Chad | Gad | missing | 200 |
The on
argument is a column name (or names as above) to join the data frames on. They need not have the same name; the values passed to on
may be pairs indicating the mapping. (That is, were we joining on the final column of each, on=:points => :score
would be possible.)
The variable to join on is often called the “key,” though in Julia
that terminology is also used with associative arrays.
The joined data frame includes all columns from either data frame not specified in on
. The resulting data frame, in this case, is in a wide, not tidy, format.
In the above, the columns do not have names in conflict. However, were a1
and a3
joined, they would have been. Passing makeunique=true
instructs distinguishing names to be produced:
outerjoin(a1, a3, on=["first", "last"], makeunique=true)
Row | first | last | points | points_1 |
---|---|---|---|---|
String7 | String7 | Int64? | Int64? | |
1 | Bob | Jones | 200 | 300 |
2 | Alice | Smith | 100 | missing |
3 | Carol | Gates | missing | 150 |
4 | Erin | Dan | missing | 100 |
More than two data frames may be passed to outerjoin
, as illustrated below where all three are. The result includes a row for each person (key) in any of the data sets.
outerjoin(a1, a2, a3, on=["first", "last"], makeunique=true)
Row | first | last | points | score | points_1 |
---|---|---|---|---|---|
String7 | String7 | Int64? | Int64? | Int64? | |
1 | Bob | Jones | 200 | 225 | 300 |
2 | Carol | Gates | missing | 125 | 150 |
3 | Alice | Smith | 100 | missing | missing |
4 | Chad | Gad | missing | 200 | missing |
5 | Erin | Dan | missing | missing | 100 |
From the outerjoin
, we easily note that only the first listed student was present for all the assessments.
Three other joins related to outerjoin
are:
- leftjoin(df1, df2; on, ...) which includes all rows in the left data frame
- rightjoin(df1, df2; on, ...) which includes all rows in the right data frame (similar to leftjoin(df2, df1))
- innerjoin(df1, df2; on, ...) which includes all rows with keys in both data frames
To illustrate:
leftjoin(a1, a2, on=["first", "last"])
Row | first | last | points | score |
---|---|---|---|---|
String7 | String7 | Int64 | Int64? | |
1 | Bob | Jones | 200 | 225 |
2 | Alice | Smith | 100 | missing |
rightjoin(a1, a2, on=["first", "last"])
Row | first | last | points | score |
---|---|---|---|---|
String7 | String7 | Int64? | Int64 | |
1 | Bob | Jones | 200 | 225 |
2 | Carol | Gates | missing | 125 |
3 | Chad | Gad | missing | 200 |
In the following, we must qualify the function below, as SplitApplyCombine
has a conflicting use, though this isn’t typically needed:
DataFrames.innerjoin(a1, a2, on=["first", "last"])
Row | first | last | points | score |
---|---|---|---|---|
String7 | String7 | Int64 | Int64 | |
1 | Bob | Jones | 200 | 225 |
Joins have an analog in set operations: the outer join is like a union, the inner join is like an intersection.
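That analogy can be made concrete on the keys alone; a base-Julia sketch with hypothetical roster data:

```julia
left  = Dict("Bob" => 200, "Alice" => 100)  # name => points
right = Dict("Bob" => 225, "Carol" => 125)  # name => score

inner_keys = intersect(keys(left), keys(right))  # like innerjoin: keys in both
outer_keys = union(keys(left), keys(right))      # like outerjoin: keys in either
anti_keys  = setdiff(keys(left), keys(right))    # like antijoin: left-only keys
```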
The next two joins do not join columns, as the ones above do. Rather, they use the second data frame to filter out rows of the first data frame.
The semijoin
returns the rows of the left data frame that match a key of the right data frame:
semijoin(a1, a2[3:3, :]; on=["first", "last"])
Row | first | last | points |
---|---|---|---|
String7 | String7 | Int64 | |
1 | Bob | Jones | 200 |
An antijoin
includes rows in the left data frame that do not match a key in the right data frame (like the set difference). Here we see the row common to a1
and a2
is not included:
antijoin(a1, a2, on=["first", "last"])
Row | first | last | points |
---|---|---|---|
String7 | String7 | Int64 | |
1 | Alice | Smith | 100 |
There are other storage formats available within Julia that can contain more detail on the data, e.g. Arrow.↩︎

This combination of column selectors is also performed by Cols(1, :v, "p", r"^l").↩︎

The DataPipes package inserts a magical `_` in the last positional argument, which is common with higher-order functions like filter. The example could become @p tbl |> filter(_.v >= 25) |> filter(_.p >= 350), with the new lines replacing the pipe operation as a convenience with longer chains.↩︎