3 Tabular data
Some data sets are naturally stored in a tabular format, like a spreadsheet. This chapter looks at such data.
In the following, we will utilize these basic packages, among others introduced along the way:
using StatsBase, StatsPlots
3.1 Data frames
The basic framework for exploratory data analysis is a data set with one or more observations for \(n\) different cases. For each case (or observation) a variable will be a length-\(n\) data set, with values measuring the same type of quantity, or perhaps missing. The variables can be stored as vectors. The collection of variables is traditionally arranged in a rectangular manner, with each row representing the measurements for a given case and each column representing the measured values for each variable. That is, the columns are homogeneous (as they measure the same thing), but the rows need not be (as different variables are measuring different quantities).
Matrices contain rectangular data, but are homogeneous. As such, a new structure is needed to store tabular data in general. A data frame is a concept borrowed from S-PLUS, where data frames were introduced in 1991. The R language has data frames as a basic built-in data container. In Julia the structure is implemented in an external package, DataFrames.
using DataFrames
The DataFrames package dates back to 2012, at least, and must be one of Julia’s oldest packages. Julia was on version 0.1 then. It is also under continual development and has a comprehensive set of documentation and tutorials. Bouchet-Valat and Kamiński (2023) has an excellent overview, including comparisons to implementations in other languages and context for certain design decisions. This introduction attempts to illustrate some basic usage; consult the provided documentation for additional details and applications.
3.1.1 Data frame construction
There are different ways to construct a data frame.
Consider the task of the Wirecutter in trying to select the best carry-on travel bag. After compiling a list of possible models by scouring travel blogs, etc., they select some criteria (capacity, compartment design, aesthetics, comfort, …) and compile data, similar to what one person collected in a spreadsheet. Here we create a much simplified spreadsheet for 3 listed bags with measurements of volume, price, laptop compatibility, loading style, and a last-checked date – as this market changes constantly.
product v p l loads checked
Goruck GR2 40 395 Y front panel 2022-09
Minaal 3.0 35 349 Y front panel 2022-09
Genius 25 228 Y clamshell 2022-10
We see that product is a character, volume and price numeric, laptop compatibility a Boolean-like value, load style one of a few levels, and the last checked date a year-month date.
We create vectors to hold each. We load the CategoricalArrays and Dates packages for a few of the variables:
using CategoricalArrays, Dates
product = ["Goruck GR2", "Minaal 3.0", "Genius"]
volume = [40, 35, 25]
price = [395, 349, 228]
laptop_compatability = categorical(["Y","Y","Y"])
loading_style = categorical(["front panel", "front panel", "clamshell"])
date_checked = Date.(2022, [9,9,10])
3-element Vector{Date}:
2022-09-01
2022-09-01
2022-10-01
With this, we use the DataFrame constructor to combine these into one data set.
d = DataFrame(product = product, volume = volume, price = price,
              var"laptop compatability" = laptop_compatability,
              var"loading style" = loading_style, var"date checked" = date_checked)
Row | product | volume | price | laptop compatability | loading style | date checked |
---|---|---|---|---|---|---|
String | Int64 | Int64 | Cat… | Cat… | Date | |
1 | Goruck GR2 | 40 | 395 | Y | front panel | 2022-09-01 |
2 | Minaal 3.0 | 35 | 349 | Y | front panel | 2022-09-01 |
3 | Genius | 25 | 228 | Y | clamshell | 2022-10-01 |
The call to the constructor is like that of a named tuple and, again, var"..." is employed for names with spaces.
As called above, the constructor binds vectors of equal length into a new multi-variate container.
The describe
function will summarize a data frame using different summaries for different types:
describe(d)
Row | variable | mean | min | median | max | nmissing | eltype |
---|---|---|---|---|---|---|---|
Symbol | Union… | Any | Any | Any | Int64 | DataType | |
1 | product | Genius | Minaal 3.0 | 0 | String | ||
2 | volume | 33.3333 | 25 | 35.0 | 40 | 0 | Int64 |
3 | price | 324.0 | 228 | 349.0 | 395 | 0 | Int64 |
4 | laptop compatability | Y | Y | 0 | CategoricalValue{String, UInt32} | ||
5 | loading style | clamshell | front panel | 0 | CategoricalValue{String, UInt32} | ||
6 | date checked | 2022-09-01 | 2022-09-01 | 2022-10-01 | 0 | Date |
In the above construction, we repeated the names of the variables to the constructor. A style, also akin to the construction of a named tuple, could have been used, where variables after a semicolon are implicitly assumed to have the desired name:
d = DataFrame(; product, volume, price,
              var"laptop compatability" = laptop_compatability,
              var"loading style" = loading_style, var"date checked" = date_checked)
Row | product | volume | price | laptop compatability | loading style | date checked |
---|---|---|---|---|---|---|
String | Int64 | Int64 | Cat… | Cat… | Date | |
1 | Goruck GR2 | 40 | 395 | Y | front panel | 2022-09-01 |
2 | Minaal 3.0 | 35 | 349 | Y | front panel | 2022-09-01 |
3 | Genius | 25 | 228 | Y | clamshell | 2022-10-01 |
(In fact, with the rename! function, or other conveniences, it can be more direct to define the data frame, then rename the columns, avoiding the var above.)
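To illustrate the parenthetical remark, here is a minimal sketch of the construct-then-rename approach; the frame `d2` and its two columns are hypothetical, not from the example above:

```julia
using DataFrames

# build with legal identifier names first ...
d2 = DataFrame(product = ["Goruck GR2", "Minaal 3.0", "Genius"],
               laptop_compatability = ["Y", "Y", "Y"])
# ... then rename the column to the desired display name with a space
rename!(d2, :laptop_compatability => "laptop compatability")
names(d2)  # ["product", "laptop compatability"]
```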
We see that the DataFrame
type displays values as we might expect with the column header showing the name of the variable and the storage type. The row numbers are added and may be used for reference to the data values.
There is a convention for attaching new columns to a data frame that allows another alternate construction:
d = DataFrame() # empty data frame
d.product = product
d.volume = volume
d.price = price
d."laptop compatability" = laptop_compatability
d."loading style" = loading_style
d."date checked" = date_checked
d
Row | product | volume | price | laptop compatability | loading style | date checked |
---|---|---|---|---|---|---|
String | Int64 | Int64 | Cat… | Cat… | Date | |
1 | Goruck GR2 | 40 | 395 | Y | front panel | 2022-09-01 |
2 | Minaal 3.0 | 35 | 349 | Y | front panel | 2022-09-01 |
3 | Genius | 25 | 228 | Y | clamshell | 2022-10-01 |
This uses one of a few different ways to refer to a column in a data frame.
As another alternative, the insertcols! function can insert multiple columns; for example, the last one in the example could have been written insertcols!(d, "date checked" => date_checked).
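A short sketch of insertcols! adding more than one column at once; the frame `d2` here is a made-up example, not the chapter's `d`:

```julia
using DataFrames, Dates

d2 = DataFrame(product = ["Goruck GR2", "Minaal 3.0", "Genius"])
# several name => values pairs can be given in one call; with no position
# specified, the new columns are appended at the end
insertcols!(d2, "volume" => [40, 35, 25],
                "date checked" => Date.(2022, [9, 9, 10]))
names(d2)  # ["product", "volume", "date checked"]
```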
Before exploring how we might query the data frame, we note that an alternate means to describe the data set is by row, or case. For each row, we might use a named tuple with the names indicating the variable and the values the measurement. The DataFrame constructor would accept this; here we shorten the names and only do one row (for now):
d = DataFrame([
    (b = "Goruck GR2", v = 40, p = 395, lap = "Y", load = "front panel", d = Date("2022-09-01"))
])
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Int64 | String | String | Date | |
1 | Goruck GR2 | 40 | 395 | Y | front panel | 2022-09-01 |
Here, the variable types are different, as it takes more effort to make one categorical. The same way we defined a column allows us to redefine that column (replacing the container, not re-using it for storage). For example, we could do:
d.lap = categorical(d.lap)
d.load = categorical(d.load)
1-element CategoricalArray{String,1,UInt32}:
"front panel"
and then the two variables would be categorical.
There are other ways to define row by row:
- A data frame with empty variables can be constructed and then values as tuples may be push!ed onto the data frame:
push!(d, ("Minaal 3.0", 35, 349, "Y", "front panel", Date("2022-09-01")))
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Int64 | Cat… | Cat… | Date | |
1 | Goruck GR2 | 40 | 395 | Y | front panel | 2022-09-01 |
2 | Minaal 3.0 | 35 | 349 | Y | front panel | 2022-09-01 |
A named tuple can also be used.
- Similarly, rows specified by dictionaries may be pushed onto a data frame
push!(d, Dict(:b => "Genius", :v => 25, :p => 228, :lap => "Y",
:load => "clamshell", :d => Date("2022-10-01")))
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Int64 | Cat… | Cat… | Date | |
1 | Goruck GR2 | 40 | 395 | Y | front panel | 2022-09-01 |
2 | Minaal 3.0 | 35 | 349 | Y | front panel | 2022-09-01 |
3 | Genius | 25 | 228 | Y | clamshell | 2022-10-01 |
(A dictionary is a key => value
container like a named tuple, but keys may be arbitrary Julia
objects – not always symbols – so we explicitly use symbols in the above command.)
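The named-tuple variant of push! mentioned above can be sketched as follows; the small frame `d2` is illustrative only:

```julia
using DataFrames

# an empty frame with typed, empty columns
d2 = DataFrame(b = String[], v = Int[], p = Int[])
# a named tuple with matching names may be pushed, just like a plain tuple
push!(d2, (b = "Genius", v = 25, p = 228))
```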
Tables interface
In the Julia ecosystem, the Tables package provides a lightweight means to describe tabular data, like a data frame; TableOperations provides some basic means to query the data. The different ways in which a data frame can be defined reflect the different ways Tables objects can be created.
As tables are viewed as rows of homogeneous types, a vector of named tuples is a natural description; tables are also naturally viewed as a collection of named columns, each column a vector. As such, both of these specify a table:
- As a struct of arrays. This is used by calling DataFrame(a=[1,2], b=[2,1]), say.
- As an array of structs. This is used by calling DataFrame([(a=1,b=2), (a=2,b=1)]).
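The two views can be recovered from a data frame with the Tables API; a brief sketch (the frame `df` is the same toy data as above):

```julia
using DataFrames, Tables

df = DataFrame(a = [1, 2], b = [2, 1])
# column-oriented view: a named tuple of vectors ("struct of arrays")
ct = Tables.columntable(df)  # (a = [1, 2], b = [2, 1])
# row-oriented view: a vector of named tuples ("array of structs")
rt = Tables.rowtable(df)     # [(a = 1, b = 2), (a = 2, b = 1)]
```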
RDatasets
Data frames are often read in from external sources, and not typed in. The RDatasets package has collected data sets from many different R packages and bundled them into data frames for Julia. We will use some of these data sets in the examples. To load a data set we need to know its R package and R name:
using RDatasets
cars = dataset("MASS", "Cars93") # package, data set name
first(cars, 2)                   # first 2 rows only
Row | Manufacturer | Model | Type | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | AirBags | DriveTrain | Cylinders | EngineSize | Horsepower | RPM | RevPerMile | ManTransAvail | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | RearSeatRoom | LuggageRoom | Weight | Origin | Make |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cat… | String | Cat… | Float64 | Float64 | Float64 | Int32 | Int32 | Cat… | Cat… | Cat… | Float64 | Int32 | Int32 | Int32 | Cat… | Float64 | Int32 | Int32 | Int32 | Int32 | Int32 | Float64? | Int32? | Int32 | Cat… | String | |
1 | Acura | Integra | Small | 12.9 | 15.9 | 18.8 | 25 | 31 | None | Front | 4 | 1.8 | 140 | 6300 | 2890 | Yes | 13.2 | 5 | 177 | 102 | 68 | 37 | 26.5 | 11 | 2705 | non-USA | Acura Integra |
2 | Acura | Legend | Midsize | 29.2 | 33.9 | 38.7 | 18 | 25 | Driver & Passenger | Front | 6 | 3.2 | 200 | 5500 | 2335 | Yes | 18.0 | 5 | 195 | 115 | 71 | 38 | 30.0 | 15 | 3560 | non-USA | Acura Legend |
CSV.read
For interacting with spreadsheet data, when it is stored like a data frame, it may be convenient to export the data in some format and then read that into a Julia session. A common format is comma-separated values, or “csv” data. The CSV package reads in delimited text data (the source) using the Tables interface and can be instructed to write to a data frame (the sink). The CSV.read function is used below. Here we write to some file to show that we can read it back:
using CSV
f = tempname() * ".csv"
CSV.write(f, d) # write simple data frame to file `f`
d1 = CSV.read(f, DataFrame)
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String15 | Int64 | Int64 | String1 | String15 | Date | |
1 | Goruck GR2 | 40 | 395 | Y | front panel | 2022-09-01 |
2 | Minaal 3.0 | 35 | 349 | Y | front panel | 2022-09-01 |
3 | Genius | 25 | 228 | Y | clamshell | 2022-10-01 |
Comma-separated-value files are simple text files, with no metadata to indicate what type a value may be. Some guesswork is necessary. We observe that the two categorical variables were read back in as strings. As seen above, that can be addressed through column conversion.
The filename may be more general. For example, it could be download(url) for some URL that points to a csv file, to read data from the internet.
The methods read
and write
are qualified in the above usage with the CSV
module. In the Julia
ecosystem, the FileIO
package provides a common framework for reading and writing files; it uses the verbs load
and save
. This can also be used with DataFrames
, though it works through the CSVFiles
package – and not CSV
, as illustrated above. The read command would look like DataFrame(load(fname))
and the write command like save(fname, df)
. Here fname
would have a “.csv” extension so that the type of file could be detected.
Command | Description
---|---
CSV.read(file_name, DataFrame) | Read csv file from file with given name
CSV.read(IOBuffer(string), DataFrame) | Read csv file from a string using an IOBuffer
CSV.read(download(url), DataFrame) | Read csv file from a url
open(file_name, "w") do io; CSV.write(io, df); end | Write data frame df to csv file
DataFrame(load(file_name)) | Read csv file from file with given name using CSVFiles
save(file_name, df) | Write data frame df to a csv file using CSVFiles
TableScraper
The TableScraper
package can scrape tables from an HTML file in a manner that can be passed to DataFrame
for conversion to a data frame. The command DataFrame.(scrape_tables(url))
may just work to create the tables, but in practice, only well-formed tables can be scraped successfully; tables used for formatting purposes may fail. TableScraper
is built on Cascadia
and Gumbo
, as is the Harbest
package, which provides more access to the underlying data than TableScraper
.
3.1.2 Column names
The variable names of a data set are returned by the names
function, as strings:
names(d)
6-element Vector{String}:
"b"
"v"
"p"
"lap"
"load"
"d"
This function has a second argument which can be used to narrow or filter the names that are returned. For example, the following is a convenience for matching columns whose element type is a subtype of Real
:
names(d, Real)
2-element Vector{String}:
"v"
"p"
The rename! function allows the names to be changed in-place (without returning a new data frame). There are many different calling patterns, but:
- To replace one name, the form rename!(df, :v => :nv) is used, where :v is the old name and :nv the new one. The form follows the generic replace! function. The pair notation can be broadcast (.=>) to rename more than one column, as with rename!(df, [:v1, :v2] .=> [:nv1, :nv2]).
- To replace all names, the form rename!(df, [...]) is used, where the vector is of length matching the number of columns in df.
3.1.3 Indexing and assignment
The values in a data frame can be referenced programmatically by a row number and column number, both 1-based. For example, the 2nd row and 3rd column of d can be seen to be 349 by observation:
d
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Int64 | Cat… | Cat… | Date | |
1 | Goruck GR2 | 40 | 395 | Y | front panel | 2022-09-01 |
2 | Minaal 3.0 | 35 | 349 | Y | front panel | 2022-09-01 |
3 | Genius | 25 | 228 | Y | clamshell | 2022-10-01 |
but it can be referenced through indexing (row number first, column number second):
d[2, 3]
349
A new value can also be assigned with the same indexing reference. Suppose, that bag went on sale. This new price could be adjusted, as follows:
d[2, 3] = 309 # a sale price
309
Column selectors
The 3rd column has a name, p
in this case. We can use a column selector to avoid having to know which column number a variable is. For this a string or a symbol may be used, as in:
d[2, :p], d[2, "p"]
(309, 309)
More generally, it is possible to use more than 1 column in column selection. For example, to get both the volume and price, for the second bag we have:
d[2, [:v, :p]]
Row | v | p |
---|---|---|
Int64 | Int64 | |
2 | 35 | 309 |
As well, either [2, 3]
or ["v", "p"]
could have been used.
For data frames with many columns, a regular expression can be used. The column selector r"^l" would match all columns whose name begins with l (lap and load in this case). (Regular expressions can be very complicated, but here we only assume that r"name" will match “name” somewhere in the string; r"^name" and r"name$" will match “name” at the beginning and ending of a string.) Using a regular expression will return a data frame row (when a row index is specified) – not a value – as it is possible to return 0, 1, or more columns in the selection.
Column selectors, as seen, may be column number(s), column names as strings, column names as symbols, or regular expressions.
In addition:
- Boolean vectors of a length matching the number of columns (d[2, [false, true, true, false, false, false]] would return the 2nd and 3rd columns).
- The value All(), which selects all columns.
- The value Between(a, b), where a and b are column selectors (such as Between(:v, 3)).
- Not(...), to select those columns not specified in ....
- Cols(a, b, ...), which combines different selectors, such as Cols(:v, 3).
Slices
The All() column selector has a shortcut inherited from the indexing of arrays in base Julia, this being a colon, :. When used in the column position it refers to all the values in the row:
d[2, :]
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Int64 | Cat… | Cat… | Date | |
2 | Minaal 3.0 | 35 | 309 | Y | front panel | 2022-09-01 |
When used in the row position, it refers to all rows:
d[:, 3]
3-element Vector{Int64}:
395
309
228
In the above use, a vector is returned, not a data frame, as only one column was selected. But which vector? Is it a copy of the one that d
holds, or is it the same container? In base Julia
the answer depends on what side of an equals sign the use is on (meaning, if used in assignment the answer is different, as mentioned later). In this usage, a copy is returned.
To get the underlying column (not a copy), the selector ! is used instead:
d[!, 3]
3-element Vector{Int64}:
395
309
228
The notation d.p also refers to the 3rd variable. In this case, d.p is a view of the underlying data (using !) and not a copy.
The notation d[:, :] will refer to all rows and all columns of d, and in this case will be a copy of the data frame.
Assignment
Single values can be assigned, as previously illustrated. These values are assigned to the underlying vector in the data frame, so as with vectors, the assigned value will be promoted to the column type, if necessary. Assigning a value that can’t be promoted will throw an error, as it would were such an assignment tried for a vector. To do such an assignment, the underlying vector must be changed to a wider type.
The row selectors : and ! have different semantics when accessing the underlying data, the first returning a copy, the second a view. When used with assignment, though, : refers to the underlying vector and does in-place assignment, whereas ! (and the dot access) will replace the underlying container with the assigned container.
To illustrate, suppose the prices are rounded down, but actually contain an extra 99 cents. We can reassign price as follows:
d[!, :p] = price .+ 0.99
3-element Vector{Float64}:
395.99
349.99
228.99
as even though price .+ 0.99 is no longer an integer vector, the ! row indicator will instruct d to point to this new vector. Had d[:, :p] = ... been used above, an error would have been thrown, as the underlying container is integer, but the new prices require floating point values to represent them.
Broadcast assignment (using .=) will be similar. The right-hand side is expanded, and then assigned. Attempts to assign the wrong value type using : to indicate the rows will error.
Missing values
Trying to assign a value of missing to a data frame requires the underlying vector to allow missing values. As with vectors, the allowmissing function is useful. For example, to give the 2nd bag a price of missing, we could do:
allowmissing!(d, :p)
d[2, :p] = missing
missing
By using the mutating form (with the trailing !), the data frame d is mutated in the allowmissing! call. This could be observed by looking at the underlying column type.
The new type is a union of the old type and Missing, so we could reassign the old value:
d[2, :p] = 349.99
d
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Float64? | Cat… | Cat… | Date | |
1 | Goruck GR2 | 40 | 395.99 | Y | front panel | 2022-09-01 |
2 | Minaal 3.0 | 35 | 349.99 | Y | front panel | 2022-09-01 |
3 | Genius | 25 | 228.99 | Y | clamshell | 2022-10-01 |
Tables.jl interface
A table, like a data frame, can be expected to loop, or iterate, over columns, say to find which columns are numeric. Or, it can be expected to iterate over rows, say to find which values are within a certain range. What should be the case with a data frame? As there are reasons to prefer either, neither is specified. Rather, there is a difference based on the calling function. Data frames respect the Tables interface, which has methods to iterate over rows or columns. When iterating over columns, the individual vectors are given; when iterating over rows, a DataFrameRow object is returned. Calling copy on these row objects will return a named tuple.
The basic iterators are eachrow
and eachcol
. For example, to see the column type for a data frame, we can broadcast eltype
over the variables returned by eachcol
:
eltype.(eachcol(d)) |> permutedims
1×6 Matrix{Type}:
String Int64 Union{Missing, Float64} … Date
To get the data frame as a vector of named tuples, broadcasting NamedTuple over eachrow can be used: NamedTuple.(eachrow(d)); Base.copy is an alias. Tables.rowtable(d) will produce this vector of named tuples, as well.
3.1.4 Sorting
A data frame can be sorted by specifying a permutation of the indices, such as is returned by sortperm. For example, d[sortperm(d.p), :] will sort by price in increasing order. The keyword argument rev can be passed true to reverse the default sort order.
The generic sort function has a method for data frames that offers more convenient functionality. The sort! method does in-place sorting. The simplest call is to pass in a column to sort by, as in
sort(d, :p)
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Float64? | Cat… | Cat… | Date | |
1 | Genius | 25 | 228.99 | Y | clamshell | 2022-10-01 |
2 | Minaal 3.0 | 35 | 349.99 | Y | front panel | 2022-09-01 |
3 | Goruck GR2 | 40 | 395.99 | Y | front panel | 2022-09-01 |
This sorts by the price (:p
) variable in increasing order. To sort by decreasing order a keyword argument rev=true
can be specified:
sort(d, "v"; rev=true)
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Float64? | Cat… | Cat… | Date | |
1 | Goruck GR2 | 40 | 395.99 | Y | front panel | 2022-09-01 |
2 | Minaal 3.0 | 35 | 349.99 | Y | front panel | 2022-09-01 |
3 | Genius | 25 | 228.99 | Y | clamshell | 2022-10-01 |
There can be one or more column selectors. For this, the order function can be used to reverse just one of the columns, as in the following, which first sorts the dates in reverse order (newest first) and then, within a date, uses lowest price first to sort:
sort(d, [order("d", rev=true), :p])
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Float64? | Cat… | Cat… | Date | |
1 | Genius | 25 | 228.99 | Y | clamshell | 2022-10-01 |
2 | Minaal 3.0 | 35 | 349.99 | Y | front panel | 2022-09-01 |
3 | Goruck GR2 | 40 | 395.99 | Y | front panel | 2022-09-01 |
3.1.5 Filtering rows
Filtering, or subsetting, is a means to select a specified subset of rows.
Let’s consider the cars
data set, previously defined by loading dataset("MASS", "Cars93")
. This data set consists of 27 measurements on 93 cars from the 1990s. The data set comes from R’s MASS
package, and is documented there.
Two ready ways to filter are by using specific indices, or by using Boolean indices generated by a logical comparison.
For the cars data set, the latter can be used to extract the Volkswagen models. To generate the indices we compare the :Manufacturer variable with the string Volkswagen to return a vector of length 93 that indicates if the manufacturer is VW. This is done with broadcasting: cars.Manufacturer .== "Volkswagen". We use the easy dot access to variables in cars; alternatively, cars[:, :Manufacturer] (or, more efficiently, !, as is used by the dot access) could be used were :Manufacturer coming from some variable. With this row selector, we have:
cars[cars.Manufacturer .== "Volkswagen", :]
Row | Manufacturer | Model | Type | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | AirBags | DriveTrain | Cylinders | EngineSize | Horsepower | RPM | RevPerMile | ManTransAvail | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | RearSeatRoom | LuggageRoom | Weight | Origin | Make |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cat… | String | Cat… | Float64 | Float64 | Float64 | Int32 | Int32 | Cat… | Cat… | Cat… | Float64 | Int32 | Int32 | Int32 | Cat… | Float64 | Int32 | Int32 | Int32 | Int32 | Int32 | Float64? | Int32? | Int32 | Cat… | String | |
1 | Volkswagen | Fox | Small | 8.7 | 9.1 | 9.5 | 25 | 33 | None | Front | 4 | 1.8 | 81 | 5500 | 2550 | Yes | 12.4 | 4 | 163 | 93 | 63 | 34 | 26.0 | 10 | 2240 | non-USA | Volkswagen Fox |
2 | Volkswagen | Eurovan | Van | 16.6 | 19.7 | 22.7 | 17 | 21 | None | Front | 5 | 2.5 | 109 | 4500 | 2915 | Yes | 21.1 | 7 | 187 | 115 | 72 | 38 | 34.0 | missing | 3960 | non-USA | Volkswagen Eurovan |
3 | Volkswagen | Passat | Compact | 17.6 | 20.0 | 22.4 | 21 | 30 | None | Front | 4 | 2.0 | 134 | 5800 | 2685 | Yes | 18.5 | 5 | 180 | 103 | 67 | 35 | 31.5 | 14 | 2985 | non-USA | Volkswagen Passat |
4 | Volkswagen | Corrado | Sporty | 22.9 | 23.3 | 23.7 | 18 | 25 | None | Front | 6 | 2.8 | 178 | 5800 | 2385 | Yes | 18.5 | 4 | 159 | 97 | 66 | 36 | 26.0 | 15 | 2810 | non-USA | Volkswagen Corrado |
This approach lends itself to the description “find all rows matching some value” then “extract the identified rows” – written as two steps to emphasize there are two passes through the data. Another mental model would be to loop over the rows and keep those that match the query. This is done generically by the filter function for collections in Julia, or by the subset function of DataFrames.
The filter(predicate, collection) function is used to identify just the values in the collection for which the predicate function returns true. When a data frame is used with filter, the iteration is over the rows, so the wrapping eachrow iterator is not needed. We need a predicate function to replace the .== above; one follows. It doesn’t need .==, as r is a data frame row and access produces a value, not a vector:
filter(r -> r.Manufacturer == "Volkswagen", cars)
Row | Manufacturer | Model | Type | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | AirBags | DriveTrain | Cylinders | EngineSize | Horsepower | RPM | RevPerMile | ManTransAvail | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | RearSeatRoom | LuggageRoom | Weight | Origin | Make |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cat… | String | Cat… | Float64 | Float64 | Float64 | Int32 | Int32 | Cat… | Cat… | Cat… | Float64 | Int32 | Int32 | Int32 | Cat… | Float64 | Int32 | Int32 | Int32 | Int32 | Int32 | Float64? | Int32? | Int32 | Cat… | String | |
1 | Volkswagen | Fox | Small | 8.7 | 9.1 | 9.5 | 25 | 33 | None | Front | 4 | 1.8 | 81 | 5500 | 2550 | Yes | 12.4 | 4 | 163 | 93 | 63 | 34 | 26.0 | 10 | 2240 | non-USA | Volkswagen Fox |
2 | Volkswagen | Eurovan | Van | 16.6 | 19.7 | 22.7 | 17 | 21 | None | Front | 5 | 2.5 | 109 | 4500 | 2915 | Yes | 21.1 | 7 | 187 | 115 | 72 | 38 | 34.0 | missing | 3960 | non-USA | Volkswagen Eurovan |
3 | Volkswagen | Passat | Compact | 17.6 | 20.0 | 22.4 | 21 | 30 | None | Front | 4 | 2.0 | 134 | 5800 | 2685 | Yes | 18.5 | 5 | 180 | 103 | 67 | 35 | 31.5 | 14 | 2985 | non-USA | Volkswagen Passat |
4 | Volkswagen | Corrado | Sporty | 22.9 | 23.3 | 23.7 | 18 | 25 | None | Front | 6 | 2.8 | 178 | 5800 | 2385 | Yes | 18.5 | 4 | 159 | 97 | 66 | 36 | 26.0 | 15 | 2810 | non-USA | Volkswagen Corrado |
There are some conveniences that have been developed to make the predicate functions even easier to write, which will be introduced incrementally:
The basic binary comparisons, like ==
, are defined by functions essentially of the form (a,b) -> ==(a,b)
. There are partially applied versions, essentially a -> ==(a,b)
, with b
applied, formed by ==(b)
. So the syntax ==("Volkswagen")
is a predicate taking a single argument which is then compared for equality to the string "Volkswagen"
.
However, r
is a data frame row, what we need to pass to ==("Volkswagen")
is the value of the :Manufacturer
column. For this another convenience is available.
The pair
syntax, =>
of the form column selector(s) => predicate function
will pass just the value(s) in the column(s) to the predicate function. Combined, the above can be simplified to:
filter(:Manufacturer => ==("Volkswagen"), cars)
Row | Manufacturer | Model | Type | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | AirBags | DriveTrain | Cylinders | EngineSize | Horsepower | RPM | RevPerMile | ManTransAvail | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | RearSeatRoom | LuggageRoom | Weight | Origin | Make |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cat… | String | Cat… | Float64 | Float64 | Float64 | Int32 | Int32 | Cat… | Cat… | Cat… | Float64 | Int32 | Int32 | Int32 | Cat… | Float64 | Int32 | Int32 | Int32 | Int32 | Int32 | Float64? | Int32? | Int32 | Cat… | String | |
1 | Volkswagen | Fox | Small | 8.7 | 9.1 | 9.5 | 25 | 33 | None | Front | 4 | 1.8 | 81 | 5500 | 2550 | Yes | 12.4 | 4 | 163 | 93 | 63 | 34 | 26.0 | 10 | 2240 | non-USA | Volkswagen Fox |
2 | Volkswagen | Eurovan | Van | 16.6 | 19.7 | 22.7 | 17 | 21 | None | Front | 5 | 2.5 | 109 | 4500 | 2915 | Yes | 21.1 | 7 | 187 | 115 | 72 | 38 | 34.0 | missing | 3960 | non-USA | Volkswagen Eurovan |
3 | Volkswagen | Passat | Compact | 17.6 | 20.0 | 22.4 | 21 | 30 | None | Front | 4 | 2.0 | 134 | 5800 | 2685 | Yes | 18.5 | 5 | 180 | 103 | 67 | 35 | 31.5 | 14 | 2985 | non-USA | Volkswagen Passat |
4 | Volkswagen | Corrado | Sporty | 22.9 | 23.3 | 23.7 | 18 | 25 | None | Front | 6 | 2.8 | 178 | 5800 | 2385 | Yes | 18.5 | 4 | 159 | 97 | 66 | 36 | 26.0 | 15 | 2810 | non-USA | Volkswagen Corrado |
It doesn’t save much for this simple example, but this specification style is the basis of the DataFrames mini language, which provides a standardized and performant means of wrangling a data frame.
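The subset function mentioned earlier uses the same pairing; a minimal sketch on made-up data (note the predicate receives whole columns, so ByRow adapts a scalar predicate):

```julia
using DataFrames

df = DataFrame(maker = ["VW", "VW", "Honda"], mpg = [25, 17, 30])
# subset pairs a column selector with a predicate, like filter, but the
# predicate is applied to whole columns; ByRow wraps a per-value test
subset(df, :maker => ByRow(==("VW")))
```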
Filtering may be done on one or more values. There are different approaches.
For example, to first filter by the manufacturer, then by city mileage, we might write a more complicated predicate function using &&
to combine Boolean values:
pred(r) = r.Manufacturer == "Volkswagen" && r.MPGCity >= 20
filter(pred, cars)
Row | Manufacturer | Model | Type | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | AirBags | DriveTrain | Cylinders | EngineSize | Horsepower | RPM | RevPerMile | ManTransAvail | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | RearSeatRoom | LuggageRoom | Weight | Origin | Make |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cat… | String | Cat… | Float64 | Float64 | Float64 | Int32 | Int32 | Cat… | Cat… | Cat… | Float64 | Int32 | Int32 | Int32 | Cat… | Float64 | Int32 | Int32 | Int32 | Int32 | Int32 | Float64? | Int32? | Int32 | Cat… | String | |
1 | Volkswagen | Fox | Small | 8.7 | 9.1 | 9.5 | 25 | 33 | None | Front | 4 | 1.8 | 81 | 5500 | 2550 | Yes | 12.4 | 4 | 163 | 93 | 63 | 34 | 26.0 | 10 | 2240 | non-USA | Volkswagen Fox |
2 | Volkswagen | Passat | Compact | 17.6 | 20.0 | 22.4 | 21 | 30 | None | Front | 4 | 2.0 | 134 | 5800 | 2685 | Yes | 18.5 | 5 | 180 | 103 | 67 | 35 | 31.5 | 14 | 2985 | non-USA | Volkswagen Passat |
The mini language can also be used. On the left hand side of (the first) =>
a vector of column selectors may be indicated, in which the predicate function (on the right hand side of the first =>
) will get multiple arguments that can be used to implement the logic:
pred(m, mpg) = m == "Volkswagen" && mpg >= 20
filter([:Manufacturer, :MPGCity] => pred, cars)
Row | Manufacturer | Model | Type | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | AirBags | DriveTrain | Cylinders | EngineSize | Horsepower | RPM | RevPerMile | ManTransAvail | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | RearSeatRoom | LuggageRoom | Weight | Origin | Make |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cat… | String | Cat… | Float64 | Float64 | Float64 | Int32 | Int32 | Cat… | Cat… | Cat… | Float64 | Int32 | Int32 | Int32 | Cat… | Float64 | Int32 | Int32 | Int32 | Int32 | Int32 | Float64? | Int32? | Int32 | Cat… | String | |
1 | Volkswagen | Fox | Small | 8.7 | 9.1 | 9.5 | 25 | 33 | None | Front | 4 | 1.8 | 81 | 5500 | 2550 | Yes | 12.4 | 4 | 163 | 93 | 63 | 34 | 26.0 | 10 | 2240 | non-USA | Volkswagen Fox |
2 | Volkswagen | Passat | Compact | 17.6 | 20.0 | 22.4 | 21 | 30 | None | Front | 4 | 2.0 | 134 | 5800 | 2685 | Yes | 18.5 | 5 | 180 | 103 | 67 | 35 | 31.5 | 14 | 2985 | non-USA | Volkswagen Passat |
Finally, we could use filter
twice:
cars1 = filter(:Manufacturer => ==("Volkswagen"), cars)
cars2 = filter(:MPGCity => >=(20), cars1)
Row | Manufacturer | Model | Type | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | AirBags | DriveTrain | Cylinders | EngineSize | Horsepower | RPM | RevPerMile | ManTransAvail | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | RearSeatRoom | LuggageRoom | Weight | Origin | Make |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cat… | String | Cat… | Float64 | Float64 | Float64 | Int32 | Int32 | Cat… | Cat… | Cat… | Float64 | Int32 | Int32 | Int32 | Cat… | Float64 | Int32 | Int32 | Int32 | Int32 | Int32 | Float64? | Int32? | Int32 | Cat… | String | |
1 | Volkswagen | Fox | Small | 8.7 | 9.1 | 9.5 | 25 | 33 | None | Front | 4 | 1.8 | 81 | 5500 | 2550 | Yes | 12.4 | 4 | 163 | 93 | 63 | 34 | 26.0 | 10 | 2240 | non-USA | Volkswagen Fox |
2 | Volkswagen | Passat | Compact | 17.6 | 20.0 | 22.4 | 21 | 30 | None | Front | 4 | 2.0 | 134 | 5800 | 2685 | Yes | 18.5 | 5 | 180 | 103 | 67 | 35 | 31.5 | 14 | 2985 | non-USA | Volkswagen Passat |
The above required the introduction of an intermediate data frame to store the result of the first filter
call to pass to the second. This threading through of the modified data is quite common in processing pipelines. The first two approaches with complicated predicate functions can grow unwieldy, so staged modification is common. To support that, the chaining or piping operation (|>
) is often used:
filter(:Manufacturer => ==("Volkswagen"), cars) |>
    d -> filter(:MPGCity => >=(20), d)
Row | Manufacturer | Model | Type | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | AirBags | DriveTrain | Cylinders | EngineSize | Horsepower | RPM | RevPerMile | ManTransAvail | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | RearSeatRoom | LuggageRoom | Weight | Origin | Make |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cat… | String | Cat… | Float64 | Float64 | Float64 | Int32 | Int32 | Cat… | Cat… | Cat… | Float64 | Int32 | Int32 | Int32 | Cat… | Float64 | Int32 | Int32 | Int32 | Int32 | Int32 | Float64? | Int32? | Int32 | Cat… | String | |
1 | Volkswagen | Fox | Small | 8.7 | 9.1 | 9.5 | 25 | 33 | None | Front | 4 | 1.8 | 81 | 5500 | 2550 | Yes | 12.4 | 4 | 163 | 93 | 63 | 34 | 26.0 | 10 | 2240 | non-USA | Volkswagen Fox |
2 | Volkswagen | Passat | Compact | 17.6 | 20.0 | 22.4 | 21 | 30 | None | Front | 4 | 2.0 | 134 | 5800 | 2685 | Yes | 18.5 | 5 | 180 | 103 | 67 | 35 | 31.5 | 14 | 2985 | non-USA | Volkswagen Passat |
The right-hand side of |>
expects a function for application, so an anonymous function is created. There are macros available in user-contributed packages that shorten this pattern by using a placeholder.
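The mechanics of |> can be seen with plain vectors, independent of data frames. This is a minimal sketch (the values here are made up for illustration); an anonymous function names the intermediate value of a staged computation:

```julia
# Base's |> passes the left-hand value to the function on its right.
xs = [25, 17, 21, 33]

bigs = filter(>=(20), xs) |>        # keep values at least 20 ...
       v -> filter(<(30), v)        # ... then, of those, keep values below 30
```

Each stage receives the previous stage's result, so no intermediate variable needs to be introduced.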
A popular one is @chain
, in the Chain
package, with its basic usage based on two simple principles:
- each new line is implicitly a stage of the pipe
- a
_
receives the value from the previous stage; if no _
is given, the value is passed as the first argument
For example,
using Chain
@chain cars begin
filter(:Manufacturer => ==("Volkswagen"), _)
filter(:MPGCity => >=(20), _)
end
Row | Manufacturer | Model | Type | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | AirBags | DriveTrain | Cylinders | EngineSize | Horsepower | RPM | RevPerMile | ManTransAvail | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | RearSeatRoom | LuggageRoom | Weight | Origin | Make |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cat… | String | Cat… | Float64 | Float64 | Float64 | Int32 | Int32 | Cat… | Cat… | Cat… | Float64 | Int32 | Int32 | Int32 | Cat… | Float64 | Int32 | Int32 | Int32 | Int32 | Int32 | Float64? | Int32? | Int32 | Cat… | String | |
1 | Volkswagen | Fox | Small | 8.7 | 9.1 | 9.5 | 25 | 33 | None | Front | 4 | 1.8 | 81 | 5500 | 2550 | Yes | 12.4 | 4 | 163 | 93 | 63 | 34 | 26.0 | 10 | 2240 | non-USA | Volkswagen Fox |
2 | Volkswagen | Passat | Compact | 17.6 | 20.0 | 22.4 | 21 | 30 | None | Front | 4 | 2.0 | 134 | 5800 | 2685 | Yes | 18.5 | 5 | 180 | 103 | 67 | 35 | 31.5 | 14 | 2985 | non-USA | Volkswagen Passat |
Subset
The calling style filter(predicate, collection)
is consistently used with several higher-order functions, for example, map
and reduce
and is supported by Julia
’s do
syntax.
However, with data frames, the convention is to have action functions expect the data frame in the first position.
The subset
function also returns a filtered copy of the data frame with rows matching certain selection criteria. Its usage is somewhat similar to filter
with the order of its arguments reversed, but the transformation must return a Boolean vector with one value per row, so broadcasting or the ByRow
function may be necessary.
For example, these produce the same data frame:
d1 = subset(cars, :Manufacturer => m -> m .== "Volkswagen")      # use .== not just ==
d2 = subset(cars, :Manufacturer => ByRow(==("Volkswagen")))      # use `ByRow` to ensure the predicate is applied to each row
Row | Manufacturer | Model | Type | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | AirBags | DriveTrain | Cylinders | EngineSize | Horsepower | RPM | RevPerMile | ManTransAvail | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | RearSeatRoom | LuggageRoom | Weight | Origin | Make |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cat… | String | Cat… | Float64 | Float64 | Float64 | Int32 | Int32 | Cat… | Cat… | Cat… | Float64 | Int32 | Int32 | Int32 | Cat… | Float64 | Int32 | Int32 | Int32 | Int32 | Int32 | Float64? | Int32? | Int32 | Cat… | String | |
1 | Volkswagen | Fox | Small | 8.7 | 9.1 | 9.5 | 25 | 33 | None | Front | 4 | 1.8 | 81 | 5500 | 2550 | Yes | 12.4 | 4 | 163 | 93 | 63 | 34 | 26.0 | 10 | 2240 | non-USA | Volkswagen Fox |
2 | Volkswagen | Eurovan | Van | 16.6 | 19.7 | 22.7 | 17 | 21 | None | Front | 5 | 2.5 | 109 | 4500 | 2915 | Yes | 21.1 | 7 | 187 | 115 | 72 | 38 | 34.0 | missing | 3960 | non-USA | Volkswagen Eurovan |
3 | Volkswagen | Passat | Compact | 17.6 | 20.0 | 22.4 | 21 | 30 | None | Front | 4 | 2.0 | 134 | 5800 | 2685 | Yes | 18.5 | 5 | 180 | 103 | 67 | 35 | 31.5 | 14 | 2985 | non-USA | Volkswagen Passat |
4 | Volkswagen | Corrado | Sporty | 22.9 | 23.3 | 23.7 | 18 | 25 | None | Front | 6 | 2.8 | 178 | 5800 | 2385 | Yes | 18.5 | 4 | 159 | 97 | 66 | 36 | 26.0 | 15 | 2810 | non-USA | Volkswagen Corrado |
Unlike filter
, subset
allows multiple arguments to be applied:
subset(cars,
:Manufacturer => ByRow(==("Volkswagen")),
:MPGCity => ByRow(>=(20)))
Row | Manufacturer | Model | Type | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | AirBags | DriveTrain | Cylinders | EngineSize | Horsepower | RPM | RevPerMile | ManTransAvail | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | RearSeatRoom | LuggageRoom | Weight | Origin | Make |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cat… | String | Cat… | Float64 | Float64 | Float64 | Int32 | Int32 | Cat… | Cat… | Cat… | Float64 | Int32 | Int32 | Int32 | Cat… | Float64 | Int32 | Int32 | Int32 | Int32 | Int32 | Float64? | Int32? | Int32 | Cat… | String | |
1 | Volkswagen | Fox | Small | 8.7 | 9.1 | 9.5 | 25 | 33 | None | Front | 4 | 1.8 | 81 | 5500 | 2550 | Yes | 12.4 | 4 | 163 | 93 | 63 | 34 | 26.0 | 10 | 2240 | non-USA | Volkswagen Fox |
2 | Volkswagen | Passat | Compact | 17.6 | 20.0 | 22.4 | 21 | 30 | None | Front | 4 | 2.0 | 134 | 5800 | 2685 | Yes | 18.5 | 5 | 180 | 103 | 67 | 35 | 31.5 | 14 | 2985 | non-USA | Volkswagen Passat |
The mini language uses a default of identity
for the function, so if the columns are Boolean, they can be used to subset the data:
dd = DataFrame(a = [true, false], b = [true, missing])
subset(dd, :a) # just first row
Row | a | b |
---|---|---|
Bool | Bool? | |
1 | true | true |
For data with missing values, like in the :b
column of dd
, the comparison operators implement 3-valued logic, meaning the response can be true
, false
, or missing
. Using subset(dd, :b)
above will error. The keyword skipmissing
can be passed true
to have these skipped over. The default is false.
subset(dd, :b; skipmissing=true)
Row | a | b |
---|---|---|
Bool | Bool? | |
1 | true | true |
The dropmissing
function filters out each row that has a missing
value.
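The three-valued logic behind this behavior can be seen without data frames; this sketch uses only Base functionality:

```julia
b = [true, missing, false]

missing >= 1             # comparisons involving missing yield missing, not true/false
coalesce.(b, false)      # one option: replace missing with false explicitly
collect(skipmissing(b))  # another option: drop the missing entries entirely
```

The skipmissing=true keyword to subset corresponds to the second approach: rows whose predicate evaluates to missing are treated as not selected.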
3.1.6 Selecting columns; transforming data
Filtering of a data frame can be viewed as accessing the data in df
with a boolean set of row indices, as in df[inds, :]
. (A "view" of the data, avoiding a copy, can be had with view(df, inds, :) or the @view macro.) Filtering returns a data frame with a possibly reduced number of rows, and the same number of columns.
The select
function is related in that it can be viewed as passing a Boolean vector of column indices to select specific columns or variables, as in df[:, inds]
. The syntax of select
also extends this allowing transformations of the data and reassignment, as could be achieved through this pattern: df.new_variable = f(other_variables_in_df)
.
The select
function will return a data frame with the same number of rows (cases) as the original data frame. The variables returned are selected using column selectors. There is support for the DataFrames
mini language.
First, we can select one or more columns by separating column selectors with commas. In the following, using the backpack data in the variable d
, we select the first column by column number, the second using a symbol, the third by a string, and the fourth and fifth using a regular expression:2
select(d, 1, :v, "p", r"^l")
Row | b | v | p | lap | load |
---|---|---|---|---|---|
String | Int64 | Float64? | Cat… | Cat… | |
1 | Goruck GR2 | 40 | 395.99 | Y | front panel |
2 | Minaal 3.0 | 35 | 349.99 | Y | front panel |
3 | Genius | 25 | 228.99 | Y | clamshell |
One use of select
is to rearrange the column order, say by moving a column to the front. The :
or All()
column selector will add the rest. That is this command would move “d
” to the front of the other columns: select(d, :d, All())
.
Selection also allows an easy renaming of column names, using the pairs notation column name => new column name
:
select(d, :v => :volume, "p" => "price")
Row | volume | price |
---|---|---|
Int64 | Float64? | |
1 | 40 | 395.99 |
2 | 35 | 349.99 |
3 | 25 | 228.99 |
This usage is a supported abbreviation of source_column => function => target_column_name
, where the function is identity
. Here the function is not necessarily a predicate function, as it is with subset
, but rather a function which returns a vector of the same length as the number of rows in the data frame (as select
always returns the same number of rows as the data frame it is passed).
The function receives column vectors and should return a column vector. For a scalar function, it must be applied to each entry of the column vector, that is, each row. Broadcasting is one manner to do this. As used previously, the DataFrames
package provides ByRow
to also apply a function to each row of the column vector(s).
For example to compute a ratio of price to volume, we might have:
select(d, [:p, :v] => ByRow((p,v) -> p/v) => "price to volume")
Row | price to volume |
---|---|
Float64 | |
1 | 9.89975 |
2 | 9.99971 |
3 | 9.1596 |
The above would be equivalent to select(d, [:p, :v] => ((p,v) -> p./v) => "price to volume")
(broadcasted p ./ v
and parentheses), but not select(d, [:p, :v] => (p,v) -> p./v => "price to volume")
, as the precedence rules for =>
do not parse the expressions identically. That is, the use of parentheses around an anonymous function is needed.
The specification of a target column name is not necessary, as one will be generated. In this particular example, it would be :p_v_function
. For named functions, the automatic namings are a bit more informative, as the method name replaces the “function
” part. The renamecols
keyword argument can be set to false
to drop the addition of a function name, which in this case would leave :p_v
as the name. For a single-column selector it would not rename the column, rather replace it.
The above reduced the number of columns to just that specified. The :
selector will select all columns (as it does with indexing) and return them:
select(d, [:p, :v] => ByRow((p,v) -> p/v) => "price to volume", :) # return all columns
Row | price to volume | b | v | p | lap | load | d |
---|---|---|---|---|---|---|---|
Float64 | String | Int64 | Float64? | Cat… | Cat… | Date | |
1 | 9.89975 | Goruck GR2 | 40 | 395.99 | Y | front panel | 2022-09-01 |
2 | 9.99971 | Minaal 3.0 | 35 | 349.99 | Y | front panel | 2022-09-01 |
3 | 9.1596 | Genius | 25 | 228.99 | Y | clamshell | 2022-10-01 |
As seen in this last example, the source column selectors can select more than one column. The function receives multiple arguments. It is possible that there be multiple target columns. For example, we could compute both price-to-volume and volume-to-price quantities. A function to do so would be
pv_and_vp(p,v) = [p/v, v/p]
pv_and_vp (generic function with 1 method)
When we use this, we see the two values are stored in one column.
select(d, [:p, :v] => ByRow(pv_and_vp))
Row | p_v_pv_and_vp |
---|---|
Array… | |
1 | [9.89975, 0.101013] |
2 | [9.99971, 0.100003] |
3 | [9.1596, 0.109175] |
To split these up, we can give two names:
select(d, [:p, :v] => ByRow(pv_and_vp) => [:pv, :vp])
Row | pv | vp |
---|---|---|
Float64 | Float64 | |
1 | 9.89975 | 0.101013 |
2 | 9.99971 | 0.100003 |
3 | 9.1596 | 0.109175 |
AsTable
The AsTable
function has two meanings in the mini language. If on the target side, it assumes the output of the function is a vector of containers (in this example the vectors [p/v, v/p]
) that should be expanded into multiple columns. It uses the keys (when the containers are named tuples) or, as in this case, generates them, to create multiple columns in the destination:
select(d, [:p, :v] => ByRow(pv_and_vp) => AsTable)
Row | x1 | x2 |
---|---|---|
Float64 | Float64 | |
1 | 9.89975 | 0.101013 |
2 | 9.99971 | 0.100003 |
3 | 9.1596 | 0.109175 |
Had we used a named tuple in our function, pv_and_vp(p, v) = (pv = p/v, vp = v/p)
, then the column names would come from the tuple’s names and need not be generated.
When AsTable
is used on the source columns, as in AsTable([:p,:v])
then the function receives a named tuple with column vector entries. (For this, pv_and_vp
would be written pv_and_vp(r) = [r.p/r.v, r.v/r.p]
, with r
representing a row in a two-column table with names :p
and :v
.)
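A named tuple can stand in for such a row to check the rewritten function directly. This is a sketch; the row below is made up for illustration and simply mimics what ByRow with AsTable would pass:

```julia
# With AsTable on the source side (and ByRow), the function sees one
# named tuple per row, with fields named after the selected columns.
pv_and_vp(r) = [r.p / r.v, r.v / r.p]

row = (p = 400.0, v = 40)   # a hypothetical row with fields :p and :v
pv_and_vp(row)
</imports>

```

Testing a row function this way, outside of select, is often the quickest path to debugging a mini-language transformation.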
Transform
Extending the columns in the data frame by select
is common enough that the function transform
is supplied, which always keeps the columns of the original data frame, though they can also be modified through the mini language. The use of transform is equivalent to select(df, :, args...)
.
The mini language is well documented in the documentation string of transform
and the blog post of Bogumił Kamiński, "DataFrames.jl minilanguage explained."
3.1.7 Combine
The combine
function creates a new data frame whose columns are the result of transformations applied to the source data frame. The name will be clearer after we discuss grouping or splitting of a data set.
A typical usage of combine is to apply some reduction to the data, for example finding an average.
In this example, we use the mean
function from the StatsBase
package and apply that to the :MPGCity
measurements in the cars
data:
combine(cars, :MPGCity => mean)
Row | MPGCity_mean |
---|---|
Float64 | |
1 | 22.3656 |
(A second pair can be used to specify a name for the target, as in :MPGCity => mean => :AvgMPGCity
)
There are several numeric variables in this data set; we may wish to find the mean of more than one at a time. For this, we can broadcast the columns via .=>
, as in this example:
combine(cars, [:MPGCity, :MPGHighway] .=> mean)
Row | MPGCity_mean | MPGHighway_mean |
---|---|---|
Float64 | Float64 | |
1 | 22.3656 | 29.086 |
(That is an alternate to combine(cars, :MPGCity => mean, :MPGHighway => mean)
.)
Broadcasting =>
lends itself to succinctly applying a function to multiple columns.
For example, to apply mean
to all the numeric values, we might select them first (using the filtering feature of names
), then pass to the mini language, as follows:
nms = names(cars, Real)
combine(cars, nms .=> mean; renamecols=false)
Row | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | EngineSize | Horsepower | RPM | RevPerMile | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | Weight |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 17.1258 | 19.5097 | 21.8989 | 22.3656 | 29.086 | 2.66774 | 143.828 | 5280.65 | 2332.2 | 16.6645 | 5.08602 | 183.204 | 103.946 | 69.3763 | 38.957 | 3072.9 |
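Broadcasting .=> is ordinary Julia broadcasting, nothing specific to data frames: it produces a vector of pairs, which combine accepts as a single argument. A quick check in plain Julia (sum here is just a stand-in function):

```julia
# `.=>` pairs each element of the vector with the same right-hand value
cols  = [:MPGCity, :MPGHighway]
specs = cols .=> sum       # a Vector of Pair{Symbol, typeof(sum)}
```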
More than one transformation at a time is possible, as already seen in the mini language. With combine
the number of rows is determined by the output of the transformation. In this example, the mean
reduces to a number, but unique
reduces :Origin
to the number of levels. The result of mean
is repeated; it is not the mean for each group (a task we demonstrate shortly):
combine(cars, :MPGCity => mean, :Origin => unique)
Row | MPGCity_mean | Origin_unique |
---|---|---|
Float64 | String | |
1 | 22.3656 | non-USA |
2 | 22.3656 | USA |
3.1.8 Flatten
As mentioned, the repeat
function is used to repeat values in a vector, or other iterable. The simplest usage looks like this:
repeat([1,2,3], 2) |> show
[1, 2, 3, 1, 2, 3]
A common use of repetition is to take two variables, say x
and y
, perhaps of unequal length and create a data frame with two columns: one to store the values and one to indicate if a value came from x
or y
. This could be achieved like
x = [1,2,3]
y = [4, 5]

DataFrame(variable = vcat(repeat([:x], length(x)), repeat([:y], length(y))),
          values   = vcat(x, y))
Row | variable | values |
---|---|---|
Symbol | Int64 | |
1 | x | 1 |
2 | x | 2 |
3 | x | 3 |
4 | y | 4 |
5 | y | 5 |
The flatten
function offers an alternative. Consider the data frame:
x = [1,2,3]
y = [4, 5]
dxy = DataFrame(variable=[:x,:y], value=[x, y])
Row | variable | value |
---|---|---|
Symbol | Array… | |
1 | x | [1, 2, 3] |
2 | y | [4, 5] |
The flatten
function will iterate over the indicated columns and repeat the other variables:
flatten(dxy, :value)
Row | variable | value |
---|---|---|
Symbol | Int64 | |
1 | x | 1 |
2 | x | 2 |
3 | x | 3 |
4 | y | 4 |
5 | y | 5 |
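What flatten does to the indicated column can be sketched in Base: each value in the other columns is repeated to the length of its container, and the containers themselves are concatenated:

```julia
variable = [:x, :y]
value    = [[1, 2, 3], [4, 5]]

ks = reduce(vcat, fill.(variable, length.(value)))   # repeat each key to match its container
vs = reduce(vcat, value)                             # concatenate the containers
```

The two resulting vectors have equal length and line up row by row, just as in the flattened data frame.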
3.1.9 SplitApplyCombine
The split-apply-combine strategy for wrangling data frames was popularized in the influential paper "The Split-Apply-Combine Strategy for Data Analysis" (Wickham 2011) by H. Wickham, which introduces this description: "where you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together."
Grouping data
The groupby
function for data frames does the breaking up, the DataFrames
mini language can operate on each piece, and the combine
function can put things together again.
Grouping takes one or more categorical variables and creates a "grouped" data frame for each different set of combined levels. That is, there is one sub data frame for each observed combination of the levels. The groupby
function splits the data frame into pieces of type GroupedDataFrame
. The basic signature is groupby(df, cols; sort=nothing, skipmissing=false)
. The cols
are one or more column selectors. The value of sort
defaults to nothing
leaving the order of the groups in the result undefined, but the grouping uses the fastest identified algorithm. The skipmissing
argument can instruct the skipping of rows which have missing values in the selected columns.
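The splitting step can be mimicked in Base with a dictionary keyed by the grouping value. This is only a sketch; the rows below are made-up named tuples echoing the backpack data:

```julia
rows = [(load = "front panel", v = 40),
        (load = "front panel", v = 35),
        (load = "clamshell",   v = 25)]

# collect the rows into one vector per distinct key
groups = Dict{String, Vector{eltype(rows)}}()
for r in rows
    push!(get!(groups, r.load, eltype(rows)[]), r)
end
```

A GroupedDataFrame plays the same role, but stores row indices into the parent data frame rather than copies of the rows.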
For example, grouping the baggage data by loading style, returns two sub data frames.
gdf = groupby(d, :load)
GroupedDataFrame with 2 groups based on key: load
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Float64? | Cat… | Cat… | Date | |
1 | Goruck GR2 | 40 | 395.99 | Y | front panel | 2022-09-01 |
2 | Minaal 3.0 | 35 | 349.99 | Y | front panel | 2022-09-01 |
⋮
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Float64? | Cat… | Cat… | Date | |
1 | Genius | 25 | 228.99 | Y | clamshell | 2022-10-01 |
Grouping by more than one variable is achieved by passing in a selector with multiple variables. The following also creates two sub data frames, but only because the observed combinations of load and date coincide with the levels of load alone:
gdf = groupby(d, [:load, :d])
GroupedDataFrame with 2 groups based on keys: load, d
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Float64? | Cat… | Cat… | Date | |
1 | Goruck GR2 | 40 | 395.99 | Y | front panel | 2022-09-01 |
2 | Minaal 3.0 | 35 | 349.99 | Y | front panel | 2022-09-01 |
⋮
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Float64? | Cat… | Cat… | Date | |
1 | Genius | 25 | 228.99 | Y | clamshell | 2022-10-01 |
Indices may be used to get each sub data frame. Moreover, the keys
method returns keys, like indices, to efficiently look up the different sub data frames in gdf
. The levels for the keys are found in the property .load
, as with keys(gdf)[1].load
. Grouped data frames may be iterated over; the pairs
iterator iterates over both keys and values.
The select
and subset
functions work over GroupedDataFrame
objects; here we see the returned values are combined:
subset(gdf, :p => ByRow(<=(350)))
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Float64? | Cat… | Cat… | Date | |
1 | Minaal 3.0 | 35 | 349.99 | Y | front panel | 2022-09-01 |
2 | Genius | 25 | 228.99 | Y | clamshell | 2022-10-01 |
(The command [subset(g, :p => ByRow(<=(350))) for g in gdf]
avoids the combine
step.)
Group and combine
We can see that combine
also works over GroupedDataFrame
objects. This example shows how to find the mean city mileage of groups formed by the value of :Origin
in the cars
data set:
combine(groupby(cars, :Origin), :MPGCity => mean)
Row | Origin | MPGCity_mean |
---|---|---|
Cat… | Float64 | |
1 | USA | 20.9583 |
2 | non-USA | 23.8667 |
Typically, the function applied to a column is a reduction, which results in aggregated data.
By default the keys used to group (:Origin
) are kept by combine
. The keepkeys
argument for combine
when applied to grouped data controls this feature.
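The whole split-apply-combine round trip for this sort of aggregation can be sketched in Base (with the standard-library Statistics package for mean); the vectors below are made up for illustration:

```julia
using Statistics

origin = ["USA", "non-USA", "USA", "non-USA"]
mpg    = [20, 25, 22, 23]

# split on the key, apply mean to each piece, combine into key => value pairs
avgs = Dict(k => mean(mpg[origin .== k]) for k in unique(origin))
```

combine over a grouped data frame does the same thing, but returns a data frame with the key column(s) kept alongside the aggregated values.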
More than one column can be used to group the data. This example first groups by the number of cylinders, then groups by origin. For each of the 9 resulting groups, the group mean for city mileage is taken:
combine(groupby(cars, [:Cylinders, :Origin]), :MPGCity => mean)
Row | Cylinders | Origin | MPGCity_mean |
---|---|---|---|
Cat… | Cat… | Float64 | |
1 | 3 | non-USA | 39.3333 |
2 | 4 | USA | 24.4091 |
3 | 4 | non-USA | 25.2222 |
4 | 5 | non-USA | 18.5 |
5 | 6 | USA | 18.35 |
6 | 6 | non-USA | 18.5455 |
7 | 8 | USA | 17.0 |
8 | 8 | non-USA | 17.0 |
9 | rotary | non-USA | 17.0 |
The @chain
macro may make this more readable:
@chain cars begin
    groupby([:Cylinders, :Origin])
    combine(:MPGCity => mean)
    describe
end
Row | variable | mean | min | median | max | nmissing | eltype |
---|---|---|---|---|---|---|---|
Symbol | Union… | Any | Union… | Any | Int64 | DataType | |
1 | Cylinders | 3 | rotary | 0 | CategoricalValue{String, UInt8} | ||
2 | Origin | USA | non-USA | 0 | CategoricalValue{String, UInt8} | ||
3 | MPGCity_mean | 21.7067 | 17.0 | 18.5 | 39.3333 | 0 | Float64 |
Example 3.1 (Lego sets) For an example, we download a data set from the internet describing all the sets of Legos sold during some time period. This data set is part of the data sets provided by openintro.org
, an organization trying to make interesting educational products that are free and transparent.
csv_data = download("https://www.openintro.org/data/csv/lego_population.csv")
legos = CSV.read(csv_data, DataFrame)
first(legos, 2)
Row | item_number | set_name | theme | pieces | price | amazon_price | year | ages | pages | minifigures | packaging | weight | unique_pieces | size |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Int64 | String | String31 | String7 | String31 | String31 | Int64 | String15 | String7 | String3 | String31 | String31 | String7 | String7 | |
1 | 41916 | Extra Dots - Series 2 | DOTS | 109 | 3.99 | 3.44 | 2020 | Ages_6+ | NA | NA | Foil pack | NA | 6 | Small |
2 | 41908 | Extra Dots - Series 1 | DOTS | 109 | 3.99 | 3.99 | 2020 | Ages_6+ | NA | NA | Foil pack | NA | 6 | Small |
Let’s explore this data.
The size of the data set is printed as part of the display, though suppressed above, as only the first 2 rows were requested to be shown; it can also be found programmatically with size(legos).
size(legos)
(1304, 14)
Each row is a product; there are a lot of different ones.
The types identified by CSV.read
are not perfect. The data uses NA
for not-available. We read again using the argument missingstring="NA"
:
legos = CSV.read(csv_data, DataFrame; missingstring="NA")
first(legos, 2) # show first 2 rows
Row | item_number | set_name | theme | pieces | price | amazon_price | year | ages | pages | minifigures | packaging | weight | unique_pieces | size |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Int64 | String | String31? | Int64? | Float64? | Float64? | Int64 | String15 | Int64? | Int64? | String31? | String31? | Int64? | String7? | |
1 | 41916 | Extra Dots - Series 2 | DOTS | 109 | 3.99 | 3.44 | 2020 | Ages_6+ | missing | missing | Foil pack | missing | 6 | Small |
2 | 41908 | Extra Dots - Series 1 | DOTS | 109 | 3.99 | 3.99 | 2020 | Ages_6+ | missing | missing | Foil pack | missing | 6 | Small |
The :theme
and :packaging
variables are categorical, so we make them so:
legos.theme = categorical(legos.theme)
legos.packaging = categorical(legos.packaging);
first(legos, 2)
Row | item_number | set_name | theme | pieces | price | amazon_price | year | ages | pages | minifigures | packaging | weight | unique_pieces | size |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Int64 | String | Cat…? | Int64? | Float64? | Float64? | Int64 | String15 | Int64? | Int64? | Cat…? | String31? | Int64? | String7? | |
1 | 41916 | Extra Dots - Series 2 | DOTS | 109 | 3.99 | 3.44 | 2020 | Ages_6+ | missing | missing | Foil pack | missing | 6 | Small |
2 | 41908 | Extra Dots - Series 1 | DOTS | 109 | 3.99 | 3.99 | 2020 | Ages_6+ | missing | missing | Foil pack | missing | 6 | Small |
We see data on the number of pieces and the age range. Is there some relationship?
The age variable should be an ordered factor, ordered by the youngest age intended for use. The :ages
variable has values following the pattern Ages_start…
, where the starting age is followed either by +
(to indicate that age or up) or by a dash and an ending age (to indicate a range). We use this to identify the youngest intended age.
# alternative to
# `(m = match(r"Ages_(\d+)", x); m === nothing ? missing : parse(Int, m.captures[1]))`
function pick_first_age(x)
    ismissing(x) && return missing
    nos = collect(string(0:9...))
    out = ""
    for xi in x[6:end] # drop Ages_
        !(xi in nos) && break
        out = out * xi
    end
    out == "" && return missing
    parse(Int, out)
end
transform!(legos, :ages => ByRow(pick_first_age) => :youngest_age)
legos.youngest_age = categorical(legos.youngest_age, ordered=true)
first(legos[:, r"age"], 2)
Row | ages | pages | youngest_age |
---|---|---|---|
String15 | Int64? | Cat…? | |
1 | Ages_6+ | missing | 6 |
2 | Ages_6+ | missing | 6 |
With that ordering, an expected pattern becomes clear – kits for older users have on average more pieces – though there are unexpected exceptions:
@chain legos begin
    groupby(:youngest_age)
    combine([:pieces, :unique_pieces] .=> mean∘skipmissing
                                      .=> [:avg_pieces, :avg_unique_pieces])
end
Row | youngest_age | avg_pieces | avg_unique_pieces |
---|---|---|---|
Cat…? | Float64 | Float64 | |
1 | missing | 127.592 | 40.5604 |
2 | 1 | 37.2083 | 26.2174 |
3 | 2 | 55.617 | 31.2979 |
4 | 3 | 140.0 | 41.0 |
5 | 4 | 194.74 | 87.2857 |
6 | 5 | 147.712 | 85.1557 |
7 | 6 | 191.079 | 87.1289 |
8 | 7 | 265.556 | 108.816 |
9 | 8 | 491.193 | 173.801 |
10 | 9 | 859.182 | 252.773 |
11 | 10 | 417.52 | 109.768 |
12 | 11 | 2357.86 | 214.143 |
13 | 12 | 1307.29 | 256.4 |
14 | 14 | 1617.25 | 349.5 |
15 | 16 | 2454.79 | 409.107 |
16 | 18 | 2073.84 | 214.294 |
(We composed the two functions skipmissing
with mean
using the \circ[tab]
operator and use .=>
in both places to avoid defining the transformation function twice.)
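Composition with ∘ is plain Julia; the composed function strips the missing values first, then averages what remains:

```julia
using Statistics   # standard-library package providing `mean`

avg = mean ∘ skipmissing    # equivalent to x -> mean(skipmissing(x))
avg([1, missing, 3])
```

Composing small, named functions this way keeps mini-language specifications short and reusable.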
Lego was the largest toy company in the world by 2021. By 2015 it had produced over 600 billion parts.
Lego is a mature company with an origin dating to 1934. The number of products per year should be fairly stable, though it is easy enough to check. The data set has a :year
variable, so we would only need to group the data by that, then count the number of cases per each group. The nrow
function will count the cases. This function is special cased in the DataFrames
’ mini language and can appear by itself:
combine(groupby(legos, :year), nrow) # use `nrow => :n`, or some such, to label differently
Row | year | nrow |
---|---|---|
Int64 | Int64 | |
1 | 2018 | 442 |
2 | 2019 | 423 |
3 | 2020 | 439 |
We can take other summaries by year, for example, here we find the average price by year and size:
@chain legos begin
    groupby([:year, :size])
    combine(nrow => :n,
            :price => mean∘skipmissing => :avg_price)
end
Row | year | size | n | avg_price |
---|---|---|---|---|
Int64 | String7? | Int64 | Float64 | |
1 | 2018 | Small | 324 | 45.7792 |
2 | 2018 | Large | 21 | 28.59 |
3 | 2018 | missing | 97 | 25.8135 |
4 | 2019 | Small | 322 | 50.4663 |
5 | 2019 | Large | 14 | 29.6329 |
6 | 2019 | missing | 87 | 13.8536 |
7 | 2020 | Small | 335 | 49.8104 |
8 | 2020 | Large | 18 | 33.5456 |
9 | 2020 | missing | 86 | 21.6959 |
The list of youngest ages is a bit long. We wish to collapse it into the ranges [1, 6), [6, 12), and [12, 18]. The data has been stored as a categorical array. The cut
function can group numeric data easily, but to group categorical data, the cut points must be promoted to CategoricalValue
s. We have the following, where Ref
is used to stop the broadcasting over that value:
vals = CategoricalValue.([1, 6, 12, 18], Ref(legos.youngest_age))
transform!(legos,
           :youngest_age => (x -> cut(x, vals; extend=true))
                         => :youngest_age_range);
We now check whether the number of pieces has changed dramatically over the years. We break up the data by the age range, as we expect variation in that value, so it is best to dis-aggregate over that factor:
@chain legos begin
groupby([:youngest_age_range, :year])
combine(nrow => :n,
:pieces => mean∘skipmissing => :avg_pieces)
end
Row | youngest_age_range | year | n | avg_pieces |
---|---|---|---|---|
Cat…? | Int64 | Int64 | Float64 | |
1 | missing | 2018 | 33 | 165.379 |
2 | missing | 2019 | 34 | 95.7778 |
3 | missing | 2020 | 44 | 111.8 |
4 | [1, 6) | 2018 | 98 | 131.561 |
5 | [1, 6) | 2019 | 92 | 122.533 |
6 | [1, 6) | 2020 | 92 | 153.413 |
7 | [6, 12) | 2018 | 292 | 341.866 |
8 | [6, 12) | 2019 | 275 | 389.945 |
9 | [6, 12) | 2020 | 276 | 365.887 |
10 | [12, 18] | 2018 | 19 | 1956.21 |
11 | [12, 18] | 2019 | 22 | 1938.95 |
12 | [12, 18] | 2020 | 27 | 2111.3 |
No real pattern shows up.
There are many different themes. The command unique(legos.theme)
shows 41. Which are the most popular? To see, the data is grouped by the theme, each group is counted, missing themes are dropped, then the data frame is sorted and the top 5 rows are shown:
@chain legos begin
groupby(:theme)
combine(nrow => :n)
dropmissing
sort(:n; rev=true)
first(5)
end
Row | theme | n |
---|---|---|
Cat… | Int64 | |
1 | Star Wars™ | 119 |
2 | Friends | 103 |
3 | City | 101 |
4 | NINJAGO® | 78 |
5 | DUPLO® | 53 |
The SplitApplyCombine package
There is support for the split-apply-combine paradigm outside of the DataFrames
package in the package SplitApplyCombine
. This package is much lighter weight than DataFrames
This support targets other representations of tabular data, such as a vector of named tuples:
using SplitApplyCombine
tbl = [
    (b = "Goruck GR2", v = 40, p = 395, lap = "Y", load = "front panel", d = Date("2022-09-01")),
    (b = "Minaal 3.0", v = 35, p = 349, lap = "Y", load = "front panel", d = Date("2022-09-01")),
    (b = "Genius ", v = 25, p = 228, lap = "Y", load = "clamshell", d = Date("2022-10-01"))
]
3-element Vector{@NamedTuple{b::String, v::Int64, p::Int64, lap::String, load::String, d::Date}}:
(b = "Goruck GR2", v = 40, p = 395, lap = "Y", load = "front panel", d = Date("2022-09-01"))
(b = "Minaal 3.0", v = 35, p = 349, lap = "Y", load = "front panel", d = Date("2022-09-01"))
(b = "Genius ", v = 25, p = 228, lap = "Y", load = "clamshell", d = Date("2022-10-01"))
Many of the main verbs are generic functions: filter
, map
, reduce
; group
does grouping. There are some others. A few examples follow.
For indexing this structure, we can’t index by `[row, col]`; rather, we use `[row][col]` notation: the first bracket extracts the row, the second accesses the entries in the column. For example, here we index a selected row by position, by name, and then with multiple names:
tbl[2][2], tbl[2].v, tbl[2][ [:v, :p] ]
(35, 35, (v = 35, p = 349))
The invert
function reverses the access pattern, allowing in this case, [col][row]
access, with:
itbl = invert(tbl)
itbl[3][1], itbl[:p][2]
(395, 349)
In some ways, the invert
function flips the viewpoint from an array of structs to a struct of arrays.
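The flip that `invert` performs can be sketched by hand in base Julia (the names `aos` and `soa` are ours, and the fields are abbreviated):

```julia
# An "array of structs": one named tuple per row.
aos = [(b = "GR2", v = 40), (b = "3.0", v = 35)]

# A "struct of arrays": one vector per field, built by hand.
soa = (; (k => [r[k] for r in aos] for k in keys(first(aos)))...)

soa.v  # column-first access: [40, 35]
```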
To filter out rows, we have the same calling style as with data frames: filter(pred, tbl)
. With the outer container of tbl
being an array, no special method is needed; iteration is over each row. For example,
filter(r -> r.v >= 35, tbl) |> DataFrame
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Int64 | String | String | Date | |
1 | Goruck GR2 | 40 | 395 | Y | front panel | 2022-09-01 |
2 | Minaal 3.0 | 35 | 349 | Y | front panel | 2022-09-01 |
(The call to DataFrame
is to take advantage of the prettier printing. DataFrame
can consume a vector of named tuples for its input.)
The DataPipes
package composes a bit more nicely with this approach, but we continue using Chain
for examples3, like this one of nested calls to filter
:
@chain tbl begin
    filter(r -> r.v >= 35, _)
    filter(r -> r.p >= 350, _)
    DataFrame
end
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Int64 | String | String | Date | |
1 | Goruck GR2 | 40 | 395 | Y | front panel | 2022-09-01 |
Selecting columns and adding a transformation can be achieved through map
, which iterates over each named tuple in this example:
map(r -> (b=r.b, v=r.v, p=r.p, pv = round(r.p/r.v; digits=2)), tbl) |>
DataFrame
Row | b | v | p | pv |
---|---|---|---|---|
String | Int64 | Int64 | Float64 | |
1 | Goruck GR2 | 40 | 395 | 9.88 |
2 | Minaal 3.0 | 35 | 349 | 9.97 |
3 | Genius | 25 | 228 | 9.12 |
The group
function can group by values of a function call. This example from the package’s Readme is instructive:
nms = ["Andrew Smith", "John Smith", "Alice Baker",
       "Robert Baker", "Jane Smith", "Jason Bourne"]
group(last, split.(nms))
3-element Dictionaries.Dictionary{SubString{String}, Vector{Vector{SubString{String}}}}
"Smith" │ Vector{SubString{String}}[["Andrew", "Smith"], ["John", "Smith"], […
"Baker" │ Vector{SubString{String}}[["Alice", "Baker"], ["Robert", "Baker"]]
"Bourne" │ Vector{SubString{String}}[["Jason", "Bourne"]]
The names are split on spaces, and the keys of the group are determined by the last
function, which here gets the last piece after splitting (aka the “last” name, but by coincidence, not intelligence).
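The key-extraction step is plain `Base` behavior, seen in isolation:

```julia
nm = split("Jane Smith")  # the SubString pieces: ["Jane", "Smith"]

# `last` and `first` pick off the ends of the split vector.
last(nm), first(nm)       # ("Smith", "Jane")
```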
For numeric operations, this is quite convenient. Here we group
by the remainder upon division by 3
:
group(x -> rem(x,3), 1:10)
3-element Dictionaries.Dictionary{Int64, Vector{Int64}}
1 │ [1, 4, 7, 10]
2 │ [2, 5, 8]
0 │ [3, 6, 9]
The grouping basically creates a dictionary and adds to the key determined by the function.
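That dictionary-building step can be written out directly; a minimal sketch in base Julia (`group_by` is our name, chosen to avoid clashing with `SplitApplyCombine.group`):

```julia
# A bare-bones grouping: key each value by f(x) and push it onto
# the vector stored under that key, creating the vector on demand.
function group_by(f, xs)
    d = Dict{Any,Vector{eltype(xs)}}()
    for x in xs
        push!(get!(d, f(x), eltype(xs)[]), x)
    end
    d
end

group_by(x -> rem(x, 3), 1:10)  # 1 => [1,4,7,10], 2 => [2,5,8], 0 => [3,6,9]
```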
To apply a function to the values of a base Julia
dictionary is not directly supported by map
, however the dictionary used by group
(from the Dictionaries
packages) does allow this, so we could, for example, add up the values in each group with:
map(sum, group(x -> rem(x,3), 1:10))
3-element Dictionaries.Dictionary{Int64, Int64}
1 │ 22
2 │ 15
0 │ 18
The group
function can do this in one call, by placing the function to apply in between the by
function and the table. Here we apply first
to pick off the first name. That is, ["Andrew", "Smith"]
becomes just "Andrew"
, the first element of that vector.
gps = group(last, first, split.(nms))
3-element Dictionaries.Dictionary{SubString{String}, Vector{SubString{String}}}
"Smith" │ SubString{String}["Andrew", "John", "Jane"]
"Baker" │ SubString{String}["Alice", "Robert"]
"Bourne" │ SubString{String}["Jason"]
Example 3.2 (A tally function) The Tally
package provides a quick way to tally up numbers. Here we do a simplified version.
To tally, we have three steps: group by each value, count the number in each group, and sort by key. This is implemented with this composition of functions:
tally(v; by=identity) = SplitApplyCombine.sortkeys(map(length, group(by, v)))
tally (generic function with 1 method)
To see, we have:
v = rand(1:5, 50)
d = tally(v)
5-element Dictionaries.Dictionary{Int64, Int64}
1 │ 9
2 │ 9
3 │ 11
4 │ 9
5 │ 12
Using map
, we can provide an alternate, oft used view for tallying:
tallies = "\u007C"^4
ftally = "┼┼┼┼ "
function prison_count(x)
    d, r = divrem(x, 5)
    ftally^d * tallies[1:r]
end
map(prison_count, d)
5-element Dictionaries.Dictionary{Int64, Any}
1 │ "┼┼┼┼ ||||"
2 │ "┼┼┼┼ ||||"
3 │ "┼┼┼┼ ┼┼┼┼ |"
4 │ "┼┼┼┼ ||||"
5 │ "┼┼┼┼ ┼┼┼┼ ||"
Example 3.3 (A simple stem and leaf diagram) The grouping and sorting made quite accessible by SplitApplyCombine could also be used to produce a simple stem-and-leaf type diagram. Here we don’t bother displaying stems with no associated leaves, which does change the interpretation of spread in the data. They could easily be added in, but doing so requires some manipulation of the display and takes us a bit afield:
x = rand(0:100, 32)   # 32 test scores
stem_unit = 10        # stem * 10 is value of stem
stem_unit @chain x begin
round.(Int, 10*_/stem_unit)
group(x -> x ÷ 10, x->rem(x,10), _)
SplitApplyCombine.sortkeys(_)
SplitApplyCombine.map(sort, _)
map(x -> parse(Int,join(x)), _)
end
10-element Dictionaries.Dictionary{Int64, Int64}
0 │ 1139
1 │ 78
2 │ 37
3 │ 478
4 │ 1489
5 │ 468
6 │ 233
7 │ 99
8 │ 55
9 │ 2478
Let’s return to the cars
data and use group
to identify the average mileage per type. For this task, we group by :Type
and then apply a function to extract the mileage from a row. The copy.(eachrow(cars))
command creates a vector of named tuples.
cs = copy.(eachrow(cars))  # Can also call `NamedTuple` for `copy`
gps = group(r -> r.Type, r -> r.MPGCity, cs)
6-element Dictionaries.Dictionary{CategoricalValue{String, UInt8}, Vector{Int32}}
CategoricalValue{String, UInt8} "Smal… │ Int32[25, 29, 23, 29, 31, 23, 46, 42,…
CategoricalValue{String, UInt8} "Mids… │ Int32[18, 19, 22, 22, 19, 16, 21, 21,…
CategoricalValue{String, UInt8} "Comp… │ Int32[20, 25, 25, 23, 22, 22, 24, 26,…
CategoricalValue{String, UInt8} "Larg… │ Int32[19, 16, 16, 17, 20, 20, 20, 18,…
CategoricalValue{String, UInt8} "Spor… │ Int32[19, 17, 18, 22, 24, 30, 24, 26,…
CategoricalValue{String, UInt8} "Van" │ Int32[18, 15, 17, 15, 18, 17, 18, 18,…
We now can map mean
over each group:
ms = map(mean, gps)
6-element Dictionaries.Dictionary{CategoricalValue{String, UInt8}, Float64}
CategoricalValue{String, UInt8} "Small" │ 29.857142857142858
CategoricalValue{String, UInt8} "Midsize" │ 19.545454545454547
CategoricalValue{String, UInt8} "Compact" │ 22.6875
CategoricalValue{String, UInt8} "Large" │ 18.363636363636363
CategoricalValue{String, UInt8} "Sporty" │ 21.785714285714285
CategoricalValue{String, UInt8} "Van" │ 17.0
This requires two passes through the grouped data, the first to get just the mileage values, the next to call mean
. The following DataFrames
syntax hides this:
combine(groupby(cars, :Type), :MPGHighway => mean)
Row | Type | MPGHighway_mean |
---|---|---|
Cat… | Float64 | |
1 | Compact | 29.875 |
2 | Large | 26.7273 |
3 | Midsize | 26.7273 |
4 | Small | 35.4762 |
5 | Sporty | 28.7857 |
6 | Van | 21.8889 |
Reductions, like sum and count, which can be written easily using the higher-order function reduce
, have support to combine the grouping and reduction. The groupreduce
function takes the additional arguments of reduce
(an operation and an optional init
value). For example, to get the group sum, we could call:
groupreduce(r -> r.Type, r -> r.MPGCity, +, cs)
6-element Dictionaries.Dictionary{CategoricalValue{String, UInt8}, Int32}
CategoricalValue{String, UInt8} "Small" │ 627
CategoricalValue{String, UInt8} "Midsize" │ 430
CategoricalValue{String, UInt8} "Compact" │ 363
CategoricalValue{String, UInt8} "Large" │ 202
CategoricalValue{String, UInt8} "Sporty" │ 305
CategoricalValue{String, UInt8} "Van" │ 153
The group sum and group count have shortcuts, so the mean could have been computed as:
groupsum(r -> r.Type, r -> r.MPGCity, cs) ./ groupcount(r -> r.Type, cs)
6-element Dictionaries.Dictionary{CategoricalValue{String, UInt8}, Float64}
CategoricalValue{String, UInt8} "Small" │ 29.857142857142858
CategoricalValue{String, UInt8} "Midsize" │ 19.545454545454547
CategoricalValue{String, UInt8} "Compact" │ 22.6875
CategoricalValue{String, UInt8} "Large" │ 18.363636363636363
CategoricalValue{String, UInt8} "Sporty" │ 21.785714285714285
CategoricalValue{String, UInt8} "Van" │ 17.0
(Unlike for a Dict
instance, the dictionaries above allow such division.)
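With a base `Dict`, the same key-wise division could instead be done with `mergewith`; a sketch using two of the group sums and counts from above:

```julia
# Group sums and counts for two of the car types (values from the text).
sums   = Dict("Small" => 627, "Van" => 153)
counts = Dict("Small" => 21,  "Van" => 9)

# `mergewith(op, d1, d2)` combines values sharing a key with `op`.
means = mergewith(/, sums, counts)  # "Small" => 29.857…, "Van" => 17.0
```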
The “combine” part of the paradigm takes the pieces and reassembles. The data frames example returns a data frame, whereas with SplitApplyCombine.jl
we have a dictionary. A dictionary does not map directly to a data frame. While both dictionaries and named tuples are associative arrays, we wouldn’t necessarily want to map our data to a row. Rather, here we use the pairs
iterator to iterate over the keys and values to produce an array of named tuples:
[(; Type, MPG) for (Type, MPG) in pairs(ms)] |> DataFrame
Row | Type | MPG |
---|---|---|
Cat… | Float64 | |
1 | Small | 29.8571 |
2 | Midsize | 19.5455 |
3 | Compact | 22.6875 |
4 | Large | 18.3636 |
5 | Sporty | 21.7857 |
6 | Van | 17.0 |
3.1.10 Aggregating data
Somewhat inverse to flattening data is to aggregate data, by which we mean combining multiple columns into a single one, or combining multiple rows into a single one. The select
and combine
methods can do these tasks.
Consider the three price variables in the cars
data. We can find an average of the three price numbers easily enough. This helper function accepts its values as separate arguments, not in a container:
_mean(xs...) = mean(xs)
_mean (generic function with 1 method)
We can then use select
to aggregate the three price variables. In the following, these are selected with a regular expression:
cars_price = select(cars, 1:3, r"Price" => ByRow(_mean) => :price)
first(cars_price, 3)
Row | Manufacturer | Model | Type | price |
---|---|---|---|---|
Cat… | String | Cat… | Float64 | |
1 | Acura | Integra | Small | 15.8667 |
2 | Acura | Legend | Midsize | 33.9333 |
3 | Audi | 90 | Compact | 29.1 |
To aggregate over rows, the groupby
construct can be used to specify the groupings to aggregate over. In the following, this is the car type. The Chain
style is used to make the steps more explicit:
@chain cars_price begin
groupby(:Type)
combine(:price => mean => :avg_price)
end
Row | Type | avg_price |
---|---|---|
Cat… | Float64 | |
1 | Compact | 18.2104 |
2 | Large | 24.303 |
3 | Midsize | 27.2152 |
4 | Small | 10.1667 |
5 | Sporty | 19.4024 |
6 | Van | 19.1111 |
As mean
is a reduction, the result has one row for each unique set of levels in the column selector specified to groupby
. Again, that these columns appear in the output is due to the chosen default value of `keepkeys`.
This similar example shows grouping by two variables and then aggregating miles per gallon over each:
@chain cars begin
groupby([:Type, :DriveTrain])
combine(:MPGHighway => mean => :avg_MPG)
sort(:avg_MPG, rev=true)
end
Row | Type | DriveTrain | avg_MPG |
---|---|---|---|
Cat… | Cat… | Float64 | |
1 | Small | Front | 35.6842 |
2 | Small | 4WD | 33.5 |
3 | Sporty | Front | 30.5714 |
4 | Compact | Front | 30.0769 |
5 | Compact | 4WD | 30.0 |
6 | Compact | Rear | 28.5 |
7 | Large | Front | 27.2857 |
8 | Midsize | Front | 27.1765 |
9 | Sporty | 4WD | 27.0 |
10 | Sporty | Rear | 27.0 |
11 | Large | Rear | 25.75 |
12 | Midsize | Rear | 25.2 |
13 | Van | Front | 22.5 |
14 | Van | 4WD | 21.4 |
3.1.11 Tidy data; stack, unstack
The split-apply-combine paradigm suggests that storing data so that it can be split readily is preferable to other ways of storage. The notion of “tidy data” (Wickham 2014), as presented in R’s tidyr
package and described in this vignette, holds that data sets should follow:
- Every column is a variable.
- Every row is an observation.
- Every cell is a single value.
These terms are defined by
A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Values are organised in two ways. Every value belongs to a variable and an observation. A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes.
In addition to flatten
, the DataFrames
package provides the functions stack
and unstack
for tidying data.
The stack
and unstack
commands are used to move from “wide” format, where related values are stored in multiple columns with column names representing some level, to “long” format, where related values are stored in one column, and their categorical levels in another column.
For example, the following data frame is in wide format, as the attributes are in columns `:A` and `:B`; the values are in columns `:C` and `:D`.
wide = DataFrame(A=["A","B"], B=["a","b"], C=["c1","c2"], D=["d1","d2"])
Row | A | B | C | D |
---|---|---|---|---|
String | String | String | String | |
1 | A | a | c1 | d1 |
2 | B | b | c2 | d2 |
With the stack
function, the values are placed into a single column (by default :value
) and the variable names placed into a single column (by default :variable
):
long = stack(wide, [:C, :D])
Row | A | B | variable | value |
---|---|---|---|---|
String | String | String | String | |
1 | A | a | C | c1 |
2 | B | b | C | c2 |
3 | A | a | D | d1 |
4 | B | b | D | d2 |
The values in the :variable
column come from the data frame names.
With unstack
the data can be spread back out into “wide” format. We pass in the two columns indicating the variable and the values to unstack
:
unstack(long, :variable, :value)
Row | A | B | C | D |
---|---|---|---|---|
String | String | String? | String? | |
1 | A | a | c1 | d1 |
2 | B | b | c2 | d2 |
This figure may help visualize the process.
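The wide-to-long mechanics can also be sketched by hand in base Julia with named tuples (an illustration of the idea, not how `stack` is implemented):

```julia
# The `wide` table as a vector of named tuples.
wide_rows = [(A = "A", B = "a", C = "c1", D = "d1"),
             (A = "B", B = "b", C = "c2", D = "d2")]

# One long row per (measured column, wide row) pair: the column name
# becomes the :variable entry, its cell becomes the :value entry.
long_rows = [(A = r.A, B = r.B, variable = String(v), value = r[v])
             for v in (:C, :D) for r in wide_rows]
```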
We illustrate this further with an abbreviated version of an example in the tidyr
vignette.
Consider this data from the Global Historical Climatology Network for one weather station (MX17004) in Mexico for five months in 2010. The full data has 31 days per month, this abbreviated data works with just the first 8:
weather = """
"id","year","month","element","d1","d2","d3","d4","d5","d6","d7","d8"
"MX17004","2010","1","tmax","---","---","---","---","---","---","---","---"
"MX17004","2010","1","tmin","---","---","---","---","---","---","---","---"
"MX17004","2010","2","tmax","---","27.3","24.1","---","---","---","---","---"
"MX17004","2010","2","tmin","---","14.4","14.4","---","---","---","---","---"
"MX17004","2010","3","tmax","---","---","---","---","32.1","---","---","---"
"MX17004","2010","3","tmin","---","---","---","---","14.2","---","---","---"
"MX17004","2010","4","tmax","---","---","---","---","---","---","---","---"
"MX17004","2010","4","tmin","---","---","---","---","---","---","---","---"
"MX17004","2010","5","tmax","---","---","---","---","---","---","---","---"
"MX17004","2010","5","tmin","---","---","---","---","---","---","---","---"
"""
w = CSV.read(IOBuffer(weather), DataFrame; missingstring="---")
first(w, 4)
Row | id | year | month | element | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
String7 | Int64 | Int64 | String7 | Missing | Float64? | Float64? | Missing | Float64? | Missing | Missing | Missing | |
1 | MX17004 | 2010 | 1 | tmax | missing | missing | missing | missing | missing | missing | missing | missing |
2 | MX17004 | 2010 | 1 | tmin | missing | missing | missing | missing | missing | missing | missing | missing |
3 | MX17004 | 2010 | 2 | tmax | missing | 27.3 | 24.1 | missing | missing | missing | missing | missing |
4 | MX17004 | 2010 | 2 | tmin | missing | 14.4 | 14.4 | missing | missing | missing | missing | missing |
This example is said to have variables stored in both rows and columns. The first step to tidying the data is to stack days 1 through 8 into a column. We use the Between
column selector to select the columns for d1
through d8
(an alternative to, say, r"^d"
):
w1 = @chain w begin
    stack(Between(:d1, :d8); variable_name = :day)
filter(:value => !ismissing, _)
end
Row | id | year | month | element | day | value |
---|---|---|---|---|---|---|
String7 | Int64 | Int64 | String7 | String | Float64? | |
1 | MX17004 | 2010 | 2 | tmax | d2 | 27.3 |
2 | MX17004 | 2010 | 2 | tmin | d2 | 14.4 |
3 | MX17004 | 2010 | 2 | tmax | d3 | 24.1 |
4 | MX17004 | 2010 | 2 | tmin | d3 | 14.4 |
5 | MX17004 | 2010 | 3 | tmax | d5 | 32.1 |
6 | MX17004 | 2010 | 3 | tmin | d5 | 14.2 |
As in the example being followed, missing values are skipped, leaving the reader to infer the lack of a measurement on a given day.
The day is coded with a leading “d” and as a string, not an integer. We can strip this prefix out using a regular expression or, as here, string indexing, and then parse the result to an integer:
transform!(w1, :day => ByRow(x -> parse(Int, x[2:end])) => :day)
Row | id | year | month | element | day | value |
---|---|---|---|---|---|---|
String7 | Int64 | Int64 | String7 | Int64 | Float64? | |
1 | MX17004 | 2010 | 2 | tmax | 2 | 27.3 |
2 | MX17004 | 2010 | 2 | tmin | 2 | 14.4 |
3 | MX17004 | 2010 | 2 | tmax | 3 | 24.1 |
4 | MX17004 | 2010 | 2 | tmin | 3 | 14.4 |
5 | MX17004 | 2010 | 3 | tmax | 5 | 32.1 |
6 | MX17004 | 2010 | 3 | tmin | 5 | 14.2 |
Finally, the element
column does not hold a variable; rather, it stores the names of two variables (tmin
and tmax
). For this, the data is “unstacked,” splitting the longer variable value
in two:
w2 = unstack(w1, :element, :value)
Row | id | year | month | day | tmax | tmin |
---|---|---|---|---|---|---|
String7 | Int64 | Int64 | Int64 | Float64? | Float64? | |
1 | MX17004 | 2010 | 2 | 2 | 27.3 | 14.4 |
2 | MX17004 | 2010 | 2 | 3 | 24.1 | 14.4 |
3 | MX17004 | 2010 | 3 | 5 | 32.1 | 14.2 |
As described in the vignette, now each variable is in one column, and each row describes one day.
3.1.12 Joins
Consider the three following assessments of some class. The students have somewhat spotty attendance:
t1 = """
first,last,points
Alice, Smith, 100
Bob, Jones, 200
"""
t2 = """
first,last,score
Carol, Gates, 125
Chad, Gad, 200
Bob, Jones, 225
"""
t3 = """
first,last,points
Bob, Jones, 300
Carol, Gates, 150
Erin, Dan, 100
"""
a1, a2, a3 = [CSV.read(IOBuffer(txt), DataFrame) for txt ∈ (t1, t2, t3)];
We have several ways to combine these three data frames.
We can concatenate them using vcat
. Before doing so, we need to align the column names, as the second data set used a different name:
select!(a2, 1:2, :score => :points);
The vcat
method for data frames takes a named argument source
allowing a variable to be appended indicating the source of the data frame. The following will indicate with a 1
, 2
, or 3
what data frame contributed what row:
aₛ = vcat(a1, a2, a3; source=:assignment)
describe(aₛ)
Row | variable | mean | min | median | max | nmissing | eltype |
---|---|---|---|---|---|---|---|
Symbol | Union… | Any | Union… | Any | Int64 | DataType | |
1 | first | Alice | Erin | 0 | String7 | ||
2 | last | Dan | Smith | 0 | String7 | ||
3 | points | 175.0 | 100 | 175.0 | 300 | 0 | Int64 |
4 | assignment | 2.125 | 1 | 2.0 | 3 | 0 | Int64 |
Similarly, append!
could be used in combination with reduce
(as in reduce(append!, (a1, a2, a3))
).
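A caution with that idiom: `append!` mutates its first argument, so the reduction folds everything into the leftmost collection. A base-Julia sketch, using an `init` copy to leave the originals alone (the vectors here stand in for the data frames):

```julia
v1, v2, v3 = [1, 2], [3], [4, 5]

# Fold v2 and v3 onto a copy of v1; without `copy`, v1 itself grows.
all_rows = reduce(append!, (v2, v3); init = copy(v1))  # [1, 2, 3, 4, 5]
```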
A more common alternative to the above is to join the data frames using one of several join commands. Joins allow multiple tables to be merged, or mashed, into one. We briefly illustrate the implementations in DataFrames
of outer, left, right, inner, semi, and anti joins, leaving the cross join and further details to the documentation.
We return to the original data, with the differently named columns:
a1, a2, a3 = [CSV.read(IOBuffer(txt), DataFrame) for txt ∈ (t1, t2, t3)];
The first join we introduce is outerjoin
. It combines two data frames with rows for keys found in any of the data frames passed in.
outerjoin(a1, a2, on=["first", "last"])
Row | first | last | points | score |
---|---|---|---|---|
String7 | String7 | Int64? | Int64? | |
1 | Bob | Jones | 200 | 225 |
2 | Alice | Smith | 100 | missing |
3 | Carol | Gates | missing | 125 |
4 | Chad | Gad | missing | 200 |
The on
argument is a column name (or names as above) to join the data frames on. They need not have the same name; the values passed to on
may be pairs indicating the mapping. (That is, were we joining on the final column of each, on=:points => :score
would be possible.)
The variable to join on is often called the “key,” though in Julia
that terminology is also used with associative arrays.
The joined data frame includes all columns from either data frame not specified in on
. The resulting data frame, in this case, is in a wide, not tidy, format.
In the above, the columns do not have names in conflict. However, were a1
and a3
joined, they would have been. Passing makeunique=true
instructs distinguishing names to be produced:
outerjoin(a1, a3, on=["first", "last"], makeunique=true)
Row | first | last | points | points_1 |
---|---|---|---|---|
String7 | String7 | Int64? | Int64? | |
1 | Bob | Jones | 200 | 300 |
2 | Alice | Smith | 100 | missing |
3 | Carol | Gates | missing | 150 |
4 | Erin | Dan | missing | 100 |
More than two data frames may be passed to outerjoin
, as illustrated below where all three are. The result includes a row for each person (key) in any of the data sets.
outerjoin(a1, a2, a3, on=["first", "last"], makeunique=true)
Row | first | last | points | score | points_1 |
---|---|---|---|---|---|
String7 | String7 | Int64? | Int64? | Int64? | |
1 | Bob | Jones | 200 | 225 | 300 |
2 | Carol | Gates | missing | 125 | 150 |
3 | Alice | Smith | 100 | missing | missing |
4 | Chad | Gad | missing | 200 | missing |
5 | Erin | Dan | missing | missing | 100 |
From the outerjoin
, we easily note that only the first listed student was present for all the assessments.
Three other joins related to outerjoin
are:
- leftjoin(df1, df2; on, ...) which includes all rows in the left data frame
- rightjoin(df1, df2; on, ...) which includes all rows in the right data frame (similar to leftjoin(df2, df1))
- innerjoin(df1, df2; on, ...) which includes all rows with keys in both data frames
To illustrate:
leftjoin(a1, a2, on=["first", "last"])
Row | first | last | points | score |
---|---|---|---|---|
String7 | String7 | Int64 | Int64? | |
1 | Bob | Jones | 200 | 225 |
2 | Alice | Smith | 100 | missing |
rightjoin(a1, a2, on=["first", "last"])
Row | first | last | points | score |
---|---|---|---|---|
String7 | String7 | Int64? | Int64 | |
1 | Bob | Jones | 200 | 225 |
2 | Carol | Gates | missing | 125 |
3 | Chad | Gad | missing | 200 |
In the following, we must qualify the function below, as SplitApplyCombine
has a conflicting use, though this isn’t typically needed:
DataFrames.innerjoin(a1, a2, on=["first", "last"])
Row | first | last | points | score |
---|---|---|---|---|
String7 | String7 | Int64 | Int64 | |
1 | Bob | Jones | 200 | 225 |
Joins have an analog in set operations: the outer join is like a union, the inner join is like an intersection.
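That analogy can be made concrete on the keys alone; a base-Julia sketch with hypothetical roster data:

```julia
left  = Dict("Bob" => 200, "Alice" => 100)  # name => points
right = Dict("Bob" => 225, "Carol" => 125)  # name => score

inner_keys = intersect(keys(left), keys(right))  # like innerjoin: keys in both
outer_keys = union(keys(left), keys(right))      # like outerjoin: keys in either
anti_keys  = setdiff(keys(left), keys(right))    # like antijoin: left-only keys
```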
The next two joins do not join columns, as the ones above do. Rather, they use the second data frame to filter out rows of the first data frame.
The semijoin
returns the rows of the left data frame that match a key of the right data frame:
semijoin(a1, a2[3:3, :]; on=["first", "last"])
Row | first | last | points |
---|---|---|---|
String7 | String7 | Int64 | |
1 | Bob | Jones | 200 |
An antijoin
includes rows in the left data frame that do not match a key in the right data frame (like the set difference). Here we see the row common to a1
and a2
is not included:
antijoin(a1, a2, on=["first", "last"])
Row | first | last | points |
---|---|---|---|
String7 | String7 | Int64 | |
1 | Alice | Smith | 100 |
There are other storage formats available within Julia that can contain more detail on the data, e.g. Arrow.↩︎

This combination of column selectors is also performed by Cols(1, :v, "p", r"^l").↩︎

The DataPipes package inserts a magical `_` in the last positional argument, which is common with higher-order functions like filter. The example could become @p tbl |> filter(_.v >= 25) |> filter(_.p >= 350), with the new lines replacing the pipe operation as a convenience with longer chains.↩︎