using StatsBase, StatsPlots
3 Tabular data
Some data sets are naturally stored in a tabular format, like a spreadsheet. This chapter looks at such data.
In the following, we will utilize these basic packages, among others introduced along the way:
3.1 Data frames
The basic framework for exploratory data analysis is a data set with 1 or more observations for \(n\) different cases. For each case (or observation) a variable will be a length-\(n\) data set, with values measuring the same type of quantity, or perhaps missing. The variables can be stored as vectors. The collection of variables is traditionally arranged in a rectangular manner with each row representing the measurements for a given case and each column representing the measured values for each variable. That is the columns are homogeneous (as they measure the same thing), but the rows need not be (as different variables are measuring different quantities).
Matrices contain rectangular data, but are homogeneous. As such, a new structure is needed to store tabular data in general. A data frame is a concept borrowed from S Plus where they were introduced in 1991. The R language has data frames as a basic built-in data container. In Julia
the structure is implemented in an external package, DataFrames
.
using DataFrames
The DataFrames
package dates back to 2012, at least, and must be one of Julia
’s oldest packages. Julia was on version 0.1
then. It is also under continual development and has a comprehensive set of documentation and tutorials. Bouchet-Valat and Kamiński (2023) has an excellent overview including comparisons to other implementations in other languages and context for certain design decisions. This introduction attempts to illustrate some basic usage; consult the provided documentation for additional details and applications.
3.1.1 Data frame construction
There are different ways to construct a data frame.
Consider the task of the Wirecutter in trying to select the best carry on travel bag. After compiling a list of possible models by scouring travel blogs etc., they select some criteria (capacity, compartment design, aesthetics, comfort, …) and compile data, similar to what one person collected in a spreadsheet. Here we create a much simplified spreadsheet for 3 listed bags with measurements of volume, price, laptop compatability, loading style, and a last-checked date – as this market improves constantly.
product v p l loads checked
Goruck GR2 40 395 Y front panel 2022-09
Minaal 3.0 35 349 Y front panel 2022-09
Genius 25 228 Y clamshell 2022-10
We see that product is a character, volume and price numeric, laptop compatability a Boolean value, load style one of a few levels, and the last checked date, a year-month date.
We create vectors to hold each. We load the CategoricalArrays
and Dates
packages for a few of the variables:
using CategoricalArrays, Dates
= ["Goruck GR2", "Minaal 3.0", "Genius"]
product = [40, 35, 25]
volume = [395, 349, 228]
price = categorical(["Y","Y","Y"])
laptop_compatability = categorical(["front panel", "front panel", "clamshell"])
loading_style = Date.(2022, [9,9,10]) date_checked
3-element Vector{Date}:
2022-09-01
2022-09-01
2022-10-01
With this, we use the DataFrame
constructor to combine these into one data set.
= DataFrame(product = product, volume=volume, price=price,
d "laptop compatability"=laptop_compatability,
var"loading style"=loading_style, var"date checked"=date_checked) var
Row | product | volume | price | laptop compatability | loading style | date checked |
---|---|---|---|---|---|---|
String | Int64 | Int64 | Cat… | Cat… | Date | |
1 | Goruck GR2 | 40 | 395 | Y | front panel | 2022-09-01 |
2 | Minaal 3.0 | 35 | 349 | Y | front panel | 2022-09-01 |
3 | Genius | 25 | 228 | Y | clamshell | 2022-10-01 |
The call to the constructor is like that of a named tuple and again, var"..."
is employed for names with spaces.
As called above, the constructor binds vectors of equal length into a new multi-variate container.
The describe
function will summarize a data frame using different summaries for different types:
describe(d)
Row | variable | mean | min | median | max | nmissing | eltype |
---|---|---|---|---|---|---|---|
Symbol | Union… | Any | Any | Any | Int64 | DataType | |
1 | product | Genius | Minaal 3.0 | 0 | String | ||
2 | volume | 33.3333 | 25 | 35.0 | 40 | 0 | Int64 |
3 | price | 324.0 | 228 | 349.0 | 395 | 0 | Int64 |
4 | laptop compatability | Y | Y | 0 | CategoricalValue{String, UInt32} | ||
5 | loading style | clamshell | front panel | 0 | CategoricalValue{String, UInt32} | ||
6 | date checked | 2022-09-01 | 2022-09-01 | 2022-10-01 | 0 | Date |
In the above construction, we repeated the names of the variables to the constructor. A style, also akin to the construction of named tuple, could also have been used where variables after a semicolon are implicitly assumed to have the desired name:
= DataFrame(; product, volume, price,
d "laptop compatability"=laptop_compatability,
var"loading style"=loading_style, var"date checked"=date_checked) var
Row | product | volume | price | laptop compatability | loading style | date checked |
---|---|---|---|---|---|---|
String | Int64 | Int64 | Cat… | Cat… | Date | |
1 | Goruck GR2 | 40 | 395 | Y | front panel | 2022-09-01 |
2 | Minaal 3.0 | 35 | 349 | Y | front panel | 2022-09-01 |
3 | Genius | 25 | 228 | Y | clamshell | 2022-10-01 |
(In fact, with the rename!
function, or other conveniences, it can be more direct to define the data frame, then rename the columns, avoiding the var
above.)
We see that the DataFrame
type displays values as we might expect with the column header showing the name of the variable and the storage type. The row numbers are added and may be used for reference to the data values.
There is a convention for attaching new columns to a data frame that allows another alternate construction:
= DataFrame() # empty data frame
d = product
d.product = volume
d.volume = price
d.price "laptop compatability" = laptop_compatability
d."loading style" = loading_style
d."date checked" = date_checked
d. d
Row | product | volume | price | laptop compatability | loading style | date checked |
---|---|---|---|---|---|---|
String | Int64 | Int64 | Cat… | Cat… | Date | |
1 | Goruck GR2 | 40 | 395 | Y | front panel | 2022-09-01 |
2 | Minaal 3.0 | 35 | 349 | Y | front panel | 2022-09-01 |
3 | Genius | 25 | 228 | Y | clamshell | 2022-10-01 |
This uses one of a few different ways to refer to a column in a data frame.
As another alternative, the insertcols!
function can insert multiple columns, for example the last one in the example, could have been written insertcols!(d, "date checked" => date_checked)
.
Before exploring how we might query the data frame, we note that an alternate means to describe the data set is by row, or case. For each row, we might use a named tuple with the names indicating the variable, and the values the measurement. The DataFrame
constructor would accept this, here we shorten the names and only do one row (for now):
= DataFrame([
d = "Goruck GR2", v = 40, p = 395, lap = "Y", load = "front panel", d = Date("2022-09-01"))
(b ])
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Int64 | String | String | Date | |
1 | Goruck GR2 | 40 | 395 | Y | front panel | 2022-09-01 |
Here, the variable types are different, as it takes more effort to make one categorical. The same way we defined a column allows us to redefine that column (replacing the container, not re-using it for storage). For example, we could do:
= categorical(d.lap)
d.lap = categorical(d.load) d.load
1-element CategoricalArray{String,1,UInt32}:
"front panel"
and then the two variables would be categorical.
There are other ways to define row by row:
- A data frame with empty variables can be constructed and then values as tuples may be
push!
ed onto the data frame
push!(d, ("Minaal 3.0", 35, 349, "Y", "front panel", Date("2022-09-01")))
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Int64 | Cat… | Cat… | Date | |
1 | Goruck GR2 | 40 | 395 | Y | front panel | 2022-09-01 |
2 | Minaal 3.0 | 35 | 349 | Y | front panel | 2022-09-01 |
A named tuple can also be used.
- Similarly, rows specified by dictionaries may be pushed onto a data frame
push!(d, Dict(:b => "Genius", :v => 25, :p => 228, :lap => "Y",
:load => "clamshell", :d => Date("2022-10-01")))
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Int64 | Cat… | Cat… | Date | |
1 | Goruck GR2 | 40 | 395 | Y | front panel | 2022-09-01 |
2 | Minaal 3.0 | 35 | 349 | Y | front panel | 2022-09-01 |
3 | Genius | 25 | 228 | Y | clamshell | 2022-10-01 |
(A dictionary is a key => value
container like a named tuple, but keys may be arbitrary Julia
objects – not always symbols – so we explicitly use symbols in the above command.)
Tables
interface
In the Julia
ecosystem, the Tables
package provides a lightweight means to describe tabular data, like a data frame, TableOperations
provides some basic means to query the data. The different ways in which a data frame can be defined, reflect the different ways Tables
objects can be created.
As tables are viewed as rows of homogeneous types, a vector of named tuples is a natural description, whereas tables are also naturally viewed as a collection of named columns, each column a vector. As such, both of these specify a table.
- As a struct of arrays. This is used by calling
DataFrame(a=[1,2], b= [2,1])
, say. - As an array of structs. This is used by calling
DataFrame([(a=1,b=2), (a=2,b=1)])
RDataset
Data frames are often read in from external sources, and not typed in. The RDataSets
package has collected data sets from many different R packages and bundled them into data sets for Julia
in the form of data frames. We will use some of these data sets in the examples. To load a data set we need to know its R
package and R
name:
using RDatasets
= dataset("MASS", "Cars93") # package, data set name
cars first(cars, 2) # first 2 rows only
Row | Manufacturer | Model | Type | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | AirBags | DriveTrain | Cylinders | EngineSize | Horsepower | RPM | RevPerMile | ManTransAvail | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | RearSeatRoom | LuggageRoom | Weight | Origin | Make |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cat… | String | Cat… | Float64 | Float64 | Float64 | Int32 | Int32 | Cat… | Cat… | Cat… | Float64 | Int32 | Int32 | Int32 | Cat… | Float64 | Int32 | Int32 | Int32 | Int32 | Int32 | Float64? | Int32? | Int32 | Cat… | String | |
1 | Acura | Integra | Small | 12.9 | 15.9 | 18.8 | 25 | 31 | None | Front | 4 | 1.8 | 140 | 6300 | 2890 | Yes | 13.2 | 5 | 177 | 102 | 68 | 37 | 26.5 | 11 | 2705 | non-USA | Acura Integra |
2 | Acura | Legend | Midsize | 29.2 | 33.9 | 38.7 | 18 | 25 | Driver & Passenger | Front | 6 | 3.2 | 200 | 5500 | 2335 | Yes | 18.0 | 5 | 195 | 115 | 71 | 38 | 30.0 | 15 | 3560 | non-USA | Acura Legend |
CSV.read
For interacting with spreadsheet data, when it is stored like a data frame, it may be convenient to export the data in some format and then read that into a Julia
session. A common format is to use comma-separated values, or “csv” data. The CSV
package reads such files. The CSV
package reads in delimited text data (the source) using the Tables
interface and can be instructed to write to a data frame (the sink). The CSV.read
file is used below. Here we write to some file to show that we can read it back:
using CSV
= tempname() * ".csv"
f write(f, d) # write simple data frame to file `f`
CSV.= CSV.read(f, DataFrame) d1
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String15 | Int64 | Int64 | String1 | String15 | Date | |
1 | Goruck GR2 | 40 | 395 | Y | front panel | 2022-09-01 |
2 | Minaal 3.0 | 35 | 349 | Y | front panel | 2022-09-01 |
3 | Genius | 25 | 228 | Y | clamshell | 2022-10-01 |
Comma-separated-value files are simple text files, with no metadata to indicate what type a value may be.1 Some guesswork is necessary. We observe that the two categorical variables were read back in as strings. As seen above, that can be addressed through column conversion.
The filename, may be more general. For example, it could be download(url)
for some url that points to a csv file to read data from the internet.
The methods read
and write
are qualified in the above usage with the CSV
module. In the Julia
ecosystem, the FileIO
package provides a common framework for reading and writing files; it uses the verbs load
and save
. This can also be used with DataFrames
, though it works through the CSVFiles
package – and not CSV
, as illustrated above. The read command would look like DataFrame(load(fname))
and the write command like save(fname, df)
. Here fname
would have a “.csv” extension so that the type of file could be detected.
Command | Description |
---|---|
CSV.read(file_name, DataFrame) |
Read csv file from file with given name |
CSV.read(IOBuffer(string), DataFrame) |
Read csv file from a string using an IOBuffer |
CSV.read(download(url), DataFrame) |
Read csv file an url |
open(file_name,"w") do io; CSV.write(io, dd); end |
Write data frame df to csv file |
DataFrame(load(file_name)) |
Read csv file from file with given name using CSVFiles |
save(file_name, df) |
Write data frame df to a csv file using CSVFiles |
TableScraper
The TableScraper
package can scrape tables from an HTML file in a manner that can be passed to DataFrame
for conversion to a data frame. The command DataFrame.(scrape_tables(url))
may just work to create the tables, but in practice, only well-formed tables can be scraped successfully; tables used for formatting purposes may fail. TableScraper
is built on Cascadia
and Gumbo
, as is the Harbest
package, which provides more access to the underlying data than TableScraper
.
3.1.2 Column names
The variable names of a data set are returned by the names
function, as strings:
names(d)
6-element Vector{String}:
"b"
"v"
"p"
"lap"
"load"
"d"
This function has a second argument which can be used to narrow or filter the names that are returned. For example, the following is a convenience for matching columns whose element type is a subtype of Real
:
names(d, Real)
2-element Vector{String}:
"v"
"p"
The rename!
function allows the names to be changed in-place (without returning a new data frame). There are many different calling patterns, but:
- to replace one name, the form
rename!(df, :v => :nv)
is used, where:v
is the old name, and:nv
the new one. The form follows the genericreplace!
function. The pair notation can be broadcast (.=>
) to replace more than one name, as withreplace!(df, [:v1, :v2] .=> [:nv1, :nv2])
. - to replace allnames, the form
rename!(df, [...])
is used, where the vector is of length matching the number of columns indf
.
3.1.3 Indexing and assignment
The values in a data frame can be referenced programatically by a row number and column number, both 1-based. For example, the 2nd row and 3rd column of d
can be seen to be 349
by observation
d
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Int64 | Cat… | Cat… | Date | |
1 | Goruck GR2 | 40 | 395 | Y | front panel | 2022-09-01 |
2 | Minaal 3.0 | 35 | 349 | Y | front panel | 2022-09-01 |
3 | Genius | 25 | 228 | Y | clamshell | 2022-10-01 |
but it can be referenced through indexing (row number first, column number second):
2, 3] d[
349
A new value can also be assigned with the same indexing reference. Suppose, that bag went on sale. This new price could be adjusted, as follows:
2, 3] = 309 # a sale price d[
309
Column selectors
The 3rd column has a name, p
in this case. We can use a column selector to avoid having to know which column number a variable is. For this a string or a symbol may be used, as in:
2, :p], d[2, "p"] d[
(309, 309)
More generally, it is possible to use more than 1 column in column selection. For example, to get both the volume and price, for the second bag we have:
2, [:v, :p]] d[
Row | v | p |
---|---|---|
Int64 | Int64 | |
2 | 35 | 309 |
As well, either [2, 3]
or ["v", "p"]
could have been used.
For data frames with many columns, a regular expression can be used. The column selector r"^l"
would match all columns whose name begins with l
(lap
and load
) in this case. (Regular expressions can be very complicated, but here we only assume that r"name"
will match “name” somewhere in the string; r"^name"
and r"name$"
will match “name” at the beginning and ending of a string. Using a regular expression will return a data frame row (when a row index is specified) – not a value – as it is possible to return 0, 1 or more columns in the selection.
Column selectors, as seen may be column number(s), column names as strings, column names as symbols, and regular expressions.
In addition:
- Boolean vectors of a length matching the number of columns (
d[2, [false, true, true, false, false, false]]
would return the 2nd and 3rd columns.) - The value
All()
which selects all columns - the value
Between(a, b)
where,a
andb
are column selectors (such asBetween(:v, 3)
). Not(...)
to select those columns not specified in...
Cols(a,b,...)
which combines different selectors, such asCols(:v, 3)
.
Slices
The All()
column selector has a shortcut inherited from the indexing of arrays in base Julia
, this being a colon, :
. When used in the column position it refers to all the values in the row:
2, :] d[
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Int64 | Cat… | Cat… | Date | |
2 | Minaal 3.0 | 35 | 309 | Y | front panel | 2022-09-01 |
When used in the row position, it refers to all rows:
:, 3] d[
3-element Vector{Int64}:
395
309
228
In the above use, a vector is returned, not a data frame, as only one column was selected. But which vector? Is it a copy of the one that d
holds, or is it the same container? In base Julia
the answer depends on what side of an equals sign the use is on (meaning, if used in assignment the answer is different, as mentioned later). In this usage, a copy is returned.
To get the underlying column (not a copy), the selector !
is used instead:
3] d[!,
3-element Vector{Int64}:
395
309
228
The notation d.p
also refers to the 3rd variable. In this case, d.p
is a view of the underlying data (using !
) and not a copy.
The notation d[:,:]
will refer to all rows and all columns of d
, and in this case will be a copy of the data frame.
Assignment
Single values can be assigned, as previously illustrated. These values are assigned to the underlying vector in the data frame, so as with vectors, the assigned value will be promoted to the column type, if necessary. Assigning a value that can’t be promoted will throw and error, as it would were such an assignment tried for a vector. To do such an assignment, the underlying vector must be changed to a wider type.
The row selector :
or !
had different semantics when accessing the underlying data, the first returning a copy, the second a view. When used with assignment though, :
refers to the underlying vector or does in-place assignment; whereas !
(and the dot access) will replace the the underlying container with the assigned container.
To illustrate, suppose the prices are rounded down, but actually contain an extra 99 cents. We can reassign price
as following:
:p] = price .+ 0.99 d[!,
3-element Vector{Float64}:
395.99
349.99
228.99
as even though price .+ 0.99
is no longer an integer vector, the !
row indicator will instruct d
to point to this new vector. Had d[:, :p] = ...
been used above, an error would have been thrown as the underlying container is integer, but the new prices require floating point values to represent them.
Broadcast assignment (using .=
) will be similar. The right hand side is expanded, and then assigned. Attempts to assign the wrong value type using :
to indicate the rows will error.
Missing values
Trying to assign a value of missing
to a data frame requires the underlying vector to allow missing values. As with vectors, the allowmissing
function is useful. For example, to give the 2nd bag a price of missing, we could do:
allowmissing!(d, :p)
2, :p] = missing d[
missing
By using the mutating form (with the trailing !
), the data frame d
is mutated in the allowmissing!
call. This could be observed by looking a the underlying column type.
The new type is a union of the old type and Missing
, so we could reassign the old value:
2, :p] = 349.99
d[ d
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Float64? | Cat… | Cat… | Date | |
1 | Goruck GR2 | 40 | 395.99 | Y | front panel | 2022-09-01 |
2 | Minaal 3.0 | 35 | 349.99 | Y | front panel | 2022-09-01 |
3 | Genius | 25 | 228.99 | Y | clamshell | 2022-10-01 |
Tables.jl interface
A table, like a data frame can be expected to loop, or iterate, over columns, say to find which columns are numeric. Or, they can be expected to iterate over rows, say to find which values are within a certain range. What should be the case with a data frame? As there are reasons to prefer either, none is specified. Rather, there is a difference based on the calling function. Data frames respect the Tables
interface, which has methods to iterate over rows or columns. When iterating over columns, the individual vectors are given; when iterating over rows, a DataFrameRow
object is returned. Calling copy
on these rows objects will return a named tuple.
The basic iterators are eachrow
and eachcol
. For example, to see the column type for a data frame, we can broadcast eltype
over the variables returned by eachcol
:
eltype.(eachcol(d)) |> permutedims
1×6 Matrix{Type}:
String Int64 Union{Missing, Float64} … Date
To get the data frame as a vector of named tuples, then broadcasting NamedTuple
over eachrow
can be used: NamedTuple.(eachrow(d))
; Base.copy
is an alias. Tables.rowtable(d)
will produce this vector of named tuples, as well.
3.1.4 Sorting
A data frame can be sorted by specifying a permutation of the indices, such as is returned by sortperm
. For example, d[sortperm(d.price), :]
will sort by price in increasing order. The keyword argument rev
can be passed true
to reverse the default sort order.
The generic sort
function has a method for data frames, that offers a more convenient functionality. The sort!
method does in-place sorting. The simplest call is to pass in a column to sort by, as in
sort(d, :p)
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Float64? | Cat… | Cat… | Date | |
1 | Genius | 25 | 228.99 | Y | clamshell | 2022-10-01 |
2 | Minaal 3.0 | 35 | 349.99 | Y | front panel | 2022-09-01 |
3 | Goruck GR2 | 40 | 395.99 | Y | front panel | 2022-09-01 |
This sorts by the price (:p
) variable in increasing order. To sort by decreasing order a keyword argument rev=true
can be specified:
sort(d, "v"; rev=true)
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Float64? | Cat… | Cat… | Date | |
1 | Goruck GR2 | 40 | 395.99 | Y | front panel | 2022-09-01 |
2 | Minaal 3.0 | 35 | 349.99 | Y | front panel | 2022-09-01 |
3 | Genius | 25 | 228.99 | Y | clamshell | 2022-10-01 |
There can be one or more column selectors. For this, the order
function can be used to reverse just one of the columns, as in the following, which first sorts the dates in reverse order (newest first) and then within a data, uses lowest price first to sort:
sort(d, [order("d", rev=true), :p])
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Float64? | Cat… | Cat… | Date | |
1 | Genius | 25 | 228.99 | Y | clamshell | 2022-10-01 |
2 | Minaal 3.0 | 35 | 349.99 | Y | front panel | 2022-09-01 |
3 | Goruck GR2 | 40 | 395.99 | Y | front panel | 2022-09-01 |
3.1.5 Filtering rows
Filtering, or subsetting, is a means to select a specified selection of rows.
Let’s consider the cars
data set, previously defined by loading dataset("MASS", "Cars93")
. This data set consists of 27 measurements on 93 cars from the 1990s. The data set comes from R’s MASS
package, and is documented there.
Two ready ways to filter are by using specific indices, or by using Boolean indices generated by a logical comparison.
For the cars
data set, the latter can be used to extract the Volkswagen models. To generate the indices we compare the :Manufacturer
variable with the string Volkswagen
to return a vector of length 93 that indicates if the manufacturer is VW. This is done with broadcasting: cars.Manufacturer .== "Volkswagen"
. We use the easy dot access to variables in cars
, alternatively cars[:, :Manufacturer]
(or, more efficiently, !
, as is used by the dot access) could be used were :Manufacturer
coming from some variable. With this row selector, we have:
.== "Volkswagen", :] cars[cars.Manufacturer
Row | Manufacturer | Model | Type | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | AirBags | DriveTrain | Cylinders | EngineSize | Horsepower | RPM | RevPerMile | ManTransAvail | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | RearSeatRoom | LuggageRoom | Weight | Origin | Make |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cat… | String | Cat… | Float64 | Float64 | Float64 | Int32 | Int32 | Cat… | Cat… | Cat… | Float64 | Int32 | Int32 | Int32 | Cat… | Float64 | Int32 | Int32 | Int32 | Int32 | Int32 | Float64? | Int32? | Int32 | Cat… | String | |
1 | Volkswagen | Fox | Small | 8.7 | 9.1 | 9.5 | 25 | 33 | None | Front | 4 | 1.8 | 81 | 5500 | 2550 | Yes | 12.4 | 4 | 163 | 93 | 63 | 34 | 26.0 | 10 | 2240 | non-USA | Volkswagen Fox |
2 | Volkswagen | Eurovan | Van | 16.6 | 19.7 | 22.7 | 17 | 21 | None | Front | 5 | 2.5 | 109 | 4500 | 2915 | Yes | 21.1 | 7 | 187 | 115 | 72 | 38 | 34.0 | missing | 3960 | non-USA | Volkswagen Eurovan |
3 | Volkswagen | Passat | Compact | 17.6 | 20.0 | 22.4 | 21 | 30 | None | Front | 4 | 2.0 | 134 | 5800 | 2685 | Yes | 18.5 | 5 | 180 | 103 | 67 | 35 | 31.5 | 14 | 2985 | non-USA | Volkswagen Passat |
4 | Volkswagen | Corrado | Sporty | 22.9 | 23.3 | 23.7 | 18 | 25 | None | Front | 6 | 2.8 | 178 | 5800 | 2385 | Yes | 18.5 | 4 | 159 | 97 | 66 | 36 | 26.0 | 15 | 2810 | non-USA | Volkswagen Corrado |
This approach lends itself to the description “find all rows matching some value” then “extract the identified rows,” – written as two steps to emphasize there are two passes through the data. Another mental model would be loop over the rows, and keep those that match the query. This is done generically by the filter
function for collections in Julia
or by the subset
function of DataFrames
.
The filter(predicate, collection)
function is used to identify just the values in the collection for which the predicate function returns true
. When a data frame is used with filter
, the iteration is over the rows, so the wrapping eachrow
iterator is not needed. We need a predicate function to replace the .==
above. One follows. It doesn’t need .==
, as r
is a data frame row and access produces a value not a vector:
filter(r -> r.Manufacturer == "Volkswagen", cars)
Row | Manufacturer | Model | Type | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | AirBags | DriveTrain | Cylinders | EngineSize | Horsepower | RPM | RevPerMile | ManTransAvail | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | RearSeatRoom | LuggageRoom | Weight | Origin | Make |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cat… | String | Cat… | Float64 | Float64 | Float64 | Int32 | Int32 | Cat… | Cat… | Cat… | Float64 | Int32 | Int32 | Int32 | Cat… | Float64 | Int32 | Int32 | Int32 | Int32 | Int32 | Float64? | Int32? | Int32 | Cat… | String | |
1 | Volkswagen | Fox | Small | 8.7 | 9.1 | 9.5 | 25 | 33 | None | Front | 4 | 1.8 | 81 | 5500 | 2550 | Yes | 12.4 | 4 | 163 | 93 | 63 | 34 | 26.0 | 10 | 2240 | non-USA | Volkswagen Fox |
2 | Volkswagen | Eurovan | Van | 16.6 | 19.7 | 22.7 | 17 | 21 | None | Front | 5 | 2.5 | 109 | 4500 | 2915 | Yes | 21.1 | 7 | 187 | 115 | 72 | 38 | 34.0 | missing | 3960 | non-USA | Volkswagen Eurovan |
3 | Volkswagen | Passat | Compact | 17.6 | 20.0 | 22.4 | 21 | 30 | None | Front | 4 | 2.0 | 134 | 5800 | 2685 | Yes | 18.5 | 5 | 180 | 103 | 67 | 35 | 31.5 | 14 | 2985 | non-USA | Volkswagen Passat |
4 | Volkswagen | Corrado | Sporty | 22.9 | 23.3 | 23.7 | 18 | 25 | None | Front | 6 | 2.8 | 178 | 5800 | 2385 | Yes | 18.5 | 4 | 159 | 97 | 66 | 36 | 26.0 | 15 | 2810 | non-USA | Volkswagen Corrado |
There are some conveniences that have been developed to make the predicate functions even easier to write, which will be introduced incrementally:
The basic binary comparisons, like ==
, are defined by functions essentially of the form (a,b) -> ==(a,b)
. There are partially applied versions, essentially a -> ==(a,b)
, with b
applied, formed by ==(b)
. So the syntax ==("Volkswagen")
is a predicate taking a single argument which is then compared for equality to the string "Volkswagen"
.
However, r
is a data frame row, what we need to pass to ==("Volkswagen")
is the value of the :Manufacturer
column. For this another convenience is available.
The pair
syntax, =>
of the form column selector(s) => predicate function
will pass just the value(s) in the column(s) to the predicate function. Combined, the above can be simplified to:
filter(:Manufacturer => ==("Volkswagen"), cars)
Row | Manufacturer | Model | Type | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | AirBags | DriveTrain | Cylinders | EngineSize | Horsepower | RPM | RevPerMile | ManTransAvail | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | RearSeatRoom | LuggageRoom | Weight | Origin | Make |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cat… | String | Cat… | Float64 | Float64 | Float64 | Int32 | Int32 | Cat… | Cat… | Cat… | Float64 | Int32 | Int32 | Int32 | Cat… | Float64 | Int32 | Int32 | Int32 | Int32 | Int32 | Float64? | Int32? | Int32 | Cat… | String | |
1 | Volkswagen | Fox | Small | 8.7 | 9.1 | 9.5 | 25 | 33 | None | Front | 4 | 1.8 | 81 | 5500 | 2550 | Yes | 12.4 | 4 | 163 | 93 | 63 | 34 | 26.0 | 10 | 2240 | non-USA | Volkswagen Fox |
2 | Volkswagen | Eurovan | Van | 16.6 | 19.7 | 22.7 | 17 | 21 | None | Front | 5 | 2.5 | 109 | 4500 | 2915 | Yes | 21.1 | 7 | 187 | 115 | 72 | 38 | 34.0 | missing | 3960 | non-USA | Volkswagen Eurovan |
3 | Volkswagen | Passat | Compact | 17.6 | 20.0 | 22.4 | 21 | 30 | None | Front | 4 | 2.0 | 134 | 5800 | 2685 | Yes | 18.5 | 5 | 180 | 103 | 67 | 35 | 31.5 | 14 | 2985 | non-USA | Volkswagen Passat |
4 | Volkswagen | Corrado | Sporty | 22.9 | 23.3 | 23.7 | 18 | 25 | None | Front | 6 | 2.8 | 178 | 5800 | 2385 | Yes | 18.5 | 4 | 159 | 97 | 66 | 36 | 26.0 | 15 | 2810 | non-USA | Volkswagen Corrado |
It doesn’t save much for this simple example, but this specification style is the basis of the DataFrames
mini language, which provides a standardized and performant means of wrangling a data frame.
Filtering may be done on one or more values. There are different approaches.
For example, to first filter by the manufacturer, then by city mileage, we might write a more complicated predicate function using &&
to combine Boolean values:
pred(r) = r.Manufacturer == "Volkswagen" && r.MPGCity >= 20
filter(pred, cars)
Row | Manufacturer | Model | Type | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | AirBags | DriveTrain | Cylinders | EngineSize | Horsepower | RPM | RevPerMile | ManTransAvail | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | RearSeatRoom | LuggageRoom | Weight | Origin | Make |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cat… | String | Cat… | Float64 | Float64 | Float64 | Int32 | Int32 | Cat… | Cat… | Cat… | Float64 | Int32 | Int32 | Int32 | Cat… | Float64 | Int32 | Int32 | Int32 | Int32 | Int32 | Float64? | Int32? | Int32 | Cat… | String | |
1 | Volkswagen | Fox | Small | 8.7 | 9.1 | 9.5 | 25 | 33 | None | Front | 4 | 1.8 | 81 | 5500 | 2550 | Yes | 12.4 | 4 | 163 | 93 | 63 | 34 | 26.0 | 10 | 2240 | non-USA | Volkswagen Fox |
2 | Volkswagen | Passat | Compact | 17.6 | 20.0 | 22.4 | 21 | 30 | None | Front | 4 | 2.0 | 134 | 5800 | 2685 | Yes | 18.5 | 5 | 180 | 103 | 67 | 35 | 31.5 | 14 | 2985 | non-USA | Volkswagen Passat |
The mini language can also be used. On the left hand side of (the first) =>
a vector of column selectors may be indicated, in which the predicate function (on the right hand side of the first =>
) will get multiple arguments that can be used to implement the logic:
pred(m, mpg) = m == "Volkswagen" && mpg >= 20
filter([:Manufacturer, :MPGCity] => pred, cars)
Row | Manufacturer | Model | Type | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | AirBags | DriveTrain | Cylinders | EngineSize | Horsepower | RPM | RevPerMile | ManTransAvail | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | RearSeatRoom | LuggageRoom | Weight | Origin | Make |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cat… | String | Cat… | Float64 | Float64 | Float64 | Int32 | Int32 | Cat… | Cat… | Cat… | Float64 | Int32 | Int32 | Int32 | Cat… | Float64 | Int32 | Int32 | Int32 | Int32 | Int32 | Float64? | Int32? | Int32 | Cat… | String | |
1 | Volkswagen | Fox | Small | 8.7 | 9.1 | 9.5 | 25 | 33 | None | Front | 4 | 1.8 | 81 | 5500 | 2550 | Yes | 12.4 | 4 | 163 | 93 | 63 | 34 | 26.0 | 10 | 2240 | non-USA | Volkswagen Fox |
2 | Volkswagen | Passat | Compact | 17.6 | 20.0 | 22.4 | 21 | 30 | None | Front | 4 | 2.0 | 134 | 5800 | 2685 | Yes | 18.5 | 5 | 180 | 103 | 67 | 35 | 31.5 | 14 | 2985 | non-USA | Volkswagen Passat |
Finally, we could use filter
twice:
= filter(:Manufacturer => ==("Volkswagen"), cars)
cars1 = filter(:MPGCity => >=(20), cars1) cars2
Row | Manufacturer | Model | Type | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | AirBags | DriveTrain | Cylinders | EngineSize | Horsepower | RPM | RevPerMile | ManTransAvail | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | RearSeatRoom | LuggageRoom | Weight | Origin | Make |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cat… | String | Cat… | Float64 | Float64 | Float64 | Int32 | Int32 | Cat… | Cat… | Cat… | Float64 | Int32 | Int32 | Int32 | Cat… | Float64 | Int32 | Int32 | Int32 | Int32 | Int32 | Float64? | Int32? | Int32 | Cat… | String | |
1 | Volkswagen | Fox | Small | 8.7 | 9.1 | 9.5 | 25 | 33 | None | Front | 4 | 1.8 | 81 | 5500 | 2550 | Yes | 12.4 | 4 | 163 | 93 | 63 | 34 | 26.0 | 10 | 2240 | non-USA | Volkswagen Fox |
2 | Volkswagen | Passat | Compact | 17.6 | 20.0 | 22.4 | 21 | 30 | None | Front | 4 | 2.0 | 134 | 5800 | 2685 | Yes | 18.5 | 5 | 180 | 103 | 67 | 35 | 31.5 | 14 | 2985 | non-USA | Volkswagen Passat |
The above required the introduction of an intermediate data frame to store the result of the first filter
call to pass to the second. This threading through of the modified data is quite common in processing pipelines. The first two approaches with complicated predicate functions can grow unwieldly, so staged modification is common. To support that, the chaining or piping operation (|>
) is often used:
filter(:Manufacturer => ==("Volkswagen"), cars) |>
-> filter(:MPGCity => >=(20), d) d
Row | Manufacturer | Model | Type | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | AirBags | DriveTrain | Cylinders | EngineSize | Horsepower | RPM | RevPerMile | ManTransAvail | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | RearSeatRoom | LuggageRoom | Weight | Origin | Make |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cat… | String | Cat… | Float64 | Float64 | Float64 | Int32 | Int32 | Cat… | Cat… | Cat… | Float64 | Int32 | Int32 | Int32 | Cat… | Float64 | Int32 | Int32 | Int32 | Int32 | Int32 | Float64? | Int32? | Int32 | Cat… | String | |
1 | Volkswagen | Fox | Small | 8.7 | 9.1 | 9.5 | 25 | 33 | None | Front | 4 | 1.8 | 81 | 5500 | 2550 | Yes | 12.4 | 4 | 163 | 93 | 63 | 34 | 26.0 | 10 | 2240 | non-USA | Volkswagen Fox |
2 | Volkswagen | Passat | Compact | 17.6 | 20.0 | 22.4 | 21 | 30 | None | Front | 4 | 2.0 | 134 | 5800 | 2685 | Yes | 18.5 | 5 | 180 | 103 | 67 | 35 | 31.5 | 14 | 2985 | non-USA | Volkswagen Passat |
The right-hand side of |>
expects a function for application, so an anonymous function is created. There are macros available in user-contributed packages that shorten this pattern by using a placeholder.
A popular one is, @chain
, in the Chain
package, with its basic usage based on two simple principles:
- a new line is implicitly a pipe
- a
_
receives the argument from the pipe. If none is given, the first argument does
For example,
using Chain
@chain cars begin
filter(:Manufacturer => ==("Volkswagen"), _)
filter(:MPGCity => >=(20), _)
end
Row | Manufacturer | Model | Type | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | AirBags | DriveTrain | Cylinders | EngineSize | Horsepower | RPM | RevPerMile | ManTransAvail | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | RearSeatRoom | LuggageRoom | Weight | Origin | Make |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cat… | String | Cat… | Float64 | Float64 | Float64 | Int32 | Int32 | Cat… | Cat… | Cat… | Float64 | Int32 | Int32 | Int32 | Cat… | Float64 | Int32 | Int32 | Int32 | Int32 | Int32 | Float64? | Int32? | Int32 | Cat… | String | |
1 | Volkswagen | Fox | Small | 8.7 | 9.1 | 9.5 | 25 | 33 | None | Front | 4 | 1.8 | 81 | 5500 | 2550 | Yes | 12.4 | 4 | 163 | 93 | 63 | 34 | 26.0 | 10 | 2240 | non-USA | Volkswagen Fox |
2 | Volkswagen | Passat | Compact | 17.6 | 20.0 | 22.4 | 21 | 30 | None | Front | 4 | 2.0 | 134 | 5800 | 2685 | Yes | 18.5 | 5 | 180 | 103 | 67 | 35 | 31.5 | 14 | 2985 | non-USA | Volkswagen Passat |
Subset
The calling style filter(predicate, collection)
is consistently used with several higher-order functions, for example, map
and reduce
and is supported by Julia
’s do
syntax.
However, with data frames, the convention is to have action functions expect the data frame in the first position.
The subset
function also returns a filtered copy of the data frame with rows matching a certain selection criteria. It’s usage is somewhat similar to filter
with the order of its arguments reversed, but the predicate must return a vector of indices, so broadcasting or the ByRow
function may be necessary.
For example, these produce the same data frame:
= subset(cars, :Manufacturer => m -> m .== "Volkswagen") # use .== not just ==
d1 = subset(cars, :Manufacturer => ByRow(==("Volkswagen"))) # use `ByRow` to ensure predicate is applied to each d2
Row | Manufacturer | Model | Type | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | AirBags | DriveTrain | Cylinders | EngineSize | Horsepower | RPM | RevPerMile | ManTransAvail | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | RearSeatRoom | LuggageRoom | Weight | Origin | Make |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cat… | String | Cat… | Float64 | Float64 | Float64 | Int32 | Int32 | Cat… | Cat… | Cat… | Float64 | Int32 | Int32 | Int32 | Cat… | Float64 | Int32 | Int32 | Int32 | Int32 | Int32 | Float64? | Int32? | Int32 | Cat… | String | |
1 | Volkswagen | Fox | Small | 8.7 | 9.1 | 9.5 | 25 | 33 | None | Front | 4 | 1.8 | 81 | 5500 | 2550 | Yes | 12.4 | 4 | 163 | 93 | 63 | 34 | 26.0 | 10 | 2240 | non-USA | Volkswagen Fox |
2 | Volkswagen | Eurovan | Van | 16.6 | 19.7 | 22.7 | 17 | 21 | None | Front | 5 | 2.5 | 109 | 4500 | 2915 | Yes | 21.1 | 7 | 187 | 115 | 72 | 38 | 34.0 | missing | 3960 | non-USA | Volkswagen Eurovan |
3 | Volkswagen | Passat | Compact | 17.6 | 20.0 | 22.4 | 21 | 30 | None | Front | 4 | 2.0 | 134 | 5800 | 2685 | Yes | 18.5 | 5 | 180 | 103 | 67 | 35 | 31.5 | 14 | 2985 | non-USA | Volkswagen Passat |
4 | Volkswagen | Corrado | Sporty | 22.9 | 23.3 | 23.7 | 18 | 25 | None | Front | 6 | 2.8 | 178 | 5800 | 2385 | Yes | 18.5 | 4 | 159 | 97 | 66 | 36 | 26.0 | 15 | 2810 | non-USA | Volkswagen Corrado |
Unlike filter
, subset
allows multiple arguments to be applied:
subset(cars,
:Manufacturer => ByRow(==("Volkswagen")),
:MPGCity => ByRow(>=(20)))
Row | Manufacturer | Model | Type | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | AirBags | DriveTrain | Cylinders | EngineSize | Horsepower | RPM | RevPerMile | ManTransAvail | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | RearSeatRoom | LuggageRoom | Weight | Origin | Make |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cat… | String | Cat… | Float64 | Float64 | Float64 | Int32 | Int32 | Cat… | Cat… | Cat… | Float64 | Int32 | Int32 | Int32 | Cat… | Float64 | Int32 | Int32 | Int32 | Int32 | Int32 | Float64? | Int32? | Int32 | Cat… | String | |
1 | Volkswagen | Fox | Small | 8.7 | 9.1 | 9.5 | 25 | 33 | None | Front | 4 | 1.8 | 81 | 5500 | 2550 | Yes | 12.4 | 4 | 163 | 93 | 63 | 34 | 26.0 | 10 | 2240 | non-USA | Volkswagen Fox |
2 | Volkswagen | Passat | Compact | 17.6 | 20.0 | 22.4 | 21 | 30 | None | Front | 4 | 2.0 | 134 | 5800 | 2685 | Yes | 18.5 | 5 | 180 | 103 | 67 | 35 | 31.5 | 14 | 2985 | non-USA | Volkswagen Passat |
The mini language uses a default of identity
for the function so, if the columns are Boolean, they can be used to subset the data:
= DataFrame(a = [true, false], b = [true, missing])
dd subset(dd, :a) # just first row
Row | a | b |
---|---|---|
Bool | Bool? | |
1 | true | true |
For data with missing values, like in the :b
column of dd
, the comparison operators implement 3-valued logic, meaning the response can be true
, false
, or missing
. Using subset(dd, :b)
above will error. The keyword skipmissing
can be passed true
to have these skipped over. The default is false.
subset(dd, :b; skipmissing=true)
Row | a | b |
---|---|---|
Bool | Bool? | |
1 | true | true |
The dropmissing
function filters out each row that has a missing
value.
3.1.6 Selecting columns; transforming data
Filtering of a data frame can be viewed as accessing the data in df
with a boolean set of row indices, as in df[inds, :]
. (There are flags to have a “view” of the data, as in df[inds, !]
). Filtering returns a data frame with a possibly reduced number of rows, and the same number of columns.
The select
function is related in that it can be viewed as passing a Boolean vector of column indices to select specific columns or variables, as in df[:, inds]
. The syntax of select
also extends this allowing transformations of the data and reassignment, as could be achieved through this pattern: df.new_variable = f(other_variables_in_df)
.
The select
function will return a data frame with the same number of rows (cases) as the original data frame. The variables returned are selected using column selectors. There is support for the DataFrames
mini language.
First, we can select one or more columns by separating column selectors with commas. In the following, using the backpack data in the variable d
, we select the first column by column number, the second using a symbol, the third by a string, and the fourth and fifth using a regular expression:2
select(d, 1, :v, "p", r"^l")
Row | b | v | p | lap | load |
---|---|---|---|---|---|
String | Int64 | Float64? | Cat… | Cat… | |
1 | Goruck GR2 | 40 | 395.99 | Y | front panel |
2 | Minaal 3.0 | 35 | 349.99 | Y | front panel |
3 | Genius | 25 | 228.99 | Y | clamshell |
One use of select
is to rearrange the column order, say by moving a column to the front. The :
or All()
column selector will add the rest. That is this command would move “d
” to the front of the other columns: select(d, :d, All())
.
Selection also allows an easy renaming of column names, using the pairs notation column name => new column name
:
select(d, :v => :volume, "p" => "price")
Row | volume | price |
---|---|---|
Int64 | Float64? | |
1 | 40 | 395.99 |
2 | 35 | 349.99 |
3 | 25 | 228.99 |
This usage is a supported abbreviation of source_column => function => target_column_name
, where the function is identity
. Here the function is not necessarily a predicate function, as it is with subset
, but rather a function which returns a vector of the same length as the number of columns in the data frame (as select
always returns the same number of columns as the data frame it is passed.)
The function receives column vectors and should return a column vector. For a scalar function, it must be applied to each entry of the column vector, that is, each row. Broadcasting is one manner to do this. As used previously, the DataFrames
package provides ByRow
to also apply a function to each row of the column vector(s).
For example to compute a ratio of price to volume, we might have:
select(d, [:p, :v] => ByRow((p,v) -> p/v) => "price to volume")
Row | price to volume |
---|---|
Float64 | |
1 | 9.89975 |
2 | 9.99971 |
3 | 9.1596 |
The above would be equivalent to select(d, [:p, :v] => ((p,v) -> p./v) => "price to volume")
(broadcasted p ./ v
and parentheses), but not select(d, [:p, :v] => (p,v) -> p./v => "price to volume")
, as the precedence rules for =>
do not parse the expressions identically. That is, the use of parentheses around an anonymous function is needed.
The specification of a target column name is not necessary, as one will be generated. In this particular example, it would be :p_v_function
. For named functions, the automatic namings are a bit more informative, as the method name replaces the “function
” part. The renamecols
keyword argument can be set to false
to drop the addition of a function name, which in this case would leave :p_v
as the name. For a single-column selector it would not rename the column, rather replace it.
The above reduced the number of columns to just that specified. The :
selector will select all columns (as it does with indexing) and return them:
select(d, [:p, :v] => ByRow((p,v) -> p/v) => "price to volume", :) # return all columns
Row | price to volume | b | v | p | lap | load | d |
---|---|---|---|---|---|---|---|
Float64 | String | Int64 | Float64? | Cat… | Cat… | Date | |
1 | 9.89975 | Goruck GR2 | 40 | 395.99 | Y | front panel | 2022-09-01 |
2 | 9.99971 | Minaal 3.0 | 35 | 349.99 | Y | front panel | 2022-09-01 |
3 | 9.1596 | Genius | 25 | 228.99 | Y | clamshell | 2022-10-01 |
As seen in this last example, the source column selectors can select more than one column. The function receives multiple arguments. It is possible that there be multiple target columns. For example, we could compute both price-to-volume and volume-to-price quantities. A function to do so would be
pv_and_vp(p,v) = [p/v, v/p]
pv_and_vp (generic function with 1 method)
When we use this, we see the two values are stored in one column.
select(d, [:p, :v] => ByRow(pv_and_vp))
Row | p_v_pv_and_vp |
---|---|
Array… | |
1 | [9.89975, 0.101013] |
2 | [9.99971, 0.100003] |
3 | [9.1596, 0.109175] |
To split these up, we can give two names:
select(d, [:p, :v] => ByRow(pv_and_vp) => [:pv, :vp])
Row | pv | vp |
---|---|---|
Float64 | Float64 | |
1 | 9.89975 | 0.101013 |
2 | 9.99971 | 0.100003 |
3 | 9.1596 | 0.109175 |
AsTable
The AsTable
function has two meanings in the mini language. If on the target side, it assumes the output of the function is a vector of containers (in this example the vectors [p/v, v/p]
) that should be expanded into multiple columns. It uses the keys (were the containers named tuples) or, in this case generates them, to create multiple columns in the destination:
select(d, [:p, :v] => ByRow(pv_and_vp) => AsTable)
Row | x1 | x2 |
---|---|---|
Float64 | Float64 | |
1 | 9.89975 | 0.101013 |
2 | 9.99971 | 0.100003 |
3 | 9.1596 | 0.109175 |
Had we used a named tuple in our function, pv_and_vp(p, v) = (pv = p/v, vp = v/p)
, then the column names would come from the tuple’s names and need not be generated.
When AsTable
is used on the source columns, as in AsTable([:p,:v])
then the function receives a named tuple with column vector entries. (For this, pv_and_vp
would be written pv_and_vp(r) = [r.p/r.v, r.v/r.p]
, with r
representing a row in a two-column table with names :p
and :v
.
Transform
Extending the columns in the data frame by select
is common enough that the function transform
is supplied which always keeps the columns of the original data frame, though they can also be modified through the mini language. The use of transfrom is equivalent to select(df, :, args...)
.
The mini language is well documented in the documentation string of transform
and the blog post of Bogumil Kasminski “DataFrames.jl minilanguage explained.”
3.1.7 Combine
The combine
function creates a new data frame whose columns are the result of transformations applied to the source data frame. The name will be more clear after we discuss grouping or splitting of a data set.
A typical usage of combine is to apply some reduction to the data, for example finding an average.
In this example, we use the mean
function from the StatsBase
package and apply that to the :MPGCity
measurements in the cars
data:
combine(cars, :MPGCity => mean)
Row | MPGCity_mean |
---|---|
Float64 | |
1 | 22.3656 |
(A second pair can be used to specify a name for the target, as in :MPGCity => mean => :AvgMPGCity
)
There are several numeric variables in this data set, we may wish to find the mean of more than 1 at a time. For this, we can broadcast the columns via .=>
, as in this example:
combine(cars, [:MPGCity, :MPGHighway] .=> mean)
Row | MPGCity_mean | MPGHighway_mean |
---|---|---|
Float64 | Float64 | |
1 | 22.3656 | 29.086 |
(That is an alternate to combine(cars, :MPGCity => mean, :MPGHighway => mean)
.)
Broadcasting =>
lends itself to succinctly applying a function to multiple columns.
For example, to apply mean
to all the numeric values, we might select them first (using the filtering feature of names
), then pass to the mini language, as follows:
= names(cars, Real)
nms combine(cars, nms .=> mean; renamecols=false)
Row | MinPrice | Price | MaxPrice | MPGCity | MPGHighway | EngineSize | Horsepower | RPM | RevPerMile | FuelTankCapacity | Passengers | Length | Wheelbase | Width | TurnCircle | Weight |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 17.1258 | 19.5097 | 21.8989 | 22.3656 | 29.086 | 2.66774 | 143.828 | 5280.65 | 2332.2 | 16.6645 | 5.08602 | 183.204 | 103.946 | 69.3763 | 38.957 | 3072.9 |
More than one transformation at a time is possible, as already seen in the mini language. With combine
the number of rows is determined by the output of the transformation. In this example, the mean
reduces to a number, but unique
reduces :Origin
to the number of levels. The result of mean
is repeated, it is not the mean for each group (a task we demonstrate shortly):
combine(cars, :MPGCity => mean, :Origin => unique)
Row | MPGCity_mean | Origin_unique |
---|---|---|
Float64 | String | |
1 | 22.3656 | non-USA |
2 | 22.3656 | USA |
3.1.8 Flatten
As mentioned, the repeat
function is used to repeat values in a vector, or other iterable. The simplest usage looks like this:
repeat([1,2,3], 2) |> show
[1, 2, 3, 1, 2, 3]
A common use of repetition is to take two variables, say x
and y
, perhaps of unequal length and create a data frame with two columns: one to store the values and one to indicate if a value came from x
or y
. This could be achieved like
= [1,2,3]
x = [4, 5]
y
DataFrame(variable=vcat(repeat([:x], length(x)), repeat([:y], length(y))),
= vcat(x,y)) values
Row | variable | values |
---|---|---|
Symbol | Int64 | |
1 | x | 1 |
2 | x | 2 |
3 | x | 3 |
4 | y | 4 |
5 | y | 5 |
The flatten
function offers an alternative. Consider the data frame:
= [1,2,3]
x = [4, 5]
y = DataFrame(variable=[:x,:y], value=[x, y]) dxy
Row | variable | value |
---|---|---|
Symbol | Array… | |
1 | x | [1, 2, 3] |
2 | y | [4, 5] |
The flatten
function will iterate over the indicated columns and repeat the other variables:
flatten(dxy, :value)
Row | variable | value |
---|---|---|
Symbol | Int64 | |
1 | x | 1 |
2 | x | 2 |
3 | x | 3 |
4 | y | 4 |
5 | y | 5 |
3.1.9 SplitApplyCombine
The split-apply-combine strategy for wrangling data frames was popularized in the influential paper “The Split-Apply-Combine Strategy for Data Analysis: (Wickham 2011) by H. Wickham which introduces this description:”where you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together.”
The groupby
function for data frames does the breaking up, the DataFrames
mini language can operate on each piece, and the combine
function can put things together again.
The groupby
function splits a data frame into pieces of type GroupedDataFrame
. The basic signature is groupby(df, cols; sort=nothing, skipmissing=false)
. The cols
are one or more column selectors. The value of sort
defaults to nothing
leaving the order of the groups in the result undefined, but the grouping uses the fastest identified algorithm. The skipmissing
argument can instruct the skipping of rows which have missing values in the selected columns.
For example, grouping by loading style, returns two sub data frames.
= groupby(d, :load) gdf
GroupedDataFrame with 2 groups based on key: load
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Float64? | Cat… | Cat… | Date | |
1 | Goruck GR2 | 40 | 395.99 | Y | front panel | 2022-09-01 |
2 | Minaal 3.0 | 35 | 349.99 | Y | front panel | 2022-09-01 |
⋮
Row | b | v | p | lap | load | d |
---|---|---|---|---|---|---|
String | Int64 | Float64? | Cat… | Cat… | Date | |
1 | Genius | 25 | 228.99 | Y | clamshell | 2022-10-01 |