3  Tabular data

Some data sets are naturally stored in a tabular format, like a spreadsheet. This chapter looks at such data.

In the following, we will utilize these basic packages, among others introduced along the way:

using StatsBase, StatsPlots

3.1 Data frames

The basic framework for exploratory data analysis is a data set with 1 or more observations for \(n\) different cases. For each case (or observation) a variable will be a length-\(n\) data set, with values measuring the same type of quantity, or perhaps missing. The variables can be stored as vectors. The collection of variables is traditionally arranged in a rectangular manner with each row representing the measurements for a given case and each column representing the measured values for each variable. That is the columns are homogeneous (as they measure the same thing), but the rows need not be (as different variables are measuring different quantities).

Matrices contain rectangular data, but are homogeneous. As such, a new structure is needed to store tabular data in general. A data frame is a concept borrowed from S Plus where they were introduced in 1991. The R language has data frames as a basic built-in data container. In Julia the structure is implemented in an external package, DataFrames.

using DataFrames
An OG package

The DataFrames package dates back to 2012, at least, and must be one of Julia’s oldest packages. Julia was on version 0.1 then. It is also under continual development and has a comprehensive set of documentation and tutorials. Bouchet-Valat and Kamiński (2023) has an excellent overview including comparisons to other implementations in other languages and context for certain design decisions. This introduction attempts to illustrate some basic usage; consult the provided documentation for additional details and applications.

3.1.1 Data frame construction

There are different ways to construct a data frame.

Consider the task of the Wirecutter in trying to select the best carry on travel bag. After compiling a list of possible models by scouring travel blogs etc., they select some criteria (capacity, compartment design, aesthetics, comfort, …) and compile data, similar to what one person collected in a spreadsheet. Here we create a much simplified spreadsheet for 3 listed bags with measurements of volume, price, laptop compatability, loading style, and a last-checked date – as this market improves constantly.

product         v  p    l loads       checked
Goruck GR2      40 395  Y front panel 2022-09
Minaal 3.0      35 349  Y front panel 2022-09
Genius          25 228  Y clamshell   2022-10

We see that product is a character, volume and price numeric, laptop compatability a Boolean value, load style one of a few levels, and the last checked date, a year-month date.

We create vectors to hold each. We load the CategoricalArrays and Dates packages for a few of the variables:

using CategoricalArrays, Dates
product = ["Goruck GR2", "Minaal 3.0", "Genius"]
volume = [40, 35, 25]
price = [395, 349, 228]
laptop_compatability = categorical(["Y","Y","Y"])
loading_style = categorical(["front panel", "front panel", "clamshell"])
date_checked = Date.(2022, [9,9,10])
3-element Vector{Date}:
 2022-09-01
 2022-09-01
 2022-10-01

With this, we use the DataFrame constructor to combine these into one data set.

d = DataFrame(product = product, volume=volume, price=price,
              var"laptop compatability"=laptop_compatability,
              var"loading style"=loading_style, var"date checked"=date_checked)
3×6 DataFrame
Row product volume price laptop compatability loading style date checked
String Int64 Int64 Cat… Cat… Date
1 Goruck GR2 40 395 Y front panel 2022-09-01
2 Minaal 3.0 35 349 Y front panel 2022-09-01
3 Genius 25 228 Y clamshell 2022-10-01

The call to the constructor is like that of a named tuple and again, var"..." is employed for names with spaces.

As called above, the constructor binds vectors of equal length into a new multi-variate container.

The describe function will summarize a data frame using different summaries for different types:

describe(d)
6×7 DataFrame
Row variable mean min median max nmissing eltype
Symbol Union… Any Any Any Int64 DataType
1 product Genius Minaal 3.0 0 String
2 volume 33.3333 25 35.0 40 0 Int64
3 price 324.0 228 349.0 395 0 Int64
4 laptop compatability Y Y 0 CategoricalValue{String, UInt32}
5 loading style clamshell front panel 0 CategoricalValue{String, UInt32}
6 date checked 2022-09-01 2022-09-01 2022-10-01 0 Date

In the above construction, we repeated the names of the variables to the constructor. A style, also akin to the construction of named tuple, could also have been used where variables after a semicolon are implicitly assumed to have the desired name:

d = DataFrame(; product, volume, price,
              var"laptop compatability"=laptop_compatability,
              var"loading style"=loading_style, var"date checked"=date_checked)
3×6 DataFrame
Row product volume price laptop compatability loading style date checked
String Int64 Int64 Cat… Cat… Date
1 Goruck GR2 40 395 Y front panel 2022-09-01
2 Minaal 3.0 35 349 Y front panel 2022-09-01
3 Genius 25 228 Y clamshell 2022-10-01

(In fact, with the rename! function, or other conveniences, it can be more direct to define the data frame, then rename the columns, avoiding the var above.)

We see that the DataFrame type displays values as we might expect with the column header showing the name of the variable and the storage type. The row numbers are added and may be used for reference to the data values.

There is a convention for attaching new columns to a data frame that allows another alternate construction:

d = DataFrame()   # empty data frame
d.product = product
d.volume = volume
d.price = price
d."laptop compatability" = laptop_compatability
d."loading style" = loading_style
d."date checked" = date_checked
d
3×6 DataFrame
Row product volume price laptop compatability loading style date checked
String Int64 Int64 Cat… Cat… Date
1 Goruck GR2 40 395 Y front panel 2022-09-01
2 Minaal 3.0 35 349 Y front panel 2022-09-01
3 Genius 25 228 Y clamshell 2022-10-01

This uses one of a few different ways to refer to a column in a data frame.

As another alternative, the insertcols! function can insert multiple columns, for example the last one in the example, could have been written insertcols!(d, "date checked" => date_checked).

Before exploring how we might query the data frame, we note that an alternate means to describe the data set is by row, or case. For each row, we might use a named tuple with the names indicating the variable, and the values the measurement. The DataFrame constructor would accept this, here we shorten the names and only do one row (for now):

d = DataFrame([
 (b = "Goruck GR2", v = 40, p = 395, lap = "Y", load = "front panel", d = Date("2022-09-01"))
])
1×6 DataFrame
Row b v p lap load d
String Int64 Int64 String String Date
1 Goruck GR2 40 395 Y front panel 2022-09-01

Here, the variable types are different, as it takes more effort to make one categorical. The same way we defined a column allows us to redefine that column (replacing the container, not re-using it for storage). For example, we could do:

d.lap = categorical(d.lap)
d.load = categorical(d.load)
1-element CategoricalArray{String,1,UInt32}:
 "front panel"

and then the two variables would be categorical.

There are other ways to define row by row:

  • A data frame with empty variables can be constructed and then values as tuples may be push!ed onto the data frame
push!(d,  ("Minaal 3.0", 35, 349, "Y", "front panel", Date("2022-09-01")))
2×6 DataFrame
Row b v p lap load d
String Int64 Int64 Cat… Cat… Date
1 Goruck GR2 40 395 Y front panel 2022-09-01
2 Minaal 3.0 35 349 Y front panel 2022-09-01

A named tuple can also be used.

  • Similarly, rows specified by dictionaries may be pushed onto a data frame
push!(d, Dict(:b => "Genius", :v => 25, :p => 228, :lap => "Y",
              :load => "clamshell", :d => Date("2022-10-01")))
3×6 DataFrame
Row b v p lap load d
String Int64 Int64 Cat… Cat… Date
1 Goruck GR2 40 395 Y front panel 2022-09-01
2 Minaal 3.0 35 349 Y front panel 2022-09-01
3 Genius 25 228 Y clamshell 2022-10-01

(A dictionary is a key => value container like a named tuple, but keys may be arbitrary Julia objects – not always symbols – so we explicitly use symbols in the above command.)

The Tables interface

In the Julia ecosystem, the Tables package provides a lightweight means to describe tabular data, like a data frame, TableOperations provides some basic means to query the data. The different ways in which a data frame can be defined, reflect the different ways Tables objects can be created.

As tables are viewed as rows of homogeneous types, a vector of named tuples is a natural description, whereas tables are also naturally viewed as a collection of named columns, each column a vector. As such, both of these specify a table.

  • As a struct of arrays. This is used by calling DataFrame(a=[1,2], b= [2,1]), say.
  • As an array of structs. This is used by calling DataFrame([(a=1,b=2), (a=2,b=1)])

RDataset

Data frames are often read in from external sources, and not typed in. The RDataSets package has collected data sets from many different R packages and bundled them into data sets for Julia in the form of data frames. We will use some of these data sets in the examples. To load a data set we need to know its R package and R name:

using RDatasets
cars = dataset("MASS", "Cars93")  # package, data set name
first(cars, 2)  # first 2 rows only
2×27 DataFrame
Row Manufacturer Model Type MinPrice Price MaxPrice MPGCity MPGHighway AirBags DriveTrain Cylinders EngineSize Horsepower RPM RevPerMile ManTransAvail FuelTankCapacity Passengers Length Wheelbase Width TurnCircle RearSeatRoom LuggageRoom Weight Origin Make
Cat… String Cat… Float64 Float64 Float64 Int32 Int32 Cat… Cat… Cat… Float64 Int32 Int32 Int32 Cat… Float64 Int32 Int32 Int32 Int32 Int32 Float64? Int32? Int32 Cat… String
1 Acura Integra Small 12.9 15.9 18.8 25 31 None Front 4 1.8 140 6300 2890 Yes 13.2 5 177 102 68 37 26.5 11 2705 non-USA Acura Integra
2 Acura Legend Midsize 29.2 33.9 38.7 18 25 Driver & Passenger Front 6 3.2 200 5500 2335 Yes 18.0 5 195 115 71 38 30.0 15 3560 non-USA Acura Legend

CSV.read

For interacting with spreadsheet data, when it is stored like a data frame, it may be convenient to export the data in some format and then read that into a Julia session. A common format is to use comma-separated values, or “csv” data. The CSV package reads such files. The CSV package reads in delimited text data (the source) using the Tables interface and can be instructed to write to a data frame (the sink). The CSV.read file is used below. Here we write to some file to show that we can read it back:

using CSV
f = tempname() * ".csv"
CSV.write(f, d)  # write simple data frame to file `f`
d1 = CSV.read(f, DataFrame)
3×6 DataFrame
Row b v p lap load d
String15 Int64 Int64 String1 String15 Date
1 Goruck GR2 40 395 Y front panel 2022-09-01
2 Minaal 3.0 35 349 Y front panel 2022-09-01
3 Genius 25 228 Y clamshell 2022-10-01

Comma-separated-value files are simple text files, with no metadata to indicate what type a value may be.1 Some guesswork is necessary. We observe that the two categorical variables were read back in as strings. As seen above, that can be addressed through column conversion.

The filename, may be more general. For example, it could be download(url) for some url that points to a csv file to read data from the internet.

Read and write

The methods read and write are qualified in the above usage with the CSV module. In the Julia ecosystem, the FileIO package provides a common framework for reading and writing files; it uses the verbs load and save. This can also be used with DataFrames, though it works through the CSVFiles package – and not CSV, as illustrated above. The read command would look like DataFrame(load(fname)) and the write command like save(fname, df). Here fname would have a “.csv” extension so that the type of file could be detected.

Basic usage to read/write .csv file into a data frame.
Command Description
CSV.read(file_name, DataFrame) Read csv file from file with given name
CSV.read(IOBuffer(string), DataFrame) Read csv file from a string using an IOBuffer
CSV.read(download(url), DataFrame) Read csv file an url
open(file_name,"w") do io; CSV.write(io, dd); end Write data frame df to csv file
DataFrame(load(file_name)) Read csv file from file with given name using CSVFiles
save(file_name, df) Write data frame df to a csv file using CSVFiles

TableScraper

The TableScraper package can scrape tables from an HTML file in a manner that can be passed to DataFrame for conversion to a data frame. The command DataFrame.(scrape_tables(url)) may just work to create the tables, but in practice, only well-formed tables can be scraped successfully; tables used for formatting purposes may fail. TableScraper is built on Cascadia and Gumbo, as is the Harbest package, which provides more access to the underlying data than TableScraper.

3.1.2 Column names

The variable names of a data set are returned by the names function, as strings:

names(d)
6-element Vector{String}:
 "b"
 "v"
 "p"
 "lap"
 "load"
 "d"

This function has a second argument which can be used to narrow or filter the names that are returned. For example, the following is a convenience for matching columns whose element type is a subtype of Real:

names(d, Real)
2-element Vector{String}:
 "v"
 "p"

The rename! function allows the names to be changed in-place (without returning a new data frame). There are many different calling patterns, but:

  • to replace one name, the form rename!(df, :v => :nv) is used, where :v is the old name, and :nv the new one. The form follows the generic replace! function. The pair notation can be broadcast (.=>) to replace more than one name, as with replace!(df, [:v1, :v2] .=> [:nv1, :nv2]).
  • to replace allnames, the form rename!(df, [...]) is used, where the vector is of length matching the number of columns in df.

3.1.3 Indexing and assignment

The values in a data frame can be referenced programatically by a row number and column number, both 1-based. For example, the 2nd row and 3rd column of d can be seen to be 349 by observation

d
3×6 DataFrame
Row b v p lap load d
String Int64 Int64 Cat… Cat… Date
1 Goruck GR2 40 395 Y front panel 2022-09-01
2 Minaal 3.0 35 349 Y front panel 2022-09-01
3 Genius 25 228 Y clamshell 2022-10-01

but it can be referenced through indexing (row number first, column number second):

d[2, 3]
349

A new value can also be assigned with the same indexing reference. Suppose, that bag went on sale. This new price could be adjusted, as follows:

d[2, 3] = 309 # a sale price
309

Column selectors

The 3rd column has a name, p in this case. We can use a column selector to avoid having to know which column number a variable is. For this a string or a symbol may be used, as in:

d[2, :p], d[2, "p"]
(309, 309)

More generally, it is possible to use more than 1 column in column selection. For example, to get both the volume and price, for the second bag we have:

d[2, [:v, :p]]
DataFrameRow (2 columns)
Row v p
Int64 Int64
2 35 309

As well, either [2, 3] or ["v", "p"] could have been used.

For data frames with many columns, a regular expression can be used. The column selector r"^l" would match all columns whose name begins with l (lap and load) in this case. (Regular expressions can be very complicated, but here we only assume that r"name" will match “name” somewhere in the string; r"^name" and r"name$" will match “name” at the beginning and ending of a string. Using a regular expression will return a data frame row (when a row index is specified) – not a value – as it is possible to return 0, 1 or more columns in the selection.

Column selectors, as seen may be column number(s), column names as strings, column names as symbols, and regular expressions.

In addition:

  • Boolean vectors of a length matching the number of columns (d[2, [false, true, true, false, false, false]] would return the 2nd and 3rd columns.)
  • The value All() which selects all columns
  • the value Between(a, b) where, a and b are column selectors (such as Between(:v, 3)).
  • Not(...) to select those columns not specified in ...
  • Cols(a,b,...) which combines different selectors, such as Cols(:v, 3).

Slices

The All() column selector has a shortcut inherited from the indexing of arrays in base Julia, this being a colon, :. When used in the column position it refers to all the values in the row:

d[2, :]
DataFrameRow (6 columns)
Row b v p lap load d
String Int64 Int64 Cat… Cat… Date
2 Minaal 3.0 35 309 Y front panel 2022-09-01

When used in the row position, it refers to all rows:

d[:, 3]
3-element Vector{Int64}:
 395
 309
 228

In the above use, a vector is returned, not a data frame, as only one column was selected. But which vector? Is it a copy of the one that d holds, or is it the same container? In base Julia the answer depends on what side of an equals sign the use is on (meaning, if used in assignment the answer is different, as mentioned later). In this usage, a copy is returned.

To get the underlying column (not a copy), the selector ! is used instead:

d[!, 3]
3-element Vector{Int64}:
 395
 309
 228

The notation d.p also refers to the 3rd variable. In this case, d.p is a view of the underlying data (using !) and not a copy.

The notation d[:,:] will refer to all rows and all columns of d, and in this case will be a copy of the data frame.

Assignment

Single values can be assigned, as previously illustrated. These values are assigned to the underlying vector in the data frame, so as with vectors, the assigned value will be promoted to the column type, if necessary. Assigning a value that can’t be promoted will throw and error, as it would were such an assignment tried for a vector. To do such an assignment, the underlying vector must be changed to a wider type.

The row selector : or ! had different semantics when accessing the underlying data, the first returning a copy, the second a view. When used with assignment though, : refers to the underlying vector or does in-place assignment; whereas ! (and the dot access) will replace the the underlying container with the assigned container.

To illustrate, suppose the prices are rounded down, but actually contain an extra 99 cents. We can reassign price as following:

d[!, :p] = price .+ 0.99
3-element Vector{Float64}:
 395.99
 349.99
 228.99

as even though price .+ 0.99 is no longer an integer vector, the ! row indicator will instruct d to point to this new vector. Had d[:, :p] = ... been used above, an error would have been thrown as the underlying container is integer, but the new prices require floating point values to represent them.

Broadcast assignment (using .=) will be similar. The right hand side is expanded, and then assigned. Attempts to assign the wrong value type using : to indicate the rows will error.

Missing values

Trying to assign a value of missing to a data frame requires the underlying vector to allow missing values. As with vectors, the allowmissing function is useful. For example, to give the 2nd bag a price of missing, we could do:

allowmissing!(d, :p)
d[2, :p] = missing
missing

By using the mutating form (with the trailing !), the data frame d is mutated in the allowmissing! call. This could be observed by looking a the underlying column type.

The new type is a union of the old type and Missing, so we could reassign the old value:

d[2, :p] = 349.99
d
3×6 DataFrame
Row b v p lap load d
String Int64 Float64? Cat… Cat… Date
1 Goruck GR2 40 395.99 Y front panel 2022-09-01
2 Minaal 3.0 35 349.99 Y front panel 2022-09-01
3 Genius 25 228.99 Y clamshell 2022-10-01

Tables.jl interface

A table, like a data frame can be expected to loop, or iterate, over columns, say to find which columns are numeric. Or, they can be expected to iterate over rows, say to find which values are within a certain range. What should be the case with a data frame? As there are reasons to prefer either, none is specified. Rather, there is a difference based on the calling function. Data frames respect the Tables interface, which has methods to iterate over rows or columns. When iterating over columns, the individual vectors are given; when iterating over rows, a DataFrameRow object is returned. Calling copy on these rows objects will return a named tuple.

The basic iterators are eachrow and eachcol. For example, to see the column type for a data frame, we can broadcast eltype over the variables returned by eachcol:

eltype.(eachcol(d))  |> permutedims
1×6 Matrix{Type}:
 String  Int64  Union{Missing, Float64}  …  Date

To get the data frame as a vector of named tuples, then broadcasting NamedTuple over eachrow can be used: NamedTuple.(eachrow(d)); Base.copy is an alias. Tables.rowtable(d) will produce this vector of named tuples, as well.

3.1.4 Sorting

A data frame can be sorted by specifying a permutation of the indices, such as is returned by sortperm. For example, d[sortperm(d.price), :] will sort by price in increasing order. The keyword argument rev can be passed true to reverse the default sort order.

The generic sort function has a method for data frames, that offers a more convenient functionality. The sort! method does in-place sorting. The simplest call is to pass in a column to sort by, as in

sort(d, :p)
3×6 DataFrame
Row b v p lap load d
String Int64 Float64? Cat… Cat… Date
1 Genius 25 228.99 Y clamshell 2022-10-01
2 Minaal 3.0 35 349.99 Y front panel 2022-09-01
3 Goruck GR2 40 395.99 Y front panel 2022-09-01

This sorts by the price (:p) variable in increasing order. To sort by decreasing order a keyword argument rev=true can be specified:

sort(d, "v"; rev=true)
3×6 DataFrame
Row b v p lap load d
String Int64 Float64? Cat… Cat… Date
1 Goruck GR2 40 395.99 Y front panel 2022-09-01
2 Minaal 3.0 35 349.99 Y front panel 2022-09-01
3 Genius 25 228.99 Y clamshell 2022-10-01

There can be one or more column selectors. For this, the order function can be used to reverse just one of the columns, as in the following, which first sorts the dates in reverse order (newest first) and then within a data, uses lowest price first to sort:

sort(d, [order("d", rev=true), :p])
3×6 DataFrame
Row b v p lap load d
String Int64 Float64? Cat… Cat… Date
1 Genius 25 228.99 Y clamshell 2022-10-01
2 Minaal 3.0 35 349.99 Y front panel 2022-09-01
3 Goruck GR2 40 395.99 Y front panel 2022-09-01

3.1.5 Filtering rows

Filtering, or subsetting, is a means to select a specified selection of rows.

Let’s consider the cars data set, previously defined by loading dataset("MASS", "Cars93"). This data set consists of 27 measurements on 93 cars from the 1990s. The data set comes from R’s MASS package, and is documented there.

Two ready ways to filter are by using specific indices, or by using Boolean indices generated by a logical comparison.

For the cars data set, the latter can be used to extract the Volkswagen models. To generate the indices we compare the :Manufacturer variable with the string Volkswagen to return a vector of length 93 that indicates if the manufacturer is VW. This is done with broadcasting: cars.Manufacturer .== "Volkswagen". We use the easy dot access to variables in cars, alternatively cars[:, :Manufacturer] (or, more efficiently, !, as is used by the dot access) could be used were :Manufacturer coming from some variable. With this row selector, we have:

cars[cars.Manufacturer .== "Volkswagen", :]
4×27 DataFrame
Row Manufacturer Model Type MinPrice Price MaxPrice MPGCity MPGHighway AirBags DriveTrain Cylinders EngineSize Horsepower RPM RevPerMile ManTransAvail FuelTankCapacity Passengers Length Wheelbase Width TurnCircle RearSeatRoom LuggageRoom Weight Origin Make
Cat… String Cat… Float64 Float64 Float64 Int32 Int32 Cat… Cat… Cat… Float64 Int32 Int32 Int32 Cat… Float64 Int32 Int32 Int32 Int32 Int32 Float64? Int32? Int32 Cat… String
1 Volkswagen Fox Small 8.7 9.1 9.5 25 33 None Front 4 1.8 81 5500 2550 Yes 12.4 4 163 93 63 34 26.0 10 2240 non-USA Volkswagen Fox
2 Volkswagen Eurovan Van 16.6 19.7 22.7 17 21 None Front 5 2.5 109 4500 2915 Yes 21.1 7 187 115 72 38 34.0 missing 3960 non-USA Volkswagen Eurovan
3 Volkswagen Passat Compact 17.6 20.0 22.4 21 30 None Front 4 2.0 134 5800 2685 Yes 18.5 5 180 103 67 35 31.5 14 2985 non-USA Volkswagen Passat
4 Volkswagen Corrado Sporty 22.9 23.3 23.7 18 25 None Front 6 2.8 178 5800 2385 Yes 18.5 4 159 97 66 36 26.0 15 2810 non-USA Volkswagen Corrado

This approach lends itself to the description “find all rows matching some value” then “extract the identified rows,” – written as two steps to emphasize there are two passes through the data. Another mental model would be loop over the rows, and keep those that match the query. This is done generically by the filter function for collections in Julia or by the subset function of DataFrames.

The filter(predicate, collection) function is used to identify just the values in the collection for which the predicate function returns true. When a data frame is used with filter, the iteration is over the rows, so the wrapping eachrow iterator is not needed. We need a predicate function to replace the .== above. One follows. It doesn’t need .==, as r is a data frame row and access produces a value not a vector:

filter(r -> r.Manufacturer == "Volkswagen", cars)
4×27 DataFrame
Row Manufacturer Model Type MinPrice Price MaxPrice MPGCity MPGHighway AirBags DriveTrain Cylinders EngineSize Horsepower RPM RevPerMile ManTransAvail FuelTankCapacity Passengers Length Wheelbase Width TurnCircle RearSeatRoom LuggageRoom Weight Origin Make
Cat… String Cat… Float64 Float64 Float64 Int32 Int32 Cat… Cat… Cat… Float64 Int32 Int32 Int32 Cat… Float64 Int32 Int32 Int32 Int32 Int32 Float64? Int32? Int32 Cat… String
1 Volkswagen Fox Small 8.7 9.1 9.5 25 33 None Front 4 1.8 81 5500 2550 Yes 12.4 4 163 93 63 34 26.0 10 2240 non-USA Volkswagen Fox
2 Volkswagen Eurovan Van 16.6 19.7 22.7 17 21 None Front 5 2.5 109 4500 2915 Yes 21.1 7 187 115 72 38 34.0 missing 3960 non-USA Volkswagen Eurovan
3 Volkswagen Passat Compact 17.6 20.0 22.4 21 30 None Front 4 2.0 134 5800 2685 Yes 18.5 5 180 103 67 35 31.5 14 2985 non-USA Volkswagen Passat
4 Volkswagen Corrado Sporty 22.9 23.3 23.7 18 25 None Front 6 2.8 178 5800 2385 Yes 18.5 4 159 97 66 36 26.0 15 2810 non-USA Volkswagen Corrado

There are some conveniences that have been developed to make the predicate functions even easier to write, which will be introduced incrementally:

The basic binary comparisons, like ==, are defined by functions essentially of the form (a,b) -> ==(a,b). There are partially applied versions, essentially a -> ==(a,b), with b applied, formed by ==(b). So the syntax ==("Volkswagen") is a predicate taking a single argument which is then compared for equality to the string "Volkswagen".

However, r is a data frame row, what we need to pass to ==("Volkswagen") is the value of the :Manufacturer column. For this another convenience is available.

The pair syntax, => of the form column selector(s) => predicate function will pass just the value(s) in the column(s) to the predicate function. Combined, the above can be simplified to:

filter(:Manufacturer => ==("Volkswagen"), cars)
4×27 DataFrame
Row Manufacturer Model Type MinPrice Price MaxPrice MPGCity MPGHighway AirBags DriveTrain Cylinders EngineSize Horsepower RPM RevPerMile ManTransAvail FuelTankCapacity Passengers Length Wheelbase Width TurnCircle RearSeatRoom LuggageRoom Weight Origin Make
Cat… String Cat… Float64 Float64 Float64 Int32 Int32 Cat… Cat… Cat… Float64 Int32 Int32 Int32 Cat… Float64 Int32 Int32 Int32 Int32 Int32 Float64? Int32? Int32 Cat… String
1 Volkswagen Fox Small 8.7 9.1 9.5 25 33 None Front 4 1.8 81 5500 2550 Yes 12.4 4 163 93 63 34 26.0 10 2240 non-USA Volkswagen Fox
2 Volkswagen Eurovan Van 16.6 19.7 22.7 17 21 None Front 5 2.5 109 4500 2915 Yes 21.1 7 187 115 72 38 34.0 missing 3960 non-USA Volkswagen Eurovan
3 Volkswagen Passat Compact 17.6 20.0 22.4 21 30 None Front 4 2.0 134 5800 2685 Yes 18.5 5 180 103 67 35 31.5 14 2985 non-USA Volkswagen Passat
4 Volkswagen Corrado Sporty 22.9 23.3 23.7 18 25 None Front 6 2.8 178 5800 2385 Yes 18.5 4 159 97 66 36 26.0 15 2810 non-USA Volkswagen Corrado

It doesn’t save much for this simple example, but this specification style is the basis of the DataFrames mini language, which provides a standardized and performant means of wrangling a data frame.

Filtering may be done on one or more values. There are different approaches.

For example, to first filter by the manufacturer, then by city mileage, we might write a more complicated predicate function using && to combine Boolean values:

pred(r) = r.Manufacturer == "Volkswagen" && r.MPGCity >= 20
filter(pred, cars)
2×27 DataFrame
Row Manufacturer Model Type MinPrice Price MaxPrice MPGCity MPGHighway AirBags DriveTrain Cylinders EngineSize Horsepower RPM RevPerMile ManTransAvail FuelTankCapacity Passengers Length Wheelbase Width TurnCircle RearSeatRoom LuggageRoom Weight Origin Make
Cat… String Cat… Float64 Float64 Float64 Int32 Int32 Cat… Cat… Cat… Float64 Int32 Int32 Int32 Cat… Float64 Int32 Int32 Int32 Int32 Int32 Float64? Int32? Int32 Cat… String
1 Volkswagen Fox Small 8.7 9.1 9.5 25 33 None Front 4 1.8 81 5500 2550 Yes 12.4 4 163 93 63 34 26.0 10 2240 non-USA Volkswagen Fox
2 Volkswagen Passat Compact 17.6 20.0 22.4 21 30 None Front 4 2.0 134 5800 2685 Yes 18.5 5 180 103 67 35 31.5 14 2985 non-USA Volkswagen Passat

The mini language can also be used. On the left hand side of (the first) => a vector of column selectors may be indicated, in which the predicate function (on the right hand side of the first =>) will get multiple arguments that can be used to implement the logic:

pred(m, mpg) = m == "Volkswagen" && mpg >= 20
filter([:Manufacturer, :MPGCity] => pred, cars)
2×27 DataFrame
Row Manufacturer Model Type MinPrice Price MaxPrice MPGCity MPGHighway AirBags DriveTrain Cylinders EngineSize Horsepower RPM RevPerMile ManTransAvail FuelTankCapacity Passengers Length Wheelbase Width TurnCircle RearSeatRoom LuggageRoom Weight Origin Make
Cat… String Cat… Float64 Float64 Float64 Int32 Int32 Cat… Cat… Cat… Float64 Int32 Int32 Int32 Cat… Float64 Int32 Int32 Int32 Int32 Int32 Float64? Int32? Int32 Cat… String
1 Volkswagen Fox Small 8.7 9.1 9.5 25 33 None Front 4 1.8 81 5500 2550 Yes 12.4 4 163 93 63 34 26.0 10 2240 non-USA Volkswagen Fox
2 Volkswagen Passat Compact 17.6 20.0 22.4 21 30 None Front 4 2.0 134 5800 2685 Yes 18.5 5 180 103 67 35 31.5 14 2985 non-USA Volkswagen Passat

Finally, we could use filter twice:

cars1 = filter(:Manufacturer => ==("Volkswagen"), cars)
cars2 = filter(:MPGCity => >=(20), cars1)
2×27 DataFrame
Row Manufacturer Model Type MinPrice Price MaxPrice MPGCity MPGHighway AirBags DriveTrain Cylinders EngineSize Horsepower RPM RevPerMile ManTransAvail FuelTankCapacity Passengers Length Wheelbase Width TurnCircle RearSeatRoom LuggageRoom Weight Origin Make
Cat… String Cat… Float64 Float64 Float64 Int32 Int32 Cat… Cat… Cat… Float64 Int32 Int32 Int32 Cat… Float64 Int32 Int32 Int32 Int32 Int32 Float64? Int32? Int32 Cat… String
1 Volkswagen Fox Small 8.7 9.1 9.5 25 33 None Front 4 1.8 81 5500 2550 Yes 12.4 4 163 93 63 34 26.0 10 2240 non-USA Volkswagen Fox
2 Volkswagen Passat Compact 17.6 20.0 22.4 21 30 None Front 4 2.0 134 5800 2685 Yes 18.5 5 180 103 67 35 31.5 14 2985 non-USA Volkswagen Passat

The above required the introduction of an intermediate data frame to store the result of the first filter call to pass to the second. This threading through of the modified data is quite common in processing pipelines. The first two approaches with complicated predicate functions can grow unwieldly, so staged modification is common. To support that, the chaining or piping operation (|>) is often used:

filter(:Manufacturer => ==("Volkswagen"), cars) |>
    d -> filter(:MPGCity => >=(20), d)
2×27 DataFrame
Row Manufacturer Model Type MinPrice Price MaxPrice MPGCity MPGHighway AirBags DriveTrain Cylinders EngineSize Horsepower RPM RevPerMile ManTransAvail FuelTankCapacity Passengers Length Wheelbase Width TurnCircle RearSeatRoom LuggageRoom Weight Origin Make
Cat… String Cat… Float64 Float64 Float64 Int32 Int32 Cat… Cat… Cat… Float64 Int32 Int32 Int32 Cat… Float64 Int32 Int32 Int32 Int32 Int32 Float64? Int32? Int32 Cat… String
1 Volkswagen Fox Small 8.7 9.1 9.5 25 33 None Front 4 1.8 81 5500 2550 Yes 12.4 4 163 93 63 34 26.0 10 2240 non-USA Volkswagen Fox
2 Volkswagen Passat Compact 17.6 20.0 22.4 21 30 None Front 4 2.0 134 5800 2685 Yes 18.5 5 180 103 67 35 31.5 14 2985 non-USA Volkswagen Passat

The right-hand side of |> expects a function for application, so an anonymous function is created. There are macros available in user-contributed packages that shorten this pattern by using a placeholder.

A popular one is, @chain, in the Chain package, with its basic usage based on two simple principles:

  • a new line is implicitly a pipe
  • a _ receives the argument from the pipe. If none is given, the first argument does

For example,

using Chain
@chain cars begin
    filter(:Manufacturer => ==("Volkswagen"), _)
    filter(:MPGCity => >=(20), _)
end
2×27 DataFrame
Row Manufacturer Model Type MinPrice Price MaxPrice MPGCity MPGHighway AirBags DriveTrain Cylinders EngineSize Horsepower RPM RevPerMile ManTransAvail FuelTankCapacity Passengers Length Wheelbase Width TurnCircle RearSeatRoom LuggageRoom Weight Origin Make
Cat… String Cat… Float64 Float64 Float64 Int32 Int32 Cat… Cat… Cat… Float64 Int32 Int32 Int32 Cat… Float64 Int32 Int32 Int32 Int32 Int32 Float64? Int32? Int32 Cat… String
1 Volkswagen Fox Small 8.7 9.1 9.5 25 33 None Front 4 1.8 81 5500 2550 Yes 12.4 4 163 93 63 34 26.0 10 2240 non-USA Volkswagen Fox
2 Volkswagen Passat Compact 17.6 20.0 22.4 21 30 None Front 4 2.0 134 5800 2685 Yes 18.5 5 180 103 67 35 31.5 14 2985 non-USA Volkswagen Passat

Subset

The calling style filter(predicate, collection) is consistently used with several higher-order functions, for example, map and reduce and is supported by Julia’s do syntax.

However, with data frames, the convention is to have action functions expect the data frame in the first position.

The subset function also returns a filtered copy of the data frame with rows matching a certain selection criteria. It’s usage is somewhat similar to filter with the order of its arguments reversed, but the predicate must return a vector of indices, so broadcasting or the ByRow function may be necessary.

For example, these produce the same data frame:

d1 = subset(cars, :Manufacturer => m -> m .== "Volkswagen") # use .== not just ==
d2 = subset(cars, :Manufacturer => ByRow(==("Volkswagen"))) # use `ByRow` to ensure predicate is applied to each
4×27 DataFrame
Row Manufacturer Model Type MinPrice Price MaxPrice MPGCity MPGHighway AirBags DriveTrain Cylinders EngineSize Horsepower RPM RevPerMile ManTransAvail FuelTankCapacity Passengers Length Wheelbase Width TurnCircle RearSeatRoom LuggageRoom Weight Origin Make
Cat… String Cat… Float64 Float64 Float64 Int32 Int32 Cat… Cat… Cat… Float64 Int32 Int32 Int32 Cat… Float64 Int32 Int32 Int32 Int32 Int32 Float64? Int32? Int32 Cat… String
1 Volkswagen Fox Small 8.7 9.1 9.5 25 33 None Front 4 1.8 81 5500 2550 Yes 12.4 4 163 93 63 34 26.0 10 2240 non-USA Volkswagen Fox
2 Volkswagen Eurovan Van 16.6 19.7 22.7 17 21 None Front 5 2.5 109 4500 2915 Yes 21.1 7 187 115 72 38 34.0 missing 3960 non-USA Volkswagen Eurovan
3 Volkswagen Passat Compact 17.6 20.0 22.4 21 30 None Front 4 2.0 134 5800 2685 Yes 18.5 5 180 103 67 35 31.5 14 2985 non-USA Volkswagen Passat
4 Volkswagen Corrado Sporty 22.9 23.3 23.7 18 25 None Front 6 2.8 178 5800 2385 Yes 18.5 4 159 97 66 36 26.0 15 2810 non-USA Volkswagen Corrado

Unlike filter, subset allows multiple arguments to be applied:

subset(cars,
       :Manufacturer => ByRow(==("Volkswagen")),
       :MPGCity      => ByRow(>=(20)))
2×27 DataFrame
Row Manufacturer Model Type MinPrice Price MaxPrice MPGCity MPGHighway AirBags DriveTrain Cylinders EngineSize Horsepower RPM RevPerMile ManTransAvail FuelTankCapacity Passengers Length Wheelbase Width TurnCircle RearSeatRoom LuggageRoom Weight Origin Make
Cat… String Cat… Float64 Float64 Float64 Int32 Int32 Cat… Cat… Cat… Float64 Int32 Int32 Int32 Cat… Float64 Int32 Int32 Int32 Int32 Int32 Float64? Int32? Int32 Cat… String
1 Volkswagen Fox Small 8.7 9.1 9.5 25 33 None Front 4 1.8 81 5500 2550 Yes 12.4 4 163 93 63 34 26.0 10 2240 non-USA Volkswagen Fox
2 Volkswagen Passat Compact 17.6 20.0 22.4 21 30 None Front 4 2.0 134 5800 2685 Yes 18.5 5 180 103 67 35 31.5 14 2985 non-USA Volkswagen Passat

The mini language uses a default of identity for the function so, if the columns are Boolean, they can be used to subset the data:

dd = DataFrame(a = [true, false], b = [true, missing])
subset(dd, :a)  # just first row
1×2 DataFrame
Row a b
Bool Bool?
1 true true

For data with missing values, like in the :b column of dd, the comparison operators implement 3-valued logic, meaning the response can be true, false, or missing. Using subset(dd, :b) above will error. The keyword skipmissing can be passed true to have these skipped over. The default is false.

subset(dd, :b; skipmissing=true)
1×2 DataFrame
Row a b
Bool Bool?
1 true true

The dropmissing function filters out each row that has a missing value.

3.1.6 Selecting columns; transforming data

Filtering of a data frame can be viewed as accessing the data in df with a boolean set of row indices, as in df[inds, :]. (There are flags to have a “view” of the data, as in df[inds, !]). Filtering returns a data frame with a possibly reduced number of rows, and the same number of columns.

The select function is related in that it can be viewed as passing a Boolean vector of column indices to select specific columns or variables, as in df[:, inds]. The syntax of select also extends this allowing transformations of the data and reassignment, as could be achieved through this pattern: df.new_variable = f(other_variables_in_df).

The select function will return a data frame with the same number of rows (cases) as the original data frame. The variables returned are selected using column selectors. There is support for the DataFrames mini language.

First, we can select one or more columns by separating column selectors with commas. In the following, using the backpack data in the variable d, we select the first column by column number, the second using a symbol, the third by a string, and the fourth and fifth using a regular expression:2

select(d, 1, :v, "p", r"^l")
3×5 DataFrame
Row b v p lap load
String Int64 Float64? Cat… Cat…
1 Goruck GR2 40 395.99 Y front panel
2 Minaal 3.0 35 349.99 Y front panel
3 Genius 25 228.99 Y clamshell

One use of select is to rearrange the column order, say by moving a column to the front. The : or All() column selector will add the rest. That is this command would move “d” to the front of the other columns: select(d, :d, All()).

Selection also allows an easy renaming of column names, using the pairs notation column name => new column name:

select(d, :v => :volume, "p" => "price")
3×2 DataFrame
Row volume price
Int64 Float64?
1 40 395.99
2 35 349.99
3 25 228.99

This usage is a supported abbreviation of source_column => function => target_column_name, where the function is identity. Here the function is not necessarily a predicate function, as it is with subset, but rather a function which returns a vector of the same length as the number of columns in the data frame (as select always returns the same number of columns as the data frame it is passed.)

The function receives column vectors and should return a column vector. For a scalar function, it must be applied to each entry of the column vector, that is, each row. Broadcasting is one manner to do this. As used previously, the DataFrames package provides ByRow to also apply a function to each row of the column vector(s).

For example to compute a ratio of price to volume, we might have:

select(d, [:p, :v] => ByRow((p,v) -> p/v) => "price to volume")
3×1 DataFrame
Row price to volume
Float64
1 9.89975
2 9.99971
3 9.1596

The above would be equivalent to select(d, [:p, :v] => ((p,v) -> p./v) => "price to volume") (broadcasted p ./ v and parentheses), but not select(d, [:p, :v] => (p,v) -> p./v => "price to volume"), as the precedence rules for => do not parse the expressions identically. That is, the use of parentheses around an anonymous function is needed.

The specification of a target column name is not necessary, as one will be generated. In this particular example, it would be :p_v_function. For named functions, the automatic namings are a bit more informative, as the method name replaces the “function” part. The renamecols keyword argument can be set to false to drop the addition of a function name, which in this case would leave :p_v as the name. For a single-column selector it would not rename the column, rather replace it.

The above reduced the number of columns to just that specified. The : selector will select all columns (as it does with indexing) and return them:

select(d, [:p, :v] => ByRow((p,v) -> p/v) => "price to volume", :) # return all columns
3×7 DataFrame
Row price to volume b v p lap load d
Float64 String Int64 Float64? Cat… Cat… Date
1 9.89975 Goruck GR2 40 395.99 Y front panel 2022-09-01
2 9.99971 Minaal 3.0 35 349.99 Y front panel 2022-09-01
3 9.1596 Genius 25 228.99 Y clamshell 2022-10-01

As seen in this last example, the source column selectors can select more than one column. The function receives multiple arguments. It is possible that there be multiple target columns. For example, we could compute both price-to-volume and volume-to-price quantities. A function to do so would be

pv_and_vp(p,v) = [p/v, v/p]
pv_and_vp (generic function with 1 method)

When we use this, we see the two values are stored in one column.

select(d, [:p, :v] => ByRow(pv_and_vp))
3×1 DataFrame
Row p_v_pv_and_vp
Array…
1 [9.89975, 0.101013]
2 [9.99971, 0.100003]
3 [9.1596, 0.109175]

To split these up, we can give two names:

select(d, [:p, :v] => ByRow(pv_and_vp) => [:pv, :vp])
3×2 DataFrame
Row pv vp
Float64 Float64
1 9.89975 0.101013
2 9.99971 0.100003
3 9.1596 0.109175

AsTable

The AsTable function has two meanings in the mini language. If on the target side, it assumes the output of the function is a vector of containers (in this example the vectors [p/v, v/p]) that should be expanded into multiple columns. It uses the keys (were the containers named tuples) or, in this case generates them, to create multiple columns in the destination:

select(d, [:p, :v] => ByRow(pv_and_vp) => AsTable)
3×2 DataFrame
Row x1 x2
Float64 Float64
1 9.89975 0.101013
2 9.99971 0.100003
3 9.1596 0.109175

Had we used a named tuple in our function, pv_and_vp(p, v) = (pv = p/v, vp = v/p), then the column names would come from the tuple’s names and need not be generated.

When AsTable is used on the source columns, as in AsTable([:p,:v]) then the function receives a named tuple with column vector entries. (For this, pv_and_vp would be written pv_and_vp(r) = [r.p/r.v, r.v/r.p], with r representing a row in a two-column table with names :p and :v.

Transform

Extending the columns in the data frame by select is common enough that the function transform is supplied which always keeps the columns of the original data frame, though they can also be modified through the mini language. The use of transfrom is equivalent to select(df, :, args...).

The DataFrames’ mini language

The mini language is well documented in the documentation string of transform and the blog post of Bogumil Kasminski “DataFrames.jl minilanguage explained.”

3.1.7 Combine

The combine function creates a new data frame whose columns are the result of transformations applied to the source data frame. The name will be more clear after we discuss grouping or splitting of a data set.

A typical usage of combine is to apply some reduction to the data, for example finding an average.

In this example, we use the mean function from the StatsBase package and apply that to the :MPGCity measurements in the cars data:

combine(cars, :MPGCity => mean)
1×1 DataFrame
Row MPGCity_mean
Float64
1 22.3656

(A second pair can be used to specify a name for the target, as in :MPGCity => mean => :AvgMPGCity)

There are several numeric variables in this data set, we may wish to find the mean of more than 1 at a time. For this, we can broadcast the columns via .=>, as in this example:

combine(cars, [:MPGCity, :MPGHighway] .=> mean)
1×2 DataFrame
Row MPGCity_mean MPGHighway_mean
Float64 Float64
1 22.3656 29.086

(That is an alternate to combine(cars, :MPGCity => mean, :MPGHighway => mean).)

Broadcasting => lends itself to succinctly applying a function to multiple columns.

For example, to apply mean to all the numeric values, we might select them first (using the filtering feature of names), then pass to the mini language, as follows:

nms = names(cars, Real)
combine(cars, nms .=> mean; renamecols=false)
1×16 DataFrame
Row MinPrice Price MaxPrice MPGCity MPGHighway EngineSize Horsepower RPM RevPerMile FuelTankCapacity Passengers Length Wheelbase Width TurnCircle Weight
Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64
1 17.1258 19.5097 21.8989 22.3656 29.086 2.66774 143.828 5280.65 2332.2 16.6645 5.08602 183.204 103.946 69.3763 38.957 3072.9

More than one transformation at a time is possible, as already seen in the mini language. With combine the number of rows is determined by the output of the transformation. In this example, the mean reduces to a number, but unique reduces :Origin to the number of levels. The result of mean is repeated, it is not the mean for each group (a task we demonstrate shortly):

combine(cars, :MPGCity => mean, :Origin => unique)
2×2 DataFrame
Row MPGCity_mean Origin_unique
Float64 String
1 22.3656 non-USA
2 22.3656 USA

3.1.8 Flatten

As mentioned, the repeat function is used to repeat values in a vector, or other iterable. The simplest usage looks like this:

repeat([1,2,3], 2) |> show
[1, 2, 3, 1, 2, 3]

A common use of repetition is to take two variables, say x and y, perhaps of unequal length and create a data frame with two columns: one to store the values and one to indicate if a value came from x or y. This could be achieved like

x = [1,2,3]
y = [4, 5]

DataFrame(variable=vcat(repeat([:x], length(x)), repeat([:y], length(y))),
          values = vcat(x,y))
5×2 DataFrame
Row variable values
Symbol Int64
1 x 1
2 x 2
3 x 3
4 y 4
5 y 5

The flatten function offers an alternative. Consider the data frame:

x = [1,2,3]
y = [4, 5]
dxy = DataFrame(variable=[:x,:y], value=[x, y])
2×2 DataFrame
Row variable value
Symbol Array…
1 x [1, 2, 3]
2 y [4, 5]

The flatten function will iterate over the indicated columns and repeat the other variables:

flatten(dxy, :value)
5×2 DataFrame
Row variable value
Symbol Int64
1 x 1
2 x 2
3 x 3
4 y 4
5 y 5

3.1.9 SplitApplyCombine

The split-apply-combine strategy for wrangling data frames was popularized in the influential paper “The Split-Apply-Combine Strategy for Data Analysis: (Wickham 2011) by H. Wickham which introduces this description:”where you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together.”

Grouping data

The groupby function for data frames does the breaking up, the DataFrames mini language can operate on each piece, and the combine function can put things together again.

Grouping takes one or more categorical variables and creates a “grouped” data frame for each different set of combined levels. That is one grouped data frame for each possible combination of the levels. The groupby function splits the data frame into pieces of type GroupedDataFrame. The basic signature is groupby(df, cols; sort=nothing, skipmissing=false). The cols are one or more column selectors. The value of sort defaults to nothing leaving the order of the groups in the result undefined, but the grouping uses the fastest identified algorithm. The skipmissing argument can instruct the skipping of rows which have missing values in the selected columns.

For example, grouping the baggage data by loading style, returns two sub data frames.

gdf = groupby(d, :load)

GroupedDataFrame with 2 groups based on key: load

First Group (2 rows): load = CategoricalValue{String, UInt32} "front panel"
Row b v p lap load d
String Int64 Float64? Cat… Cat… Date
1 Goruck GR2 40 395.99 Y front panel 2022-09-01
2 Minaal 3.0 35 349.99 Y front panel 2022-09-01

Last Group (1 row): load = CategoricalValue{String, UInt32} "clamshell"
Row b v p lap load d
String Int64 Float64? Cat… Cat… Date
1 Genius 25 228.99 Y clamshell 2022-10-01

Grouping by more than one variable is achieved by passing in a selector with multiple variables. The following also creates two sub data frames, but only because the unique combination of both load and date are the same as just load:

gdf = groupby(d, [:load, :d])

GroupedDataFrame with 2 groups based on keys: load, d

First Group (2 rows): load = CategoricalValue{String, UInt32} "front panel", d = 2022-09-01
Row b v p lap load d
String Int64 Float64? Cat… Cat… Date
1 Goruck GR2 40 395.99 Y front panel 2022-09-01
2 Minaal 3.0 35 349.99 Y front panel 2022-09-01

Last Group (1 row): load = CategoricalValue{String, UInt32} "clamshell", d = 2022-10-01
Row b v p lap load d
String Int64 Float64? Cat… Cat… Date
1 Genius 25 228.99 Y clamshell 2022-10-01

Indices may be used to get each sub data frame. Moreover, the keys method returns keys, like indices, to efficiently look up the different sub data frames in gdf. The levels for the keys are found in the property .load, as with keys(gdf)[1].load. Grouped data frames may be iterated over; the pairs iterator iterates over both keys and values.

The select and subset functions work over GroupedDataFrame objects, here we see the returned values are combined:

subset(gdf, :p => ByRow(<=(350)))
2×6 DataFrame
Row b v p lap load d
String Int64 Float64? Cat… Cat… Date
1 Minaal 3.0 35 349.99 Y front panel 2022-09-01
2 Genius 25 228.99 Y clamshell 2022-10-01

(The command [subset(g, :p => ByRow(<=(350))) for g in gdf] avoids the combine step.)

Group and combine

We can see that combine also works over GroupedDataFrame objects. This example shows how to find the mean city mileage of groups formed by the value of :Origin in the cars data set:

combine(groupby(cars, :Origin), :MPGCity => mean)
2×2 DataFrame
Row Origin MPGCity_mean
Cat… Float64
1 USA 20.9583
2 non-USA 23.8667

Typically, the function applied to a column is a reduction, which results in aggregated data.

By default the keys used to group (:Origin) are kept by combine. The keepkeys argument for combine when applied to grouped data controls this feature.

More than one column can be used to group the data. This example first groups by the number of cylinders, then groups by origin. For each of the 9 resulting groups, the group mean for city mileage is taken:

combine(groupby(cars, [:Cylinders, :Origin]), :MPGCity => mean)
9×3 DataFrame
Row Cylinders Origin MPGCity_mean
Cat… Cat… Float64
1 3 non-USA 39.3333
2 4 USA 24.4091
3 4 non-USA 25.2222
4 5 non-USA 18.5
5 6 USA 18.35
6 6 non-USA 18.5455
7 8 USA 17.0
8 8 non-USA 17.0
9 rotary non-USA 17.0

The @chain macro may make this more readable:

@chain cars begin
    groupby([:Cylinders, :Origin])
    combine(:MPGCity => mean)
    describe
end
3×7 DataFrame
Row variable mean min median max nmissing eltype
Symbol Union… Any Union… Any Int64 DataType
1 Cylinders 3 rotary 0 CategoricalValue{String, UInt8}
2 Origin USA non-USA 0 CategoricalValue{String, UInt8}
3 MPGCity_mean 21.7067 17.0 18.5 39.3333 0 Float64

Example 3.1 (Lego sets) For an example, we download a data set from the internet describing all the sets of Legos sold during some time period. This data set is part of the data sets provided by openintro.org, an organization trying to make interesting educational products that are free and transparent.

csv_data = download("https://www.openintro.org/data/csv/lego_population.csv")
legos = CSV.read(csv_data, DataFrame)
first(legos, 2)
2×14 DataFrame
Row item_number set_name theme pieces price amazon_price year ages pages minifigures packaging weight unique_pieces size
Int64 String String31 String7 String31 String31 Int64 String15 String7 String3 String31 String31 String7 String7
1 41916 Extra Dots - Series 2 DOTS 109 3.99 3.44 2020 Ages_6+ NA NA Foil pack NA 6 Small
2 41908 Extra Dots - Series 1 DOTS 109 3.99 3.99 2020 Ages_6+ NA NA Foil pack NA 6 Small

Let’s explore this data.

The size of the data set is printed, though suppressed above as only the first 2 rows are requested to be shown, or can be programmatically identified with size(legos).

size(legos)
(1304, 14)

Each row is a product; there are a lot of different ones.

The types identified by CSV.read are not perfect. The data uses NA for not-available. We read again using the argument missingstring="NA":

legos = CSV.read(csv_data, DataFrame; missingstring="NA")
first(legos, 2)  # show first 2 rows
2×14 DataFrame
Row item_number set_name theme pieces price amazon_price year ages pages minifigures packaging weight unique_pieces size
Int64 String String31? Int64? Float64? Float64? Int64 String15 Int64? Int64? String31? String31? Int64? String7?
1 41916 Extra Dots - Series 2 DOTS 109 3.99 3.44 2020 Ages_6+ missing missing Foil pack missing 6 Small
2 41908 Extra Dots - Series 1 DOTS 109 3.99 3.99 2020 Ages_6+ missing missing Foil pack missing 6 Small

The :theme and :packaging variables are categorical, so we make them so:

legos.theme = categorical(legos.theme)
legos.packaging = categorical(legos.packaging);
first(legos, 2)
2×14 DataFrame
Row item_number set_name theme pieces price amazon_price year ages pages minifigures packaging weight unique_pieces size
Int64 String Cat…? Int64? Float64? Float64? Int64 String15 Int64? Int64? Cat…? String31? Int64? String7?
1 41916 Extra Dots - Series 2 DOTS 109 3.99 3.44 2020 Ages_6+ missing missing Foil pack missing 6 Small
2 41908 Extra Dots - Series 1 DOTS 109 3.99 3.99 2020 Ages_6+ missing missing Foil pack missing 6 Small

We see data on the number of pieces and the age range. Is there some relationship?

The :age variable should be an ordered factor, ordered by the youngest age intended for use. The :ages variable has a pattern Ages_startXXX where XXX may be + to indicate or up, or a dash to indicate a range. We use this to identify the youngest intended age.

# alternative to
# `(m = match(r"Ages_(\d+)", x); m === nothing ? missing : parse(Int, m.captures[1]))`
function pick_first_age(x)
    ismissing(x) && return missing
    nos = collect(string(0:9...))
    out = ""
    for xi in x[6:end] # drop Ages_
        !(xi in nos) && break
        out = out * xi
    end
    out == "" && return missing
    parse(Int, out)
end


transform!(legos, :ages => ByRow(pick_first_age) => :youngest_age)
legos.youngest_age = categorical(legos.youngest_age, ordered=true)
first(legos[:,r"age"], 2)
2×3 DataFrame
Row ages pages youngest_age
String15 Int64? Cat…?
1 Ages_6+ missing 6
2 Ages_6+ missing 6

With that ordering, an expected pattern becomes clear – kits for older users have on average more pieces – though there are unexpected exceptions:

@chain legos begin
    groupby(:youngest_age)
    combine([:pieces, :unique_pieces] .=> meanskipmissing
            .=> [:avg_pieces, :avg_unique_pieces])
end
16×3 DataFrame
Row youngest_age avg_pieces avg_unique_pieces
Cat…? Float64 Float64
1 missing 127.592 40.5604
2 1 37.2083 26.2174
3 2 55.617 31.2979
4 3 140.0 41.0
5 4 194.74 87.2857
6 5 147.712 85.1557
7 6 191.079 87.1289
8 7 265.556 108.816
9 8 491.193 173.801
10 9 859.182 252.773
11 10 417.52 109.768
12 11 2357.86 214.143
13 12 1307.29 256.4
14 14 1617.25 349.5
15 16 2454.79 409.107
16 18 2073.84 214.294

(We composed the two functions skipmissing with mean using the \circ[tab] operator and use .=> in both places to avoid defining the transformation function twice.)


The Lego Group

Lego, was the largest toy company in the world by 2021. By 2015 it had produced over 600 billion parts.

Lego is a mature company with an origin dating to 1934. The number of products per year should be fairly stable, though it is easy enough to check. The data set has a :year variable, so we would only need to group the data by that, then count the number of cases per each group. The nrow function will count the cases. This function is special cased in the DataFrames’ mini language and can appear by itself:

combine(groupby(legos, :year), nrow) # use `nrow => :n`, or some such, to label differently
3×2 DataFrame
Row year nrow
Int64 Int64
1 2018 442
2 2019 423
3 2020 439

We can take other summaries by year, for example, here we find the average price by year and size:

@chain legos begin
    groupby([:year, :size])
    combine(nrow => :n,
            :price => meanskipmissing => :avg_price
            )
end
9×4 DataFrame
Row year size n avg_price
Int64 String7? Int64 Float64
1 2018 Small 324 45.7792
2 2018 Large 21 28.59
3 2018 missing 97 25.8135
4 2019 Small 322 50.4663
5 2019 Large 14 29.6329
6 2019 missing 87 13.8536
7 2020 Small 335 49.8104
8 2020 Large 18 33.5456
9 2020 missing 86 21.6959

The youngest age range is a bit long. We wish to collapse it to the ranges 1-6, 7-12, and 12-18. The data has been stored as a categorical array. The cut function can group numeric data easily, but to group categorical data, the cut-ranges must be promoted to a CategoricalValue. We have the following, where Ref is used to stop the broadcasting for that value:

vals = CategoricalValue.([1, 6, 12, 18], Ref(legos.youngest_age))
transform!(legos,
           :youngest_age => (x ->cut(x, vals; extend=true))
           => :youngest_age_range);

We now check if the number of pieces has dramatically changed over the year. We break up the data by the age range, as we expect variation in that value so it is best to dis-aggregate over that factor:

@chain legos begin
    groupby([:youngest_age_range, :year])
    combine(nrow => :n,
            :pieces => meanskipmissing => :avg_pieces)
end
12×4 DataFrame
Row youngest_age_range year n avg_pieces
Cat…? Int64 Int64 Float64
1 missing 2018 33 165.379
2 missing 2019 34 95.7778
3 missing 2020 44 111.8
4 [1, 6) 2018 98 131.561
5 [1, 6) 2019 92 122.533
6 [1, 6) 2020 92 153.413
7 [6, 12) 2018 292 341.866
8 [6, 12) 2019 275 389.945
9 [6, 12) 2020 276 365.887
10 [12, 18] 2018 19 1956.21
11 [12, 18] 2019 22 1938.95
12 [12, 18] 2020 27 2111.3

No real pattern shows up.

There are many different themes. The command unique(legos.theme) shows 41. Which are the most popular? To see, the data is grouped by the theme, each group is counted, missing themes are dropped, then the data frame is sorted and the top 5 rows are shown:

@chain legos begin
    groupby(:theme)
    combine(nrow => :n)
    dropmissing
    sort(:n; rev=true)
    first(5)
end
5×2 DataFrame
Row theme n
Cat… Int64
1 Star Wars™ 119
2 Friends 103
3 City 101
4 NINJAGO® 78
5 DUPLO® 53

The SplitApplyCombine package

There is support for the split-apply-combine paradigm outside of the DataFrames package in the package SplitApplyCombine. This package is much lighter weight than DataFrames. This support targets other representations of tabular data, such as a vector of nested tuples:

using SplitApplyCombine
tbl = [
(b = "Goruck GR2", v = 40, p = 395, lap = "Y", load = "front panel", d = Date("2022-09-01")),
(b = "Minaal 3.0", v = 35, p = 349, lap = "Y", load = "front panel", d = Date("2022-09-01")),
(b = "Genius    ",     v = 25, p = 228, lap = "Y", load = "clamshell",   d = Date("2022-10-01"))
]
3-element Vector{@NamedTuple{b::String, v::Int64, p::Int64, lap::String, load::String, d::Date}}:
 (b = "Goruck GR2", v = 40, p = 395, lap = "Y", load = "front panel", d = Date("2022-09-01"))
 (b = "Minaal 3.0", v = 35, p = 349, lap = "Y", load = "front panel", d = Date("2022-09-01"))
 (b = "Genius    ", v = 25, p = 228, lap = "Y", load = "clamshell", d = Date("2022-10-01"))

Many of the main verbs are generic functions: filter, map, reduce; group does grouping. There are some others. A few examples follow.

For indexing this structure, we can’t index by [row, col], rather we can use [row][col] notation, the first to extract the row, the second to access the entries in the column. For example, he we index a selected row by position, by name, and then with multiple names:

tbl[2][2], tbl[2].v, tbl[2][ [:v, :p] ]
(35, 35, (v = 35, p = 349))

The invert function reverses the access pattern, allowing in this case, [col][row] access, with:

itbl = invert(tbl)
itbl[3][1], itbl[:p][2]
(395, 349)

In some ways, the invert function flips the viewpoint from an array of structs to a struct of arrays.

To filter out rows, we have the same calling style as with data frames: filter(pred, tbl). With the outer container of tbl being an array, no special method is utilized, iteration is over each row. For example,

filter(r -> r.v >= 35, tbl) |> DataFrame
2×6 DataFrame
Row b v p lap load d
String Int64 Int64 String String Date
1 Goruck GR2 40 395 Y front panel 2022-09-01
2 Minaal 3.0 35 349 Y front panel 2022-09-01

(The call to DataFrame is to take advantage of the prettier printing. DataFrame can consume a vector of named tuples for its input.)

The DataPipes package composes a bit more nicely with this approach, but we continue using Chain for examples3, like this one of nested calls to filter:

@chain tbl begin
    filter(r -> r.v >= 35,  _)
    filter(r -> r.p >= 350, _)
    DataFrame
end
1×6 DataFrame
Row b v p lap load d
String Int64 Int64 String String Date
1 Goruck GR2 40 395 Y front panel 2022-09-01

Selecting columns and adding a transformation can be achieved through map, which iterates over each named tuple in this example:

map(r -> (b=r.b, v=r.v, p=r.p, pv = round(r.p/r.v; digits=2)), tbl) |>
    DataFrame
3×4 DataFrame
Row b v p pv
String Int64 Int64 Float64
1 Goruck GR2 40 395 9.88
2 Minaal 3.0 35 349 9.97
3 Genius 25 228 9.12

The group function can group by values of a function call. This example from the package’s Readme is instructive:

nms = ["Andrew Smith", "John Smith", "Alice Baker",
       "Robert Baker", "Jane Smith", "Jason Bourne"]
group(last, split.(nms))
3-element Dictionaries.Dictionary{SubString{String}, Vector{Vector{SubString{String}}}}
  "Smith" │ Vector{SubString{String}}[["Andrew", "Smith"], ["John", "Smith"], […
  "Baker" │ Vector{SubString{String}}[["Alice", "Baker"], ["Robert", "Baker"]]
 "Bourne" │ Vector{SubString{String}}[["Jason", "Bourne"]]

The names are split on spaces, and the keys of the group are determined by the last function, which here gets the last piece after splitting (aka the “last” name, but by coincidence, not intelligence).

For numeric operations, this is quite convenient. Here we group by the remainder upon division by 3:

group(x -> rem(x,3), 1:10)
3-element Dictionaries.Dictionary{Int64, Vector{Int64}}
 1 │ [1, 4, 7, 10]
 2 │ [2, 5, 8]
 0 │ [3, 6, 9]

The grouping basically creates a dictionary and adds to the key determined by the function.

To apply a function to the values of a base Julia dictionary is not directly supported by map, however the dictionary used by group (from the Dictionaries packages) does allow this, so we could, for example, add up the values in each group with:

map(sum, group(x -> rem(x,3), 1:10))
3-element Dictionaries.Dictionary{Int64, Int64}
 1 │ 22
 2 │ 15
 0 │ 18

The group function can do this in one call, by placing the function to apply in between the by function and the table. Here we apply first to pick off the first name. That is, ["Andrew", "Smith"] becomes just "Andrew", the first element of that vector.

gps = group(last, first, split.(nms))
3-element Dictionaries.Dictionary{SubString{String}, Vector{SubString{String}}}
  "Smith" │ SubString{String}["Andrew", "John", "Jane"]
  "Baker" │ SubString{String}["Alice", "Robert"]
 "Bourne" │ SubString{String}["Jason"]

Example 3.2 (A tally function) The Tally package provides a quick way to tally up numbers. Here we do a simplified version.

To tally, we have three steps: group by each by value, count the number in each groups, sort the values. This is implemented with this composition of functions:

tally(v; by=identity) = SplitApplyCombine.sortkeys(map(length, group(by, v)))
tally (generic function with 1 method)

To see, we have:

v = rand(1:5, 50)
d = tally(v)
5-element Dictionaries.Dictionary{Int64, Int64}
 1 │ 9
 2 │ 9
 3 │ 11
 4 │ 9
 5 │ 12

Using map, we can provide an alternate, oft used view for tallying:

tallies = "\u007C"^4
ftally = "┼┼┼┼ "
function prison_count(x)
    d, r = divrem(x, 5)
    ftally^d * tallies[1:r]
end


map(prison_count, d)
5-element Dictionaries.Dictionary{Int64, Any}
 1 │ "┼┼┼┼ ||||"
 2 │ "┼┼┼┼ ||||"
 3 │ "┼┼┼┼ ┼┼┼┼ |"
 4 │ "┼┼┼┼ ||||"
 5 │ "┼┼┼┼ ┼┼┼┼ ||"

Example 3.3 (A simple stem and leaf diagram) The grouping and sorting made quite accessible by SplitApplyCombine could also be used to produce a simple stem-and-leaf type diagram. Here we don’t bother displaying stems with no associated leaves, which does change the interpretation of spread in the data, and could easily be added in, though it takes us a bit afield, as it requires some manipulation of the display:

x = rand(0:100, 32)  # 32 test scores
stem_unit = 10       # stem * 10 is value of stem
@chain x begin
    round.(Int, 10*_/stem_unit)
    SplitApplyCombine.group(x -> x ÷ 10, x->rem(x,10), _)
    SplitApplyCombine.sortkeys(_)
    map(sort, _)
    map(x -> parse(Int,join(x)), _)
end
10-element Dictionaries.Dictionary{Int64, Int64}
 0 │ 1139
 1 │ 78
 2 │ 37
 3 │ 478
 4 │ 1489
 5 │ 468
 6 │ 233
 7 │ 99
 8 │ 55
 9 │ 2478

Let’s return to the cars data and use group to identify the average mileage per type. For this task, we group by :Typeand then apply a function to extract the mileage from a row. The copy.(eachrow(cars)) command creates a vector of named tuples.

cs = copy.(eachrow(cars))  # Can also call `NamedTuple` for `copy`
gps = group(r -> r.Type, r -> r.MPGCity, cs)
6-element Dictionaries.Dictionary{CategoricalValue{String, UInt8}, Vector{Int32}}
 CategoricalValue{String, UInt8} "Smal… │ Int32[25, 29, 23, 29, 31, 23, 46, 42,…
 CategoricalValue{String, UInt8} "Mids… │ Int32[18, 19, 22, 22, 19, 16, 21, 21,…
 CategoricalValue{String, UInt8} "Comp… │ Int32[20, 25, 25, 23, 22, 22, 24, 26,…
 CategoricalValue{String, UInt8} "Larg… │ Int32[19, 16, 16, 17, 20, 20, 20, 18,…
 CategoricalValue{String, UInt8} "Spor… │ Int32[19, 17, 18, 22, 24, 30, 24, 26,…
  CategoricalValue{String, UInt8} "Van" │ Int32[18, 15, 17, 15, 18, 17, 18, 18,…

We now can map mean over each group:

ms = map(mean, gps)
6-element Dictionaries.Dictionary{CategoricalValue{String, UInt8}, Float64}
   CategoricalValue{String, UInt8} "Small" │ 29.857142857142858
 CategoricalValue{String, UInt8} "Midsize" │ 19.545454545454547
 CategoricalValue{String, UInt8} "Compact" │ 22.6875
   CategoricalValue{String, UInt8} "Large" │ 18.363636363636363
  CategoricalValue{String, UInt8} "Sporty" │ 21.785714285714285
     CategoricalValue{String, UInt8} "Van" │ 17.0

This requires two passes through the grouped data, the first to get just the mileage values, the next to call mean. The following DataFrames syntax hides this:

combine(groupby(cars, :Type), :MPGHighway => mean)
6×2 DataFrame
Row Type MPGHighway_mean
Cat… Float64
1 Compact 29.875
2 Large 26.7273
3 Midsize 26.7273
4 Small 35.4762
5 Sporty 28.7857
6 Van 21.8889

Reductions, like sum and count which can be written easily using the higher-order function reduce, have support to combine the grouping and reduction. The groupreduce function takes the additional arguments of reduce (an operation and an optional init value). For example, to get the group sum, we could call:

groupreduce(r -> r.Type, r -> r.MPGCity, +, cs)
6-element Dictionaries.Dictionary{CategoricalValue{String, UInt8}, Int32}
   CategoricalValue{String, UInt8} "Small" │ 627
 CategoricalValue{String, UInt8} "Midsize" │ 430
 CategoricalValue{String, UInt8} "Compact" │ 363
   CategoricalValue{String, UInt8} "Large" │ 202
  CategoricalValue{String, UInt8} "Sporty" │ 305
     CategoricalValue{String, UInt8} "Van" │ 153

The group sum and group count have shortcuts, so the mean could have been computed as:

groupsum(r -> r.Type, r -> r.MPGCity, cs) ./ groupcount(r -> r.Type, cs)
6-element Dictionaries.Dictionary{CategoricalValue{String, UInt8}, Float64}
   CategoricalValue{String, UInt8} "Small" │ 29.857142857142858
 CategoricalValue{String, UInt8} "Midsize" │ 19.545454545454547
 CategoricalValue{String, UInt8} "Compact" │ 22.6875
   CategoricalValue{String, UInt8} "Large" │ 18.363636363636363
  CategoricalValue{String, UInt8} "Sporty" │ 21.785714285714285
     CategoricalValue{String, UInt8} "Van" │ 17.0

(Unlike for a Dict instance, the dictionaries above allow such division.)


The “combine” part of the paradigm takes the pieces and reassembles. The data frames example returns a data frame, whereas with SplitApplyCombine.jl we have a dictionary. A dictionary does not map directly to a data frame. While both dictionaries and named tuples are associative arrays, we wouldn’t necessarily want to map our data to a row. Rather, here we use the pairs iterator to iterate over the keys and values to produce an array of named tuples:

[ (; Type, MPG) for (Type,MPG) in pairs(ms)] |> DataFrame
6×2 DataFrame
Row Type MPG
Cat… Float64
1 Small 29.8571
2 Midsize 19.5455
3 Compact 22.6875
4 Large 18.3636
5 Sporty 21.7857
6 Van 17.0

3.1.10 Aggregating data

Somewhat inverse to flattening data is to aggregate data, by which we mean combining multiple columns into a single one, or combining multiple rows into a single one. The select and combine methods can do these tasks.

Consider the three price variables in the cars data. We can find an average of the three price numbers easily enough. This function allows the specification as different arguments, not in a container:

_mean(xs...) = mean(xs)
_mean (generic function with 1 method)

We can then use select to aggregate the three price variables. In the following, these are selected with a regular expression:

cars_price = select(cars, 1:3, r"Price" => ByRow(_mean) => :price)
first(cars_price, 3)
3×4 DataFrame
Row Manufacturer Model Type price
Cat… String Cat… Float64
1 Acura Integra Small 15.8667
2 Acura Legend Midsize 33.9333
3 Audi 90 Compact 29.1

To aggregate over columns, the groupby construct can be used to specify the groupings to aggregate over. In the following this is the car type. The Chain style is used to make the steps more explicit:

@chain cars_price begin
    groupby(:Type)
    combine(:price => mean => :avg_price)
end
6×2 DataFrame
Row Type avg_price
Cat… Float64
1 Compact 18.2104
2 Large 24.303
3 Midsize 27.2152
4 Small 10.1667
5 Sporty 19.4024
6 Van 19.1111

As mean is a reduction, the result has one row for each unique set of levels in the column selector specified to groupby. Again, that these columns appear in the output is due to the chosen default value ofkeepkeys.

This similar example shows grouping by two variables and then aggregating miles per gallon over each:

@chain cars begin
       groupby([:Type, :DriveTrain])
       combine(:MPGHighway => mean => :avg_MPG)
       sort(:avg_MPG, rev=true)
end
14×3 DataFrame
Row Type DriveTrain avg_MPG
Cat… Cat… Float64
1 Small Front 35.6842
2 Small 4WD 33.5
3 Sporty Front 30.5714
4 Compact Front 30.0769
5 Compact 4WD 30.0
6 Compact Rear 28.5
7 Large Front 27.2857
8 Midsize Front 27.1765
9 Sporty 4WD 27.0
10 Sporty Rear 27.0
11 Large Rear 25.75
12 Midsize Rear 25.2
13 Van Front 22.5
14 Van 4WD 21.4

3.1.11 Tidy data; stack, unstack

The split-apply-combine paradigm suggest that storing data so that it can be split readily is preferable to other ways of storage. The notion of “tidy data,” (Wickham 2014) as presented in R’s tidyr package and described in this vignette suggests data sets should follow:

  • Every column is a variable.
  • Every row is an observation.
  • Every cell is a single value.

These terms are defined by

A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Values are organised in two ways. Every value belongs to a variable and an observation. A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes.

In addition to flatten, the DataFrames package provides the functions stack and unstack for tidying data.

Stack and unstack

The stack and unstack commands are used to move from “wide” format, where related values are stored in multiple columns with column names representing some level, to “long” format, where related values are stored in one column, and their categorical levels in another column.

For example, this wide format as the attributes are in columns :A and :B; the values in columns :C and :D.

wide = DataFrame(A=["A","B"], B=["a","b"], C=["c1","c2"], D=["d1","d2"])
2×4 DataFrame
Row A B C D
String String String String
1 A a c1 d1
2 B b c2 d2

With the stack function, the values are placed into a single column (by default :value) and the variable names placed into a single column (by default :variable):

long = stack(wide, [:C, :D])
4×4 DataFrame
Row A B variable value
String String String String
1 A a C c1
2 B b C c2
3 A a D d1
4 B b D d2

The values in the :variable column come from the data frame names.

With unstack the data can be spread back out into “wide” format. We pass in the two columns indicating the variable and the values to unstack:

unstack(long, :variable, :value)
2×4 DataFrame
Row A B C D
String String String? String?
1 A a c1 d1
2 B b c2 d2

This figure may help visualize the process.

Illustration of wide and long formats. The stack and unstack functions can reshape data from one to the other.

We illustrate this further with an abbreviated version of an example in the tidyr vignette.

Consider this data from the Global Historical Climatology Network for one weather station (MX17004) in Mexico for five months in 2010. The full data has 31 days per month, this abbreviated data works with just the first 8:

weather = """
"id","year","month","element","d1","d2","d3","d4","d5","d6","d7","d8"
"MX17004","2010","1","tmax","---","---","---","---","---","---","---","---"
"MX17004","2010","1","tmin","---","---","---","---","---","---","---","---"
"MX17004","2010","2","tmax","---","27.3","24.1","---","---","---","---","---"
"MX17004","2010","2","tmin","---","14.4","14.4","---","---","---","---","---"
"MX17004","2010","3","tmax","---","---","---","---","32.1","---","---","---"
"MX17004","2010","3","tmin","---","---","---","---","14.2","---","---","---"
"MX17004","2010","4","tmax","---","---","---","---","---","---","---","---"
"MX17004","2010","4","tmin","---","---","---","---","---","---","---","---"
"MX17004","2010","5","tmax","---","---","---","---","---","---","---","---"
"MX17004","2010","5","tmin","---","---","---","---","---","---","---","---"
"""

w = CSV.read(IOBuffer(weather), DataFrame; missingstring="---")
first(w, 4)
4×12 DataFrame
Row id year month element d1 d2 d3 d4 d5 d6 d7 d8
String7 Int64 Int64 String7 Missing Float64? Float64? Missing Float64? Missing Missing Missing
1 MX17004 2010 1 tmax missing missing missing missing missing missing missing missing
2 MX17004 2010 1 tmin missing missing missing missing missing missing missing missing
3 MX17004 2010 2 tmax missing 27.3 24.1 missing missing missing missing missing
4 MX17004 2010 2 tmin missing 14.4 14.4 missing missing missing missing missing

This example is said to have variables stored in both rows and columns. The first step to tidying the data is to stack days 1 through 8 into a column. We use the Between column selector to select the columns for d1 though d8 (an alternative to, say, r"^d"):

w1 = @chain w begin
    stack(Between(:d1, :d8); variable_name = :day)
    filter(:value => !ismissing, _)
end
6×6 DataFrame
Row id year month element day value
String7 Int64 Int64 String7 String Float64?
1 MX17004 2010 2 tmax d2 27.3
2 MX17004 2010 2 tmin d2 14.4
3 MX17004 2010 2 tmax d3 24.1
4 MX17004 2010 2 tmin d3 14.4
5 MX17004 2010 3 tmax d5 32.1
6 MX17004 2010 3 tmin d5 14.2

As in the example being followed, missing values are skipped, requiring the reader to imply the lack of measurement on a given day.

The day is coded with a leading “d” and as a string, not an integer. We can strip this prefix out using a regular expression or, as here, string indexing then parse the result to an integer:

transform!(w1, :day => ByRow(x -> parse(Int, x[2:end])) => :day)
6×6 DataFrame
Row id year month element day value
String7 Int64 Int64 String7 Int64 Float64?
1 MX17004 2010 2 tmax 2 27.3
2 MX17004 2010 2 tmin 2 14.4
3 MX17004 2010 2 tmax 3 24.1
4 MX17004 2010 2 tmin 3 14.4
5 MX17004 2010 3 tmax 5 32.1
6 MX17004 2010 3 tmin 5 14.2

Finally, the element column does not hold a variable, rather it stores the names of two variables (tmin and tmax). For this the data is “unstacked” splitting the longer variable value in two:

w2 = unstack(w1, :element, :value)
3×6 DataFrame
Row id year month day tmax tmin
String7 Int64 Int64 Int64 Float64? Float64?
1 MX17004 2010 2 2 27.3 14.4
2 MX17004 2010 2 3 24.1 14.4
3 MX17004 2010 3 5 32.1 14.2

As described in the vignette, now each variable is in one column, and each row describes one day.

3.1.12 Joins

Consider the three following assessments of some class. The students have somewhat spotty attendance:

t1 = """
first,last,points
Alice, Smith, 100
Bob, Jones, 200
"""

t2 = """
first,last,score
Carol, Gates, 125
Chad, Gad, 200
Bob, Jones, 225
"""

t3 = """
first,last,points
Bob, Jones, 300
Carol, Gates, 150
Erin, Dan, 100
"""
a1, a2, a3 = [CSV.read(IOBuffer(txt), DataFrame) for txt  (t1, t2, t3)];

We have several ways to combine these three data frames.

We can concatenate them using vcat. Before doing so, we need to align the column names, as the second data set used a different name:

select!(a2, 1:2, :score => :points);

The vcat method for data frames takes a named argument source allowing a variable to be appended indicating the source of the data frame. The following will indicate with a 1, 2, or 3 what data frame contributed what row:

aₛ = vcat(a1, a2, a3; source=:assignment)
describe(aₛ)
4×7 DataFrame
Row variable mean min median max nmissing eltype
Symbol Union… Any Union… Any Int64 DataType
1 first Alice Erin 0 String7
2 last Dan Smith 0 String7
3 points 175.0 100 175.0 300 0 Int64
4 assignment 2.125 1 2.0 3 0 Int64

Similarly, append! could be used in combination with reduce (as in reduce(append!, (a1, a2, a3))).

A more common alternative to the above is to join the data frames using one of several join commands. Joins allow multiple tables to be merged, or mashed. into one. We briefly illustrate the implementations in DataFrames of outer, left, right, inner, semi, and anti joins, leaving the cross join and further details to the documentation.

We return to the original data, with the differently named columns:

a1, a2, a3 = [CSV.read(IOBuffer(txt), DataFrame) for txt  (t1, t2, t3)];

The first join we introduce is outerjoin. It combines two data frames with rows for keys found in any of the data frames passed in.

outerjoin(a1, a2, on=["first", "last"])
4×4 DataFrame
Row first last points score
String7 String7 Int64? Int64?
1 Bob Jones 200 225
2 Alice Smith 100 missing
3 Carol Gates missing 125
4 Chad Gad missing 200

The on argument is a column name (or names as above) to join the data frames on. They need not have the same name; the values passed to on may be pairs indicating the mapping. (That is, if we were joining on the last variable, on=:points => :score would be possible.)

The variable to join on is often called the “key,” though in Julia that terminology is also used with associative arrays.

The joined data frame includes all columns from either data frame not specified in on. The resulting data frame, in this case, is in a wide, not tidy, format.

In the above, the columns do not have names in conflict. However, were a1 and a3 joined, they would have been. Passing makeunique=true instructs distinguishing names to be produced:

outerjoin(a1, a3, on=["first", "last"], makeunique=true)
4×4 DataFrame
Row first last points points_1
String7 String7 Int64? Int64?
1 Bob Jones 200 300
2 Alice Smith 100 missing
3 Carol Gates missing 150
4 Erin Dan missing 100

More than two data frames may be passed to outerjoin, as illustrated below where all three are. The result includes a row for each person (key) in any of the data sets.

outerjoin(a1, a2, a3, on=["first", "last"], makeunique=true)
5×5 DataFrame
Row first last points score points_1
String7 String7 Int64? Int64? Int64?
1 Bob Jones 200 225 300
2 Carol Gates missing 125 150
3 Alice Smith 100 missing missing
4 Chad Gad missing 200 missing
5 Erin Dan missing missing 100

From the outerjoin, we easily note that only the first listed student was present for all the assessments.

Three other joins related to outerjoin are:

  • leftjoin(df1, df2; on, ...) which includes all rows in the left data frame
  • rightjoin(df1, df2; on, ...) which includes all rows in the right data frame (similar to leftjoin(df2, df1).
  • innerjoin(df1, df2; on, ...) which includes all rows with keys in both data data frames.

To illustrate:

leftjoin(a1, a2, on=["first", "last"])
2×4 DataFrame
Row first last points score
String7 String7 Int64 Int64?
1 Bob Jones 200 225
2 Alice Smith 100 missing
rightjoin(a1, a2, on=["first", "last"])
3×4 DataFrame
Row first last points score
String7 String7 Int64? Int64
1 Bob Jones 200 225
2 Carol Gates missing 125
3 Chad Gad missing 200

In the following, we must qualify the function below, as SplitApplyCombine has a conflicting use, though this isn’t typically needed:

DataFrames.innerjoin(a1, a2, on=["first", "last"])
1×4 DataFrame
Row first last points score
String7 String7 Int64 Int64
1 Bob Jones 200 225

Joins have an analog in set operations: the outer join is like a union, the inner join is like an intersection.

The next two joins do not join columns, as the ones above. Rather, they use the second data frame to filter out rows of the first data frame.

The semijoin returns the rows of the left data frame that match a key of the right data frame:

semijoin(a1, a2[3:3, :]; on=["first", "last"])
1×3 DataFrame
Row first last points
String7 String7 Int64
1 Bob Jones 200

An antijoin includes rows in the left data frame that do not match a key in the right data frame. (Like the set difference). Here we see the row in common to a1 and a2 is not included:

antijoin(a1, a2, on=["first", "last"])
1×3 DataFrame
Row first last points
String7 String7 Int64
1 Alice Smith 100

  1. There are other storage formats available within Julia that can contain more detail on the data, e.g. Arrow.↩︎

  2. This combination of column selectors is also performed by Cols(1, :v, "p", r"^l").↩︎

  3. The DataPipes package inserts a magical _ in the last positional argument, which is common with higher-order functions like filter. The example could become @p tbl |> filter(_.v >= 25) |> filter(_.p >= 350) with the new lines replacing the pipe operation as a convenience with longer chains↩︎