This lecture is a slightly modified version of https://lectures.quantecon.org/jl/data_statistical_packages.html. Thank you to the amazing QuantEcon.org team!
This lecture explores some of the key packages for working with data and doing statistics in Julia
In particular, we will examine the DataFrame
object in detail (i.e., construction, manipulation, querying, visualization, and nuances like missing data)
While Julia is not an ideal language for pure cookie-cutter statistical analysis, it has many useful packages to provide those tools as part of a more general solution
Examples include GLM.jl
and FixedEffectModels.jl
, which we discuss
This list is not exhaustive, and others can be found in organizations such as JuliaStats, JuliaData, and QueryVerse
using LinearAlgebra, Statistics
using DataFrames, RDatasets, DataFramesMeta, CategoricalArrays, Query, VegaLite
using DataVoyager, GLM, RegressionTables, FixedEffectModels
A useful package for working with data is DataFrames.jl
The most important data type provided is a DataFrame
, a two-dimensional array for storing heterogeneous data
Although data can be heterogeneous within a DataFrame
, the contents of the columns must be homogeneous
(of the same type)
This is analogous to a data.frame
in R, a DataFrame
in Pandas (Python) or, more loosely, a spreadsheet in Excel
There are a few different ways to create a DataFrame
The first is to set up columns and construct a dataframe by assigning names
using DataFrames, RDatasets # RDatasets provides good standard data examples from R
# note use of missing
commodities = ["crude", "gas", "gold", "silver"]
last_price = [4.2, 11.3, 12.1, missing]
df = DataFrame(commod = commodities, price = last_price)
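A few basic functions are handy for a first look at a new DataFrame; the following is a minimal sketch using standard DataFrames functions (size, names, first):
@show size(df)     # (rows, columns)
@show names(df)    # column names
first(df, 2)       # peek at the first two rows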
Columns of the DataFrame
can be accessed by name, either with the indexing syntax df[!, :price] or the struct-style field access df.price, as below
df[!,:price]
df.price
Note that the element type of this array is Union{Missing, Float64}, since the column was created with a missing value
df.commod
The DataFrames.jl
package provides a number of methods for acting on DataFrame
’s, such as describe
describe(df)
While data will often be generated all at once, or read from a file, you can add rows to a DataFrame by pushing values with the matching column names (e.g. a named tuple)
nt = (commod = "nickel", price= 5.1)
push!(df, nt)
Named tuples can also be used to construct a DataFrame
, and have it properly deduce all types
ntx = (t = 1, col1 = 3.0)
df2 = DataFrame([ntx])
push!(df2, (t=2, col1 = 4.0))
As we discussed in fundamental types, the semantics of missing
are that mathematical operations will not silently ignore it
In order to allow missing
in a column, you can create/load the DataFrame
from a source with missing
’s, or call allowmissing!
on a column
allowmissing!(df2, :col1) # necessary to allow missing values in col1
push!(df2, (t=3, col1 = missing))
push!(df2, (t=4, col1 = 5.1))
eltype(df2.col1)
We can see the propagation of missing
to caller functions, as well as a way to efficiently calculate with non-missing data
@show mean(df2.col1)
@show mean(skipmissing(df2.col1))
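Another option, rather than computing around the missing values, is to drop the offending rows entirely; a minimal sketch using DataFrames' dropmissing, which returns a new DataFrame and leaves df2 unchanged:
dropmissing(df2, :col1)   # rows where col1 is missing are removed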
And to replace the missing values
df2.col1 .= coalesce.(df2.col1, 0.0) # replace all missing with 0.0
One way to do an additional calculation with a DataFrame
is to use the @transform
macro from DataFramesMeta.jl
using DataFramesMeta
f(x) = x^2
df2 = @transform(df2, col2 = f.(:col1), col3 = ones(size(df2,1)))
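Other everyday manipulations include selecting a subset of columns and sorting rows; a short sketch using the standard DataFrames functions select and sort:
select(df2, :col1, :col2)      # keep only these columns (returns a new DataFrame)
sort(df2, :col2, rev = true)   # sort rows by col2 in descending order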
For data that is categorical, use the CategoricalArrays.jl package
using CategoricalArrays
id = [1, 2, 3, 4]
y = ["old", "young", "young", "old"]
y = CategoricalArray(y)
df = DataFrame(id=id, y=y)
levels(df.y)
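Categorical columns also work smoothly with split-apply-combine operations; as a minimal sketch, using DataFrames' groupby and combine (the mean of id per group is purely for illustration):
using Statistics
combine(groupby(df, :y), :id => mean)   # mean of id within each level of y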
The DataFrame
(and similar types that fulfill a standard generic interface) can fit into a variety of packages
One set of them is the QueryVerse
Note: The QueryVerse, in the same spirit as R’s tidyverse, makes heavy use of the pipeline syntax |>
x = 3.0
f(x) = x^2
g(x) = log(x)
@show g(f(x))
@show x |> f |> g; # pipes nest function calls
To give an example taken directly from the documentation of the LINQ-inspired Query.jl
using Query
df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])
x = @from i in df begin
    @where i.age > 50
    @select {i.name, i.children}
    @collect DataFrame
end
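Query.jl also ships standalone pipeline commands which compose with |>; a sketch of the same query written in that style (assuming the df from above):
df |> @filter(_.age > 50) |> @map({_.name, _.children}) |> DataFrame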
While it is possible to just use the Plots.jl
library, there may be better options for displaying tabular data – such as VegaLite.jl
# Error with VegaLite and NodeJS in Jupyter
#using RDatasets, VegaLite
#iris = dataset("datasets", "iris")
#iris |> @vlplot(
# :point,
# x=:PetalLength,
# y=:PetalWidth,
# color=:Species
#)
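Since the VegaLite cell above is commented out (due to the NodeJS issue in Jupyter), here is a rough Plots.jl equivalent as a sketch; note that Plots.jl is not in this lecture's package list, so this assumes it has been added separately:
# using Plots
# iris = dataset("datasets", "iris")
# scatter(iris.PetalLength, iris.PetalWidth, group = iris.Species,
#         xlabel = "PetalLength", ylabel = "PetalWidth")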
Another useful tool for exploring tabular data is DataVoyager.jl
using DataVoyager
iris = dataset("datasets", "iris")
iris |> Voyager()
While Julia is not intended as a replacement for R, Stata, and similar specialty languages, it has a growing number of packages aimed at statistics and econometrics
Many of the packages live in the JuliaStats organization
A few to point out
using GLM
x = randn(100)
y = 0.9 .* x + 0.5 * rand(100)
df = DataFrame(x=x, y=y)
ols = lm(@formula(y ~ x), df) # R-style notation
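The fitted model supports the standard accessor functions from StatsBase (re-exported by GLM); for example, as a minimal sketch:
@show coef(ols)       # estimated intercept and slope
@show stderror(ols)   # standard errors of the coefficients
@show r2(ols);        # R² of the regression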
To display the results in useful tables for LaTeX and the REPL, use RegressionTables.jl, developed at SciencesPo by Johannes Böhm, which produces output similar to the Stata package esttab and the R package stargazer
using RegressionTables
regtable(ols)
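RegressionTables can also write the same table to a LaTeX file via the renderSettings option used later in this lecture; a minimal sketch (the filename is arbitrary):
# regtable(ols, renderSettings = latexOutput("ols_table.tex"))   # writes LaTeX to ols_table.tex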
While Julia may be overkill for estimating a simple linear regression, fixed-effects estimation with dummies for multiple categorical variables is much more computationally intensive, and FixedEffectModels.jl is a great tool for this task.
For a two-way fixed-effects model, we take the example directly from the package documentation, which uses cigarette consumption data
using FixedEffectModels
cigar = dataset("plm", "Cigar")
cigar.StateCategorical = categorical(cigar.State) # convert to categorical data type
cigar.YearCategorical = categorical(cigar.Year)
describe(cigar)
fixedeffectresults = reg(cigar,
                         @formula(Sales ~ NDI + fe(StateCategorical) + fe(YearCategorical)),
                         Vcov.cluster(:StateCategorical),
                         weights = :Pop)
regtable(fixedeffectresults)
?reg
"""
FEbenchmark(;N=10_000_000,K= 100)
A function to create a 2-way fixed effect dataset. `N` observations, `K`
and `N/K` categories for Fixed effects respectively. We generate 7 regressors as well as a weight vector.
"""
function FEbenchmark(; N = 10_000_000, K = 100)
    id1_int = Int.(rand(1:(N/K), N))
    id2_int = Int.(rand(1:K, N))
    w = cos.(id1_int)
    x1 = 5 * cos.(id1_int) + 5 * sin.(id2_int) + randn(N)
    x2 = cos.(id1_int) + sin.(id2_int) + randn(N)
    x3 = cos.(id1_int) + sin.(id2_int) + randn(N)
    x4 = cos.(id1_int) + sin.(id2_int) + randn(N)
    x5 = cos.(id1_int) + sin.(id2_int) + randn(N)
    x6 = cos.(id1_int) + sin.(id2_int) + randn(N)
    x7 = cos.(id1_int) + sin.(id2_int) + randn(N)
    y = 3 .* x1 .+ 5 .* x2 .+ x3 .+ x4 .+ x5 .+ x6 .+ x7 .+ cos.(id1_int) .+ cos.(id2_int).^2 .+ randn(N)
    df = DataFrame(id1 = categorical(id1_int), id1_int = id1_int,
                   id2 = categorical(id2_int), id2_int = id2_int,
                   x1 = x1, x2 = x2, x3 = x3, x4 = x4, x5 = x5, x6 = x6, x7 = x7,
                   w = w, y = y)
    df
end
# create our dataset
df = FEbenchmark()
# run a bunch of different models
# the comments show timings of a previous version of FixedEffectModels (faster!)
@time reg(df, @formula(y ~ x1 + x2))
# 0.582029 seconds (852 allocations: 535.311 MiB, 18.28% gc time)
@time reg(df, @formula(y ~ x1 + x2), Vcov.cluster(:id2))
# 0.809652 seconds (1.55 k allocations: 649.821 MiB, 14.40% gc time)
@time reg(df, @formula(y ~ x1 + x2 + fe(id1)))
# 1.655732 seconds (1.21 k allocations: 734.353 MiB, 16.88% gc time)
@time reg(df, @formula(y ~ x1 + x2 + fe(id1)), Vcov.cluster(:id1))
# 2.113248 seconds (499.66 k allocations: 811.751 MiB, 15.08% gc time)
@time reg(df, @formula(y ~ x1 + x2 + fe(id1) + fe(id2)))
# 3.553678 seconds (1.36 k allocations: 1005.101 MiB, 10.55% gc time))
# let's take the biggest model as our benchmark timing
@time result = reg(df, @formula(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + fe(id1) + fe(id2)))
# using R lfe package
using RCall
rstring = R"""
# factorize data
# notice that we have to pass the integer data in id1_int.
r_d = $df
r_d[,"id1"] = factor(r_d[,"id1_int"])
r_d[,"id2"] = factor(r_d[,"id2_int"])
library(lfe)
r_time = system.time(lfe <- felm(y ~x1 + x2 + x3 + x4 + x5 + x6 + x7| id1 + id2, data=r_d))
print(paste0("R lfe time: ",r_time[1]))
print(summary(lfe))
r_time[1]
#
## OLS.
#ols <- lm(y ~x1 + x2 + x3 + x4 + x5 + x6 + x7 + id1 + id2 -1, data=r_d)
#summary(ols)
"""
# use option save=true to store estimated fixed effects
@time result = reg(df, @formula(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + fe(id1) + fe(id2)), save = true)
# using RegressionTables
# regtable(result)
# current incompatibility https://github.com/jmboehm/RegressionTables.jl/issues/35
cor(result.augmentdf.fe_id1,result.augmentdf.fe_id2) # AKM sorting?
# more models
result2 = reg(df, @formula(y ~ x1 + x2 + x3 + fe(id1) + fe(id2)))
result3 = reg(df, @formula(y ~ x1 + x3 + fe(id1) ))
# how to look at those? -> regtable!
# regtable(result, result2, result3 )
# latex?
# regtable(result, result2, result3, renderSettings = latexOutput() ) # latexOutput(filename::String)
# options
# regtable(result, result2, result3 , below_statistic = :blank)
using GLM
ols = lm(@formula(y ~ x1 + x2),df)
# regtable(result, result2,ols,below_decoration = s -> "[$(s)]")
💪 Thanks to @jmboehm for RegressionTables.jl!
To explore the data, use the interactive DataVoyager and VegaLite
# cigar |> Voyager()
cigar |> @vlplot(
:point,
x=:Price,
y=:Sales,
color=:Year,
size=:NDI
)