Data and Statistics Packages

 Florian Oswald

This is lecture is a slightly modified version of https://lectures.quantecon.org/jl/data_statistical_packages.html Thank you to the amazing Quantecon.org team!

Overview

This lecture explores some of the key packages for working with data and doing statistics in Julia

In particular, we will examine the DataFrame object in detail (i.e., construction, manipulation, querying, visualization, and nuances like missing data)

While Julia is not an ideal language for pure cookie-cutter statistical analysis, it has many useful packages to provide those tools as part of a more general solution

Examples include GLM.jl and FixedEffectModels.jl, which we discuss

This list is not exhaustive, and others can be found in organizations such as JuliaStats, JuliaData, and QueryVerse

In [1]:
# It's a massive contribution of the QuantEcon team that we can now instantiate a fully tested package
# environment directly from the web. The below commands will download and precompile 
# an extensive list of packages, the first time it's called. Next time, we will immediately be able to use them.

using InstantiateFromURL

# activate the QuantEcon environment
activate_github("QuantEcon/QuantEconLectureAllPackages", tag = "v0.9.7");
Downloading QuantEcon/QuantEconLectureAllPackages-v0.9.7 → /Users/florian.oswald/Dropbox/teaching/ScPo/ScPo-CompEcon/CoursePack/Notebooks/.projects
######################################################################### 100.0%
Instantiating /Users/florian.oswald/Dropbox/teaching/ScPo/ScPo-CompEcon/CoursePack/Notebooks/.projects/QuantEconLectureAllPackages-v0.9.7
  Updating registry at `~/.julia/registries/General`
  Updating git-repo `https://github.com/JuliaRegistries/General.git`
[1mFetching: [========================================>]  100.0 %.0 %]  45.8 %=====================================>   ]  92.0 %Precompiling project...
Precompiling ForwardDiff
┌ Info: Recompiling stale cache file /Users/florian.oswald/.julia/compiled/v1.1/ForwardDiff/k0ETY.ji for ForwardDiff [f6369f11-7733-5829-9624-2563aa707210]
└ @ Base loading.jl:1184
Precompiling LazyArrays
┌ Info: Recompiling stale cache file /Users/florian.oswald/.julia/compiled/v1.1/LazyArrays/h0pkJ.ji for LazyArrays [5078a376-72f3-5289-bfd5-ec5146d43c02]
└ @ Base loading.jl:1184
Precompiling VegaDatasets
┌ Info: Recompiling stale cache file /Users/florian.oswald/.julia/compiled/v1.1/VegaDatasets/v6OW9.ji for VegaDatasets [0ae4a718-28b7-58ec-9efb-cded64d6d5b4]
└ @ Base loading.jl:1184
Precompiling Compat
┌ Info: Recompiling stale cache file /Users/florian.oswald/.julia/compiled/v1.1/Compat/GSFWK.ji for Compat [34da2185-b29b-5c13-b0c7-acf172513d20]
└ @ Base loading.jl:1184
Precompiling Latexify
┌ Info: Precompiling Latexify [23fbe1c1-3f47-55db-b15f-69d7ec21a316]
└ @ Base loading.jl:1186
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
Precompiling Expectations
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
┌ Info: Precompiling Expectations [2fe49d83-0758-5602-8f54-1f90ad0d522b]
└ @ Base loading.jl:1186
Precompiling DataFrames
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
┌ Info: Recompiling stale cache file /Users/florian.oswald/.julia/compiled/v1.1/DataFrames/AR9oZ.ji for DataFrames [a93c6f00-e57d-5684-b7b6-d8193f3e46c0]
└ @ Base loading.jl:1184
Precompiling LeastSquaresOptim
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
┌ Info: Precompiling LeastSquaresOptim [0fc2ff8b-aaa3-5acd-a817-1944a5e08891]
└ @ Base loading.jl:1186
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
Precompiling GR
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
┌ Info: Recompiling stale cache file /Users/florian.oswald/.julia/compiled/v1.1/GR/NDU5Y.ji for GR [28b8d3ca-fb5f-59d9-8090-bfdbd6d07a71]
└ @ Base loading.jl:1184
Precompiling FastGaussQuadrature
┌ Info: Recompiling stale cache file /Users/florian.oswald/.julia/compiled/v1.1/FastGaussQuadrature/vg51R.ji for FastGaussQuadrature [442a2c76-b920-505d-bb47-c5924d526838]
└ @ Base loading.jl:1184
Precompiling Queryverse
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
┌ Info: Precompiling Queryverse [612083be-0b0f-5412-89c1-4e7c75506a58]
└ @ Base loading.jl:1186
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
Precompiling Sundials
┌ Info: Recompiling stale cache file /Users/florian.oswald/.julia/compiled/v1.1/Sundials/j8Ppj.ji for Sundials [c3572dad-4567-51f8-b174-8c6c989267f4]
└ @ Base loading.jl:1184
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
Precompiling StatsPlots
┌ Info: Precompiling StatsPlots [f3b207a7-027a-5e70-b257-86293d7955fd]
└ @ Base loading.jl:1186
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
Precompiling Traceur
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
┌ Info: Precompiling Traceur [37b6cedf-1f77-55f8-9503-c64b63398394]
└ @ Base loading.jl:1186
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
Precompiling BandedMatrices
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
┌ Info: Recompiling stale cache file /Users/florian.oswald/.julia/compiled/v1.1/BandedMatrices/OxlqV.ji for BandedMatrices [aae01518-5342-5314-be14-df237901396f]
└ @ Base loading.jl:1184
Precompiling Flux
┌ Info: Precompiling Flux [587475ba-b771-5e3f-ad9e-33799f191a9c]
└ @ Base loading.jl:1186
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
Precompiling Plots
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
┌ Info: Precompiling Plots [91a5bcdd-55d7-5caf-9e0b-520d859cae80]
└ @ Base loading.jl:1186
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
Precompiling BlackBoxOptim
┌ Info: Precompiling BlackBoxOptim [a134a8b2-14d6-55f6-9291-3336d3ab0209]
└ @ Base loading.jl:1186
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
Precompiling DiffEqDiffTools
┌ Info: Recompiling stale cache file /Users/florian.oswald/.julia/compiled/v1.1/DiffEqDiffTools/UPWPf.ji for DiffEqDiffTools [01453d9d-ee7c-5054-8395-0335cb756afa]
└ @ Base loading.jl:1184
Precompiling LightGraphs
┌ Info: Precompiling LightGraphs [093fc24a-ae57-5d10-9952-331d41423f4d]
└ @ Base loading.jl:1186
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
Precompiling Interpolations
┌ Info: Recompiling stale cache file /Users/florian.oswald/.julia/compiled/v1.1/Interpolations/VpKVx.ji for Interpolations [a98d9a8b-a2ab-59e6-89dd-64a1c18fca59]
└ @ Base loading.jl:1184
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
Precompiling JuMP
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
┌ Info: Precompiling JuMP [4076af6c-e467-56ae-b986-b466b2749572]
└ @ Base loading.jl:1186
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
Precompiling Query
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
┌ Info: Precompiling Query [1a8c2f83-1ff3-5112-b086-8aa67b057ba1]
└ @ Base loading.jl:1186
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
Precompiling Ipopt
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
┌ Info: Precompiling Ipopt [b6b21f68-93f8-5de0-b562-5493be1d77c9]
└ @ Base loading.jl:1186
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
Precompiling NLopt
┌ Info: Recompiling stale cache file /Users/florian.oswald/.julia/compiled/v1.1/NLopt/faRdv.ji for NLopt [76087f3c-5699-56af-9a33-bf431cd00edd]
└ @ Base loading.jl:1184
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
Precompiling LaTeXStrings
┌ Info: Precompiling LaTeXStrings [b964fa9f-0449-5b57-a5c2-d3ea65f4040f]
└ @ Base loading.jl:1186
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
Precompiling QuantEcon
┌ Info: Precompiling QuantEcon [fcd29c91-0bd7-5a09-975d-7ac3f643a60c]
└ @ Base loading.jl:1186
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache.
│ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does.
└ @ Base loading.jl:947
In [4]:
using LinearAlgebra, Statistics, Compat
using DataFrames, RDatasets, DataFramesMeta, CategoricalArrays, Query, VegaLite
using DataVoyager, GLM, RegressionTables, FixedEffectModels
┌ Info: Precompiling RDatasets [ce6b1742-4840-55fa-b093-852dadbb1d8b]
└ @ Base loading.jl:1186
┌ Info: Precompiling DataFramesMeta [1313f7d8-7da2-5740-9ea0-a2ca25f37964]
└ @ Base loading.jl:1186
┌ Info: Precompiling Query [1a8c2f83-1ff3-5112-b086-8aa67b057ba1]
└ @ Base loading.jl:1186
┌ Info: Precompiling VegaLite [112f6efa-9a02-5b7d-90c0-432ed331239a]
└ @ Base loading.jl:1186
┌ Info: Precompiling DataVoyager [5721bf48-af8e-5845-8445-c9e18126e773]
└ @ Base loading.jl:1186
┌ Info: Precompiling GLM [38e38edf-8417-5370-95a0-9cbb8c7f171a]
└ @ Base loading.jl:1186
┌ Info: Precompiling RegressionTables [d519eb52-b820-54da-95a6-98e1306fdade]
└ @ Base loading.jl:1186
WARNING: could not import StatsBase.df_residual into FixedEffectModels

DataFrames

A useful package for working with data is DataFrames.jl

The most important data type provided is a DataFrame, a two dimensional array for storing heterogeneous data

Although data can be heterogeneous within a DataFrame, the contents of the columns must be homogeneous (of the same type)

This is analogous to a data.frame in R, a DataFrame in Pandas (Python) or, more loosely, a spreadsheet in Excel

There are a few different ways to create a DataFrame

Constructing a DataFrame

The first is to set up columns and construct a dataframe by assigning names

In [3]:
using DataFrames, RDatasets  # RDatasets provides good standard data examples from R

# note use of missing
commodities = ["crude", "gas", "gold", "silver"]
last_price = [4.2, 11.3, 12.1, missing]
df = DataFrame(commod = commodities, price = last_price)
Out[3]:

4 rows × 2 columns

commodprice
StringFloat64⍰
1crude4.2
2gas11.3
3gold12.1
4silvermissing

Columns of the DataFrame can be accessed by name using a symbol df[:row] or a struct-style df.row, as below

In [4]:
df[:price]
Out[4]:
4-element Array{Union{Missing, Float64},1}:
  4.2     
 11.3     
 12.1     
   missing
In [5]:
df.price
Out[5]:
4-element Array{Union{Missing, Float64},1}:
  4.2     
 11.3     
 12.1     
   missing

Note that the type of this array has values Union{Missing, Float64} since it was created with a missing value

In [6]:
df.commod
Out[6]:
4-element Array{String,1}:
 "crude" 
 "gas"   
 "gold"  
 "silver"

The DataFrames.jl package provides a number of methods for acting on DataFrame’s, such as describe

In [4]:
describe(df)
Out[4]:

2 rows × 8 columns

variablemeanminmedianmaxnuniquenmissingeltype
SymbolUnion…AnyUnion…AnyUnion…Union…DataType
1commodcrudesilver4String
2price9.24.211.312.11Float64

While often data will be generated all at once, or read from a file, you can add to a DataFrame by providing the key parameters

In [5]:
nt = (commod = "nickel", price= 5.1)
push!(df, nt)
Out[5]:

5 rows × 2 columns

commodprice
StringFloat64⍰
1crude4.2
2gas11.3
3gold12.1
4silvermissing
5nickel5.1

Named tuples can also be used to construct a DataFrame, and have it properly deduce all types

In [7]:
ntx = (t = 1, col1 = 3.0)
df2 = DataFrame([ntx])
push!(df2, (t=2, col1 = 4.0))
Out[7]:

2 rows × 2 columns

tcol1
Int64Float64
113.0
224.0

Working with Missing

As we discussed in fundamental types, the semantics of missing are that mathematical operations will not silently ignore it

In order to allow missing in a column, you can create/load the DataFrame from a source with missing’s, or call allowmissing! on a column

In [9]:
allowmissing!(df2, :col1) # necessary to add in a for col1
push!(df2, (t=3, col1 = missing))
push!(df2, (t=4, col1 = 5.1))
Out[9]:

4 rows × 2 columns

tcol1
Int64Float64⍰
113.0
224.0
33missing
445.1
In [12]:
eltype(df2.col1)
Out[12]:
Union{Missing, Float64}

We can see the propagation of missing to caller functions, as well as a way to efficiently calculate with non-missing data

In [11]:
@show mean(df2.col1)
@show mean(skipmissing(df2.col1))
mean(df2.col1) = missing
mean(skipmissing(df2.col1)) = 4.033333333333333
Out[11]:
4.033333333333333

And to replace the missing

In [13]:
df2.col1  .= coalesce.(df2.col1, 0.0) # replace all missing with 0.0
Out[13]:
4-element Array{Union{Missing, Float64},1}:
 3.0
 4.0
 0.0
 5.1

Manipulating and Transforming DataFrames

One way to do an additional calculation with a DataFrame is to tuse the @transform macro from DataFramesMeta.jl

In [14]:
using DataFramesMeta
f(x) = x^2
df2 = @transform(df2, col2 = f.(:col1), col3 = 1.0)
Out[14]:

4 rows × 4 columns

tcol1col2col3
Int64Float64⍰Float64Float64
113.09.01.0
224.016.01.0
330.00.01.0
445.126.011.0

Categorical Data

For data that is categorical

In [15]:
using CategoricalArrays
id = [1, 2, 3, 4]
y = ["old", "young", "young", "old"]
y = CategoricalArray(y)
df = DataFrame(id=id, y=y)
Out[15]:

4 rows × 2 columns

idy
Int64Categorical…
11old
22young
33young
44old
In [16]:
levels(df.y)
Out[16]:
2-element Array{String,1}:
 "old"  
 "young"

Visualization, Querying, and Plots

The DataFrame (and similar types that fulfill a standard generic interface) can fit into a variety of packages

One set of them is the QueryVerse

Note: The QueryVerse, in the same spirit as R’s tidyverse, makes heavy use of the pipeline syntax |>

In [16]:
x = 3.0
f(x) = x^2
g(x) = log(x)

@show g(f(x))
@show x |> f |> g; # pipes nest function calls
g(f(x)) = 2.1972245773362196
(x |> f) |> g = 2.1972245773362196

To give an example directly from the source of the LINQ inspired Query.jl

In [18]:
using Query

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])

x = @from i in df begin
    @where i.age>50
    @select {i.name, i.children}
    @collect DataFrame
end
Out[18]:

1 rows × 2 columns

namechildren
StringInt64
1Kirk2

While it is possible to just use the Plots.jl library, there may be better options for displaying tabular data – such as VegaLite.jl

In [19]:
using RDatasets, VegaLite
iris = dataset("datasets", "iris")

iris |> @vlplot(
    :point,
    x=:PetalLength,
    y=:PetalWidth,
    color=:Species
)
Out[19]:
setosaversicolorvirginicaSpecies01234567PetalLength0.00.51.01.52.02.5PetalWidth
In [20]:
iris
Out[20]:

150 rows × 5 columns

SepalLengthSepalWidthPetalLengthPetalWidthSpecies
Float64Float64Float64Float64Categorical…
15.13.51.40.2setosa
24.93.01.40.2setosa
34.73.21.30.2setosa
44.63.11.50.2setosa
55.03.61.40.2setosa
65.43.91.70.4setosa
74.63.41.40.3setosa
85.03.41.50.2setosa
94.42.91.40.2setosa
104.93.11.50.1setosa
115.43.71.50.2setosa
124.83.41.60.2setosa
134.83.01.40.1setosa
144.33.01.10.1setosa
155.84.01.20.2setosa
165.74.41.50.4setosa
175.43.91.30.4setosa
185.13.51.40.3setosa
195.73.81.70.3setosa
205.13.81.50.3setosa
215.43.41.70.2setosa
225.13.71.50.4setosa
234.63.61.00.2setosa
245.13.31.70.5setosa
254.83.41.90.2setosa
265.03.01.60.2setosa
275.03.41.60.4setosa
285.23.51.50.2setosa
295.23.41.40.2setosa
304.73.21.60.2setosa

Another useful tool for exploring tabular data is DataVoyager.jl

using DataVoyager
iris |> Voyager()

Statistics and Econometrics

While Julia is not intended as a replacement for R, Stata, and similar specialty languages, it has a growing number of packages aimed at statistics and econometrics

Many of the packages live in the JuliaStats organization

A few to point out

  • StatsBase has basic statistical functions such as geometric and harmonic means, auto-correlations, robust statistics, etc.
  • StatsFuns has a variety of mathematical functions and constants such as pdf and cdf of many distributions, softmax, etc.

General Linear Models

To run linear regressions and similar statistics, use the GLM package

In [21]:
using GLM

x = randn(100)
y = 0.9 .* x + 0.5 * rand(100)
df = DataFrame(x=x, y=y)
ols = lm(@formula(y ~ x), df) # R-style notation
Out[21]:
StatsModels.DataFrameRegressionModel{LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: y ~ 1 + x

Coefficients:
             Estimate Std.Error t value Pr(>|t|)
(Intercept)  0.254383 0.0142877 17.8044   <1e-31
x            0.901569 0.0132557 68.0138   <1e-83

To display the results in a useful tables for LaTeX and the REPL, use RegressionTables developed at SciencesPo by Johannes Böhm for output similar to the Stata package esttab and the R package stargazer

In [22]:
using RegressionTables
regtable(ols)
----------------------
                  y   
              --------
                   (1)
----------------------
(Intercept)   0.254***
               (0.014)
x             0.902***
               (0.013)
----------------------
Estimator          OLS
----------------------
N                  100
R2               0.979
----------------------


Fixed Effects

While Julia may be overkill for estimating a simple linear regression, fixed-effects estimation with dummies for multiple variables are much more computationally intensive. The FixedEffectsModel.jl is a great tool for this task.

For a 2-way fixed-effect, taking the example directly from the documentation using cigarette consumption data

In [5]:
using FixedEffectModels
cigar = dataset("plm", "Cigar")
cigar.StateCategorical =  categorical(cigar.State)   # convert to categorical data type
cigar.YearCategorical =  categorical(cigar.Year)

describe(cigar)
Out[5]:

11 rows × 8 columns

variablemeanminmedianmaxnuniquenmissingeltype
SymbolUnion…AnyUnion…AnyUnion…Int64DataType
1State26.8261126.5510Int64
2Year77.56377.5920Int64
3Price68.699923.452.3201.90Float64
4Pop4537.11319.03174.030703.30Float64
5Pop163366.62215.22315.322920.00Float64
6CPI73.596730.662.9140.30Float64
7NDI7525.021322.576281.223074.00Float64
8Sales123.95153.4121.2297.90Float64
9Pimin62.899323.446.4178.50Float64
10StateCategorical151460CategoricalValue{Int64,UInt32}
11YearCategorical6392300CategoricalValue{Int64,UInt32}
In [24]:
fixedeffectresults = reg(cigar, 
    @model(Sales ~ NDI, fe = StateCategorical + YearCategorical,
                            weights = Pop, vcov = cluster(StateCategorical)))
regtable(fixedeffectresults)
----------------------------
                     Sales  
                   ---------
                         (1)
----------------------------
NDI                -0.005***
                     (0.001)
----------------------------
StateCategorical         Yes
YearCategorical          Yes
----------------------------
Estimator                OLS
----------------------------
N                      1,380
R2                     0.804
----------------------------


In [25]:
?reg
search: reg Regex regtable RegexMatch RegressionModel RegressionResult

Out[25]:

Estimate a linear model with high dimensional categorical variables / instrumental variables

Arguments

  • df::AbstractDataFrame
  • model::Model: A model created using @model
  • save::Union{Bool, Symbol} = false: Should residuals and eventual estimated fixed effects saved in a dataframe? Use save = :residuals to only save residuals. Use save = :fe to only save fixed effects.
  • method::Symbol = :lsmr: Method to deman regressors. :lsmr is akin to conjugate gradient descent. With parallel use :lsmr_parallel. To use multi threaded use lsmr_threads. Other choices are :qr and :cholesky (factorization methods)
  • contrasts::Dict = Dict() An optional Dict of contrast codings for each categorical variable in the formula. Any unspecified variables will have DummyCoding.
  • maxiter::Integer = 10000: Maximum number of iterations
  • tol::Real =1e-8: Tolerance

Details

Models with instruments variables are estimated using 2SLS. reg tests for weak instruments by computing the Kleibergen-Paap rk Wald F statistic, a generalization of the Cragg-Donald Wald F statistic for non i.i.d. errors. The statistic is similar to the one returned by the Stata command ivreg2.

Examples

using DataFrames, RDatasets, FixedEffectModels
df = dataset("plm", "Cigar")
df[:StateC] =  categorical(df[:State])
df[:YearC] =  categorical(df[:Year])
reg(df, @model(Sales ~ Price, fe = StateC + YearC))
reg(df, @model(Sales ~ NDI, fe = StateC + StateC&Year))
reg(df, @model(Sales ~ NDI, fe = StateC*Year))
reg(df, @model(Sales ~ (Price ~ Pimin)))
reg(df, @model(Sales ~ Price, weights = Pop))
reg(df, @model(Sales ~ NDI, subset = State .< 30))
reg(df, @model(Sales ~ NDI, vcov = robust))
reg(df, @model(Sales ~ NDI, vcov = cluster(StateC)))
reg(df, @model(Sales ~ NDI, vcov = cluster(StateC + YearC)))
reg(df, @model(Sales ~ YearC), contrasts = Dict(:YearC => DummyCoding(base = 80)))

R LFE vs FixedEffects.jl Shootout

  • High dimensional Fixed effect estimation is compuationally intensive
  • The R packge lfe is a good option that uses multithreading, if available.
  • Let's see how FixedEffects.jl compares to it.
In [2]:
"""
    FEbenchmark(;N=10_000_000,K= 100)

A function to create a 2-way fixed effect dataset. `N` observations, `K` 
and `N/K` categories for Fixed effects respectively. We generate 7 regressors as well as a weight vector.
"""
function FEbenchmark(;N=10_000_000,K= 100)
    id1_int = Int.(rand(1:(N/K), N))
    id2_int = Int.(rand(1:K, N))
    w = cos.(id1_int)
    
    x1 = 5 * cos.(id1_int) + 5 * sin.(id2_int) + randn(N)
    x2 =  cos.(id1_int) + sin.(id2_int) + randn(N)
    x3 =  cos.(id1_int) + sin.(id2_int) + randn(N)
    x4 =  cos.(id1_int) + sin.(id2_int) + randn(N)
    x5 =  cos.(id1_int) + sin.(id2_int) + randn(N)
    x6 =  cos.(id1_int) + sin.(id2_int) + randn(N)
    x7 =  cos.(id1_int) + sin.(id2_int) + randn(N)
    y= 3 .* x1 .+ 5 .* x2 .+ x3 .+ x4 .+ x5 .+ x6 .+ x7 .+ cos.(id1_int) .+ cos.(id2_int).^2 .+ randn(N)
    df = DataFrame(id1 = categorical(id1_int),id1_int = id1_int, 
                   id2 = categorical(id2_int), id2_int = id2_int,
                   x1 = x1, 
                   x2 = x2,
                   x3 = x3,
                   x4 = x4,
                   x5 = x5,
                   x6 = x6,
                   x7 = x7,        
                   w = w, y = y)
    df
end
Out[2]:
FEbenchmark
In [3]:
# create our dataset
df = FEbenchmark()

# run a bunch of different models
@time reg(df, @model(y ~ x1 + x2))
# 0.582029 seconds (852 allocations: 535.311 MiB, 18.28% gc time)
@time reg(df, @model(y ~ x1 + x2, vcov = cluster(id2)))
# 0.809652 seconds (1.55 k allocations: 649.821 MiB, 14.40% gc time)
@time reg(df, @model(y ~ x1 + x2, fe = id1))
# 1.655732 seconds (1.21 k allocations: 734.353 MiB, 16.88% gc time)
@time reg(df, @model(y ~ x1 + x2, fe = id1, vcov = cluster(id1)))
#  2.113248 seconds (499.66 k allocations: 811.751 MiB, 15.08% gc time)
@time reg(df, @model(y ~ x1 + x2, fe = id1 + id2))
# 3.553678 seconds (1.36 k allocations: 1005.101 MiB, 10.55% gc time))

# let's take the biggest model as our benchmark timing
@time result = reg(df, @model(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7, fe = id1 + id2))
UndefVarError: categorical not defined

Stacktrace:
 [1] #FEbenchmark#5(::Int64, ::Int64, ::Function) at ./In[2]:20
 [2] FEbenchmark() at ./In[2]:8
 [3] top-level scope at In[3]:1
In [1]:
# using R lfe package
using RCall
rstring = R"""
# factorize data
# notice that we have to pass the integer data in id1_int.
r_d = $df
r_d[,"id1"] = factor(r_d[,"id1_int"])
r_d[,"id2"] = factor(r_d[,"id2_int"])

library(lfe)
r_time = system.time(lfe <- felm(y ~x1 + x2 + x3 + x4 + x5 + x6 + x7| id1 + id2, data=r_d))
print(paste0("R lfe time: ",r_time[1]))
print(summary(lfe))
r_time[1]
#
## OLS.
#ols <- lm(y ~x1 + x2 + x3 + x4 + x5 + x6 + x7 + id1 + id2 -1, data=r_d)
#summary(ols)
"""
┌ Info: Precompiling RCall [6f49c342-dc21-5d91-9882-a32aef131414]
└ @ Base loading.jl:1186
UndefVarError: df not defined

Stacktrace:
 [1] top-level scope at /Users/florian.oswald/.julia/packages/RCall/29zDq/src/macros.jl:66
 [2] top-level scope at In[1]:3
In [35]:
# use option save=true to store estimated fixed effects
@time result = reg(df, @model(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7, fe = id1 + id2), save = true)
using RegressionTables
regtable(result)
 20.261003 seconds (2.21 k allocations: 3.096 GiB, 9.50% gc time)

----------------------
                 y    
            ----------
                   (1)
----------------------
x1            3.000***
               (0.000)
x2            5.000***
               (0.000)
x3            1.000***
               (0.000)
x4            1.000***
               (0.000)
x5            1.000***
               (0.000)
x6            1.000***
               (0.000)
x7            1.000***
               (0.000)
----------------------
id1                Yes
id2                Yes
----------------------
Estimator          OLS
----------------------
N           10,000,000
R2               0.999
----------------------


In [27]:
cor(result.augmentdf.id1,result.augmentdf.id2)   # AKM sorting?
Out[27]:
-0.00020160140915523005
In [28]:
# more models
result2 = reg(df, @model(y ~ x1 + x2 + x3, fe = id1 + id2))
result3 = reg(df, @model(y ~ x1      + x3, fe = id1 ))
Out[28]:
                      Fixed Effect Model                      
==============================================================
Number of obs:        10000000  Degrees of freedom:     100002
R2:                      0.953  R2 within:               0.907
F-Statistic:         8.63907e7  p-value:                 0.000
Iterations:                  1  Converged:                true
     Estimate   Std.Error t value Pr(>|t|) Lower 95% Upper 95%
x1    4.60759 0.000596781 7720.74    0.000   4.60642   4.60876
x3    1.31957  0.00179355  735.73    0.000   1.31606   1.32309
==============================================================
--------------------------------------------------------------
==============================================================
In [29]:
# how to look at those? -> regtable!
regtable(result, result2, result3 )
------------------------------------------------
                              y                 
            ------------------------------------
                   (1)          (2)          (3)
------------------------------------------------
x1            3.000***     3.000***     4.608***
               (0.000)      (0.001)      (0.001)
x2            5.000***     5.001***             
               (0.000)      (0.001)             
x3            1.000***     1.000***     1.320***
               (0.000)      (0.001)      (0.002)
x4            1.000***                          
               (0.000)                          
x5            1.000***                          
               (0.000)                          
x6            1.000***                          
               (0.000)                          
x7            1.000***                          
               (0.000)                          
------------------------------------------------
id1                Yes          Yes          Yes
id2                Yes          Yes             
------------------------------------------------
Estimator          OLS          OLS          OLS
------------------------------------------------
N           10,000,000   10,000,000   10,000,000
R2               0.999        0.993        0.953
------------------------------------------------


In [30]:
# latex?
regtable(result, result2, result3, renderSettings = latexOutput() )  # latexOutput(filename::String)
\begin{tabular}{lrrr}
\toprule
          &         \multicolumn{3}{c}{y}        \\ 
\cmidrule(lr){2-4} 
          &        (1) &        (2) &        (3) \\ 
\midrule
x1        &   3.000*** &   3.000*** &   4.608*** \\ 
          &    (0.000) &    (0.001) &    (0.001) \\ 
x2        &   5.000*** &   5.001*** &            \\ 
          &    (0.000) &    (0.001) &            \\ 
x3        &   1.000*** &   1.000*** &   1.320*** \\ 
          &    (0.000) &    (0.001) &    (0.002) \\ 
x4        &   1.000*** &            &            \\ 
          &    (0.000) &            &            \\ 
x5        &   1.000*** &            &            \\ 
          &    (0.000) &            &            \\ 
x6        &   1.000*** &            &            \\ 
          &    (0.000) &            &            \\ 
x7        &   1.000*** &            &            \\ 
          &    (0.000) &            &            \\ 
\midrule
id1       &        Yes &        Yes &        Yes \\ 
id2       &        Yes &        Yes &            \\ 
\midrule
Estimator &        OLS &        OLS &        OLS \\ 
\midrule
$N$       & 10,000,000 & 10,000,000 & 10,000,000 \\ 
$R^2$     &      0.999 &      0.993 &      0.953 \\ 
\bottomrule
\end{tabular}

In [31]:
# options
regtable(result, result2, result3 , below_statistic = :blank)
------------------------------------------------
                              y                 
            ------------------------------------
                   (1)          (2)          (3)
------------------------------------------------
x1            3.000***     3.000***     4.608***
                                                
x2            5.000***     5.001***             
                                                
x3            1.000***     1.000***     1.320***
                                                
x4            1.000***                          
                                                
x5            1.000***                          
                                                
x6            1.000***                          
                                                
x7            1.000***                          
                                                
------------------------------------------------
id1                Yes          Yes          Yes
id2                Yes          Yes             
------------------------------------------------
Estimator          OLS          OLS          OLS
------------------------------------------------
N           10,000,000   10,000,000   10,000,000
R2               0.999        0.993        0.953
------------------------------------------------


In [32]:
using GLM
ols = lm(@formula(y ~ x1 + x2),df)
Out[32]:
StatsModels.DataFrameRegressionModel{LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: y ~ 1 + x1 + x2

Coefficients:
             Estimate   Std.Error t value Pr(>|t|)
(Intercept)  0.497543 0.000865641 574.768   <1e-99
x1            4.01824 0.000235451 17066.2   <1e-99
x2            5.20388 0.000849626 6124.91   <1e-99
In [33]:
regtable(result, result2,ols,below_decoration = s -> "[$(s)]")
--------------------------------------------------
                                y                 
              ------------------------------------
                     (1)          (2)          (3)
--------------------------------------------------
x1              3.000***     3.000***     4.018***
                 [0.000]      [0.001]      [0.000]
x2              5.000***     5.001***     5.204***
                 [0.000]      [0.001]      [0.001]
x3              1.000***     1.000***             
                 [0.000]      [0.001]             
x4              1.000***                          
                 [0.000]                          
x5              1.000***                          
                 [0.000]                          
x6              1.000***                          
                 [0.000]                          
x7              1.000***                          
                 [0.000]                          
(Intercept)                               0.498***
                                           [0.001]
--------------------------------------------------
id1                  Yes          Yes             
id2                  Yes          Yes             
--------------------------------------------------
Estimator            OLS          OLS          OLS
--------------------------------------------------
N             10,000,000   10,000,000   10,000,000
R2                 0.999        0.993        0.989
--------------------------------------------------


💪 @jmboehm

To explore the data use the interactive DataVoyager and VegaLite

In [34]:
# cigar |> Voyager()

cigar |> @vlplot(
    :point,
    x=:Price,
    y=:Sales,
    color=:Year,
    size=:NDI
)
Out[34]:
708090Year5,00010,00015,00020,000NDI050100150200Price050100150200250300Sales
In [ ]: