This is lecture is a slightly modified version of https://lectures.quantecon.org/jl/data_statistical_packages.html Thank you to the amazing Quantecon.org team!
This lecture explores some of the key packages for working with data and doing statistics in Julia
In particular, we will examine the DataFrame
object in detail (i.e., construction, manipulation, querying, visualization, and nuances like missing data)
While Julia is not an ideal language for pure cookie-cutter statistical analysis, it has many useful packages to provide those tools as part of a more general solution
Examples include GLM.jl
and FixedEffectModels.jl
, which we discuss
This list is not exhaustive, and others can be found in organizations such as JuliaStats, JuliaData, and QueryVerse
# It's a massive contribution of the QuantEcon team that we can now instantiate a fully tested package
# environment directly from the web. The below commands will download and precompile
# an extensive list of packages, the first time it's called. Next time, we will immediately be able to use them.
using InstantiateFromURL
# activate the QuantEcon environment
activate_github("QuantEcon/QuantEconLectureAllPackages", tag = "v0.9.7");
Downloading QuantEcon/QuantEconLectureAllPackages-v0.9.7 → /Users/florian.oswald/Dropbox/teaching/ScPo/ScPo-CompEcon/CoursePack/Notebooks/.projects
######################################################################### 100.0%
Instantiating /Users/florian.oswald/Dropbox/teaching/ScPo/ScPo-CompEcon/CoursePack/Notebooks/.projects/QuantEconLectureAllPackages-v0.9.7 Updating registry at `~/.julia/registries/General` Updating git-repo `https://github.com/JuliaRegistries/General.git` [1mFetching: [========================================>] 100.0 %.0 %] 45.8 %=====================================> ] 92.0 %Precompiling project... Precompiling ForwardDiff
┌ Info: Recompiling stale cache file /Users/florian.oswald/.julia/compiled/v1.1/ForwardDiff/k0ETY.ji for ForwardDiff [f6369f11-7733-5829-9624-2563aa707210] └ @ Base loading.jl:1184
Precompiling LazyArrays
┌ Info: Recompiling stale cache file /Users/florian.oswald/.julia/compiled/v1.1/LazyArrays/h0pkJ.ji for LazyArrays [5078a376-72f3-5289-bfd5-ec5146d43c02] └ @ Base loading.jl:1184
Precompiling VegaDatasets
┌ Info: Recompiling stale cache file /Users/florian.oswald/.julia/compiled/v1.1/VegaDatasets/v6OW9.ji for VegaDatasets [0ae4a718-28b7-58ec-9efb-cded64d6d5b4] └ @ Base loading.jl:1184
Precompiling Compat
┌ Info: Recompiling stale cache file /Users/florian.oswald/.julia/compiled/v1.1/Compat/GSFWK.ji for Compat [34da2185-b29b-5c13-b0c7-acf172513d20] └ @ Base loading.jl:1184
Precompiling Latexify
┌ Info: Precompiling Latexify [23fbe1c1-3f47-55db-b15f-69d7ec21a316] └ @ Base loading.jl:1186 ┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947
Precompiling Expectations
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947 ┌ Info: Precompiling Expectations [2fe49d83-0758-5602-8f54-1f90ad0d522b] └ @ Base loading.jl:1186
Precompiling DataFrames
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947 ┌ Info: Recompiling stale cache file /Users/florian.oswald/.julia/compiled/v1.1/DataFrames/AR9oZ.ji for DataFrames [a93c6f00-e57d-5684-b7b6-d8193f3e46c0] └ @ Base loading.jl:1184
Precompiling LeastSquaresOptim
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947 ┌ Info: Precompiling LeastSquaresOptim [0fc2ff8b-aaa3-5acd-a817-1944a5e08891] └ @ Base loading.jl:1186 ┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947 ┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947 ┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947
Precompiling GR
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947 ┌ Info: Recompiling stale cache file /Users/florian.oswald/.julia/compiled/v1.1/GR/NDU5Y.ji for GR [28b8d3ca-fb5f-59d9-8090-bfdbd6d07a71] └ @ Base loading.jl:1184
Precompiling FastGaussQuadrature
┌ Info: Recompiling stale cache file /Users/florian.oswald/.julia/compiled/v1.1/FastGaussQuadrature/vg51R.ji for FastGaussQuadrature [442a2c76-b920-505d-bb47-c5924d526838] └ @ Base loading.jl:1184
Precompiling Queryverse
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947 ┌ Info: Precompiling Queryverse [612083be-0b0f-5412-89c1-4e7c75506a58] └ @ Base loading.jl:1186 ┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947 ┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947 ┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947 ┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947
Precompiling Sundials
┌ Info: Recompiling stale cache file /Users/florian.oswald/.julia/compiled/v1.1/Sundials/j8Ppj.ji for Sundials [c3572dad-4567-51f8-b174-8c6c989267f4] └ @ Base loading.jl:1184 ┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947 ┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947
Precompiling StatsPlots
┌ Info: Precompiling StatsPlots [f3b207a7-027a-5e70-b257-86293d7955fd] └ @ Base loading.jl:1186 ┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947 ┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947 ┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947
Precompiling Traceur
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947 ┌ Info: Precompiling Traceur [37b6cedf-1f77-55f8-9503-c64b63398394] └ @ Base loading.jl:1186 ┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947
Precompiling BandedMatrices
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947 ┌ Info: Recompiling stale cache file /Users/florian.oswald/.julia/compiled/v1.1/BandedMatrices/OxlqV.ji for BandedMatrices [aae01518-5342-5314-be14-df237901396f] └ @ Base loading.jl:1184
Precompiling Flux
┌ Info: Precompiling Flux [587475ba-b771-5e3f-ad9e-33799f191a9c] └ @ Base loading.jl:1186 ┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947
Precompiling Plots
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947 ┌ Info: Precompiling Plots [91a5bcdd-55d7-5caf-9e0b-520d859cae80] └ @ Base loading.jl:1186 ┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947 ┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947
Precompiling BlackBoxOptim
┌ Info: Precompiling BlackBoxOptim [a134a8b2-14d6-55f6-9291-3336d3ab0209] └ @ Base loading.jl:1186 ┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947
Precompiling DiffEqDiffTools
┌ Info: Recompiling stale cache file /Users/florian.oswald/.julia/compiled/v1.1/DiffEqDiffTools/UPWPf.ji for DiffEqDiffTools [01453d9d-ee7c-5054-8395-0335cb756afa] └ @ Base loading.jl:1184
Precompiling LightGraphs
┌ Info: Precompiling LightGraphs [093fc24a-ae57-5d10-9952-331d41423f4d] └ @ Base loading.jl:1186 ┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947 ┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947 ┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947 ┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947
Precompiling Interpolations
┌ Info: Recompiling stale cache file /Users/florian.oswald/.julia/compiled/v1.1/Interpolations/VpKVx.ji for Interpolations [a98d9a8b-a2ab-59e6-89dd-64a1c18fca59] └ @ Base loading.jl:1184 ┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947
Precompiling JuMP
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947 ┌ Info: Precompiling JuMP [4076af6c-e467-56ae-b986-b466b2749572] └ @ Base loading.jl:1186 ┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947
Precompiling Query
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947 ┌ Info: Precompiling Query [1a8c2f83-1ff3-5112-b086-8aa67b057ba1] └ @ Base loading.jl:1186 ┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947
Precompiling Ipopt
┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947 ┌ Info: Precompiling Ipopt [b6b21f68-93f8-5de0-b562-5493be1d77c9] └ @ Base loading.jl:1186 ┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947
Precompiling NLopt
┌ Info: Recompiling stale cache file /Users/florian.oswald/.julia/compiled/v1.1/NLopt/faRdv.ji for NLopt [76087f3c-5699-56af-9a33-bf431cd00edd] └ @ Base loading.jl:1184 ┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947 ┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947
Precompiling LaTeXStrings
┌ Info: Precompiling LaTeXStrings [b964fa9f-0449-5b57-a5c2-d3ea65f4040f] └ @ Base loading.jl:1186 ┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947
Precompiling QuantEcon
┌ Info: Precompiling QuantEcon [fcd29c91-0bd7-5a09-975d-7ac3f643a60c] └ @ Base loading.jl:1186 ┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947 ┌ Warning: Module Compat with build ID 95224860146667 is missing from the cache. │ This may mean Compat [34da2185-b29b-5c13-b0c7-acf172513d20] does not support precompilation but is imported by a module that does. └ @ Base loading.jl:947
using LinearAlgebra, Statistics, Compat
using DataFrames, RDatasets, DataFramesMeta, CategoricalArrays, Query, VegaLite
using DataVoyager, GLM, RegressionTables, FixedEffectModels
┌ Info: Precompiling RDatasets [ce6b1742-4840-55fa-b093-852dadbb1d8b] └ @ Base loading.jl:1186 ┌ Info: Precompiling DataFramesMeta [1313f7d8-7da2-5740-9ea0-a2ca25f37964] └ @ Base loading.jl:1186 ┌ Info: Precompiling Query [1a8c2f83-1ff3-5112-b086-8aa67b057ba1] └ @ Base loading.jl:1186 ┌ Info: Precompiling VegaLite [112f6efa-9a02-5b7d-90c0-432ed331239a] └ @ Base loading.jl:1186 ┌ Info: Precompiling DataVoyager [5721bf48-af8e-5845-8445-c9e18126e773] └ @ Base loading.jl:1186 ┌ Info: Precompiling GLM [38e38edf-8417-5370-95a0-9cbb8c7f171a] └ @ Base loading.jl:1186 ┌ Info: Precompiling RegressionTables [d519eb52-b820-54da-95a6-98e1306fdade] └ @ Base loading.jl:1186 WARNING: could not import StatsBase.df_residual into FixedEffectModels
A useful package for working with data is DataFrames.jl
The most important data type provided is a DataFrame
, a two dimensional array for storing heterogeneous data
Although data can be heterogeneous within a DataFrame
, the contents of the columns must be homogeneous
(of the same type)
This is analogous to a data.frame
in R, a DataFrame
in Pandas (Python) or, more loosely, a spreadsheet in Excel
There are a few different ways to create a DataFrame
The first is to set up columns and construct a dataframe by assigning names
using DataFrames, RDatasets # RDatasets provides good standard data examples from R
# note use of missing
commodities = ["crude", "gas", "gold", "silver"]
last_price = [4.2, 11.3, 12.1, missing]
df = DataFrame(commod = commodities, price = last_price)
commod | price | |
---|---|---|
String | Float64⍰ | |
1 | crude | 4.2 |
2 | gas | 11.3 |
3 | gold | 12.1 |
4 | silver | missing |
Columns of the DataFrame
can be accessed by name using a symbol df[:row]
or a struct-style df.row
, as below
df[:price]
4-element Array{Union{Missing, Float64},1}: 4.2 11.3 12.1 missing
df.price
4-element Array{Union{Missing, Float64},1}: 4.2 11.3 12.1 missing
Note that the type of this array has values Union{Missing, Float64}
since it was created with a missing
value
df.commod
4-element Array{String,1}: "crude" "gas" "gold" "silver"
The DataFrames.jl
package provides a number of methods for acting on DataFrame
’s, such as describe
describe(df)
variable | mean | min | median | max | nunique | nmissing | eltype | |
---|---|---|---|---|---|---|---|---|
Symbol | Union… | Any | Union… | Any | Union… | Union… | DataType | |
1 | commod | crude | silver | 4 | String | |||
2 | price | 9.2 | 4.2 | 11.3 | 12.1 | 1 | Float64 |
While often data will be generated all at once, or read from a file, you can add to a DataFrame
by providing the key parameters
nt = (commod = "nickel", price= 5.1)
push!(df, nt)
commod | price | |
---|---|---|
String | Float64⍰ | |
1 | crude | 4.2 |
2 | gas | 11.3 |
3 | gold | 12.1 |
4 | silver | missing |
5 | nickel | 5.1 |
Named tuples can also be used to construct a DataFrame
, and have it properly deduce all types
ntx = (t = 1, col1 = 3.0)
df2 = DataFrame([ntx])
push!(df2, (t=2, col1 = 4.0))
t | col1 | |
---|---|---|
Int64 | Float64 | |
1 | 1 | 3.0 |
2 | 2 | 4.0 |
As we discussed in fundamental types, the semantics of missing
are that mathematical operations will not silently ignore it
In order to allow missing
in a column, you can create/load the DataFrame
from a source with missing
’s, or call allowmissing!
on a column
allowmissing!(df2, :col1) # necessary to add in a for col1
push!(df2, (t=3, col1 = missing))
push!(df2, (t=4, col1 = 5.1))
t | col1 | |
---|---|---|
Int64 | Float64⍰ | |
1 | 1 | 3.0 |
2 | 2 | 4.0 |
3 | 3 | missing |
4 | 4 | 5.1 |
eltype(df2.col1)
Union{Missing, Float64}
We can see the propagation of missing
to caller functions, as well as a way to efficiently calculate with non-missing data
@show mean(df2.col1)
@show mean(skipmissing(df2.col1))
mean(df2.col1) = missing mean(skipmissing(df2.col1)) = 4.033333333333333
4.033333333333333
And to replace the missing
df2.col1 .= coalesce.(df2.col1, 0.0) # replace all missing with 0.0
4-element Array{Union{Missing, Float64},1}: 3.0 4.0 0.0 5.1
One way to do an additional calculation with a DataFrame
is to tuse the @transform
macro from DataFramesMeta.jl
using DataFramesMeta
f(x) = x^2
df2 = @transform(df2, col2 = f.(:col1), col3 = 1.0)
t | col1 | col2 | col3 | |
---|---|---|---|---|
Int64 | Float64⍰ | Float64 | Float64 | |
1 | 1 | 3.0 | 9.0 | 1.0 |
2 | 2 | 4.0 | 16.0 | 1.0 |
3 | 3 | 0.0 | 0.0 | 1.0 |
4 | 4 | 5.1 | 26.01 | 1.0 |
For data that is categorical
using CategoricalArrays
id = [1, 2, 3, 4]
y = ["old", "young", "young", "old"]
y = CategoricalArray(y)
df = DataFrame(id=id, y=y)
id | y | |
---|---|---|
Int64 | Categorical… | |
1 | 1 | old |
2 | 2 | young |
3 | 3 | young |
4 | 4 | old |
levels(df.y)
2-element Array{String,1}: "old" "young"
The DataFrame
(and similar types that fulfill a standard generic interface) can fit into a variety of packages
One set of them is the QueryVerse
Note: The QueryVerse, in the same spirit as R’s tidyverse, makes heavy use of the pipeline syntax |>
x = 3.0
f(x) = x^2
g(x) = log(x)
@show g(f(x))
@show x |> f |> g; # pipes nest function calls
g(f(x)) = 2.1972245773362196 (x |> f) |> g = 2.1972245773362196
To give an example directly from the source of the LINQ inspired Query.jl
using Query
df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])
x = @from i in df begin
@where i.age>50
@select {i.name, i.children}
@collect DataFrame
end
name | children | |
---|---|---|
String | Int64 | |
1 | Kirk | 2 |
While it is possible to just use the Plots.jl
library, there may be better options for displaying tabular data – such as VegaLite.jl
using RDatasets, VegaLite
iris = dataset("datasets", "iris")
iris |> @vlplot(
:point,
x=:PetalLength,
y=:PetalWidth,
color=:Species
)
iris
SepalLength | SepalWidth | PetalLength | PetalWidth | Species | |
---|---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | Categorical… | |
1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
7 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
8 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
9 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
10 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |
11 | 5.4 | 3.7 | 1.5 | 0.2 | setosa |
12 | 4.8 | 3.4 | 1.6 | 0.2 | setosa |
13 | 4.8 | 3.0 | 1.4 | 0.1 | setosa |
14 | 4.3 | 3.0 | 1.1 | 0.1 | setosa |
15 | 5.8 | 4.0 | 1.2 | 0.2 | setosa |
16 | 5.7 | 4.4 | 1.5 | 0.4 | setosa |
17 | 5.4 | 3.9 | 1.3 | 0.4 | setosa |
18 | 5.1 | 3.5 | 1.4 | 0.3 | setosa |
19 | 5.7 | 3.8 | 1.7 | 0.3 | setosa |
20 | 5.1 | 3.8 | 1.5 | 0.3 | setosa |
21 | 5.4 | 3.4 | 1.7 | 0.2 | setosa |
22 | 5.1 | 3.7 | 1.5 | 0.4 | setosa |
23 | 4.6 | 3.6 | 1.0 | 0.2 | setosa |
24 | 5.1 | 3.3 | 1.7 | 0.5 | setosa |
25 | 4.8 | 3.4 | 1.9 | 0.2 | setosa |
26 | 5.0 | 3.0 | 1.6 | 0.2 | setosa |
27 | 5.0 | 3.4 | 1.6 | 0.4 | setosa |
28 | 5.2 | 3.5 | 1.5 | 0.2 | setosa |
29 | 5.2 | 3.4 | 1.4 | 0.2 | setosa |
30 | 4.7 | 3.2 | 1.6 | 0.2 | setosa |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
Another useful tool for exploring tabular data is DataVoyager.jl
using DataVoyager
iris |> Voyager()
While Julia is not intended as a replacement for R, Stata, and similar specialty languages, it has a growing number of packages aimed at statistics and econometrics
Many of the packages live in the JuliaStats organization
A few to point out
using GLM
x = randn(100)
y = 0.9 .* x + 0.5 * rand(100)
df = DataFrame(x=x, y=y)
ols = lm(@formula(y ~ x), df) # R-style notation
StatsModels.DataFrameRegressionModel{LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}} Formula: y ~ 1 + x Coefficients: Estimate Std.Error t value Pr(>|t|) (Intercept) 0.254383 0.0142877 17.8044 <1e-31 x 0.901569 0.0132557 68.0138 <1e-83
To display the results in a useful tables for LaTeX and the REPL, use RegressionTables developed at SciencesPo by Johannes Böhm for output similar to the Stata package esttab and the R package stargazer
using RegressionTables
regtable(ols)
---------------------- y -------- (1) ---------------------- (Intercept) 0.254*** (0.014) x 0.902*** (0.013) ---------------------- Estimator OLS ---------------------- N 100 R2 0.979 ----------------------
While Julia may be overkill for estimating a simple linear regression, fixed-effects estimation with dummies for multiple variables are much more computationally intensive. The FixedEffectsModel.jl is a great tool for this task.
For a 2-way fixed-effect, taking the example directly from the documentation using cigarette consumption data
using FixedEffectModels
cigar = dataset("plm", "Cigar")
cigar.StateCategorical = categorical(cigar.State) # convert to categorical data type
cigar.YearCategorical = categorical(cigar.Year)
describe(cigar)
variable | mean | min | median | max | nunique | nmissing | eltype | |
---|---|---|---|---|---|---|---|---|
Symbol | Union… | Any | Union… | Any | Union… | Int64 | DataType | |
1 | State | 26.8261 | 1 | 26.5 | 51 | 0 | Int64 | |
2 | Year | 77.5 | 63 | 77.5 | 92 | 0 | Int64 | |
3 | Price | 68.6999 | 23.4 | 52.3 | 201.9 | 0 | Float64 | |
4 | Pop | 4537.11 | 319.0 | 3174.0 | 30703.3 | 0 | Float64 | |
5 | Pop16 | 3366.62 | 215.2 | 2315.3 | 22920.0 | 0 | Float64 | |
6 | CPI | 73.5967 | 30.6 | 62.9 | 140.3 | 0 | Float64 | |
7 | NDI | 7525.02 | 1322.57 | 6281.2 | 23074.0 | 0 | Float64 | |
8 | Sales | 123.951 | 53.4 | 121.2 | 297.9 | 0 | Float64 | |
9 | Pimin | 62.8993 | 23.4 | 46.4 | 178.5 | 0 | Float64 | |
10 | StateCategorical | 1 | 51 | 46 | 0 | CategoricalValue{Int64,UInt32} | ||
11 | YearCategorical | 63 | 92 | 30 | 0 | CategoricalValue{Int64,UInt32} |
fixedeffectresults = reg(cigar,
@model(Sales ~ NDI, fe = StateCategorical + YearCategorical,
weights = Pop, vcov = cluster(StateCategorical)))
regtable(fixedeffectresults)
---------------------------- Sales --------- (1) ---------------------------- NDI -0.005*** (0.001) ---------------------------- StateCategorical Yes YearCategorical Yes ---------------------------- Estimator OLS ---------------------------- N 1,380 R2 0.804 ----------------------------
?reg
search: reg Regex regtable RegexMatch RegressionModel RegressionResult
Estimate a linear model with high dimensional categorical variables / instrumental variables
df::AbstractDataFrame
model::Model
: A model created using @model
save::Union{Bool, Symbol} = false
: Should residuals and eventual estimated fixed effects saved in a dataframe? Use save = :residuals
to only save residuals. Use save = :fe
to only save fixed effects.method::Symbol = :lsmr
: Method to deman regressors. :lsmr
is akin to conjugate gradient descent. With parallel use :lsmr_parallel
. To use multi threaded use lsmr_threads
. Other choices are :qr
and :cholesky
(factorization methods)contrasts::Dict = Dict()
An optional Dict of contrast codings for each categorical variable in the formula
. Any unspecified variables will have DummyCoding
.maxiter::Integer = 10000
: Maximum number of iterationstol::Real =1e-8
: ToleranceModels with instruments variables are estimated using 2SLS. reg
tests for weak instruments by computing the Kleibergen-Paap rk Wald F statistic, a generalization of the Cragg-Donald Wald F statistic for non i.i.d. errors. The statistic is similar to the one returned by the Stata command ivreg2
.
using DataFrames, RDatasets, FixedEffectModels
df = dataset("plm", "Cigar")
df[:StateC] = categorical(df[:State])
df[:YearC] = categorical(df[:Year])
reg(df, @model(Sales ~ Price, fe = StateC + YearC))
reg(df, @model(Sales ~ NDI, fe = StateC + StateC&Year))
reg(df, @model(Sales ~ NDI, fe = StateC*Year))
reg(df, @model(Sales ~ (Price ~ Pimin)))
reg(df, @model(Sales ~ Price, weights = Pop))
reg(df, @model(Sales ~ NDI, subset = State .< 30))
reg(df, @model(Sales ~ NDI, vcov = robust))
reg(df, @model(Sales ~ NDI, vcov = cluster(StateC)))
reg(df, @model(Sales ~ NDI, vcov = cluster(StateC + YearC)))
reg(df, @model(Sales ~ YearC), contrasts = Dict(:YearC => DummyCoding(base = 80)))
"""
FEbenchmark(;N=10_000_000,K= 100)
A function to create a 2-way fixed effect dataset. `N` observations, `K`
and `N/K` categories for Fixed effects respectively. We generate 7 regressors as well as a weight vector.
"""
function FEbenchmark(;N=10_000_000,K= 100)
id1_int = Int.(rand(1:(N/K), N))
id2_int = Int.(rand(1:K, N))
w = cos.(id1_int)
x1 = 5 * cos.(id1_int) + 5 * sin.(id2_int) + randn(N)
x2 = cos.(id1_int) + sin.(id2_int) + randn(N)
x3 = cos.(id1_int) + sin.(id2_int) + randn(N)
x4 = cos.(id1_int) + sin.(id2_int) + randn(N)
x5 = cos.(id1_int) + sin.(id2_int) + randn(N)
x6 = cos.(id1_int) + sin.(id2_int) + randn(N)
x7 = cos.(id1_int) + sin.(id2_int) + randn(N)
y= 3 .* x1 .+ 5 .* x2 .+ x3 .+ x4 .+ x5 .+ x6 .+ x7 .+ cos.(id1_int) .+ cos.(id2_int).^2 .+ randn(N)
df = DataFrame(id1 = categorical(id1_int),id1_int = id1_int,
id2 = categorical(id2_int), id2_int = id2_int,
x1 = x1,
x2 = x2,
x3 = x3,
x4 = x4,
x5 = x5,
x6 = x6,
x7 = x7,
w = w, y = y)
df
end
FEbenchmark
# create our dataset
df = FEbenchmark()
# run a bunch of different models
@time reg(df, @model(y ~ x1 + x2))
# 0.582029 seconds (852 allocations: 535.311 MiB, 18.28% gc time)
@time reg(df, @model(y ~ x1 + x2, vcov = cluster(id2)))
# 0.809652 seconds (1.55 k allocations: 649.821 MiB, 14.40% gc time)
@time reg(df, @model(y ~ x1 + x2, fe = id1))
# 1.655732 seconds (1.21 k allocations: 734.353 MiB, 16.88% gc time)
@time reg(df, @model(y ~ x1 + x2, fe = id1, vcov = cluster(id1)))
# 2.113248 seconds (499.66 k allocations: 811.751 MiB, 15.08% gc time)
@time reg(df, @model(y ~ x1 + x2, fe = id1 + id2))
# 3.553678 seconds (1.36 k allocations: 1005.101 MiB, 10.55% gc time))
# let's take the biggest model as our benchmark timing
@time result = reg(df, @model(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7, fe = id1 + id2))
UndefVarError: categorical not defined Stacktrace: [1] #FEbenchmark#5(::Int64, ::Int64, ::Function) at ./In[2]:20 [2] FEbenchmark() at ./In[2]:8 [3] top-level scope at In[3]:1
# using R lfe package
using RCall
rstring = R"""
# factorize data
# notice that we have to pass the integer data in id1_int.
r_d = $df
r_d[,"id1"] = factor(r_d[,"id1_int"])
r_d[,"id2"] = factor(r_d[,"id2_int"])
library(lfe)
r_time = system.time(lfe <- felm(y ~x1 + x2 + x3 + x4 + x5 + x6 + x7| id1 + id2, data=r_d))
print(paste0("R lfe time: ",r_time[1]))
print(summary(lfe))
r_time[1]
#
## OLS.
#ols <- lm(y ~x1 + x2 + x3 + x4 + x5 + x6 + x7 + id1 + id2 -1, data=r_d)
#summary(ols)
"""
┌ Info: Precompiling RCall [6f49c342-dc21-5d91-9882-a32aef131414] └ @ Base loading.jl:1186
UndefVarError: df not defined Stacktrace: [1] top-level scope at /Users/florian.oswald/.julia/packages/RCall/29zDq/src/macros.jl:66 [2] top-level scope at In[1]:3
# use option save=true to store estimated fixed effects
@time result = reg(df, @model(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7, fe = id1 + id2), save = true)
using RegressionTables
regtable(result)
20.261003 seconds (2.21 k allocations: 3.096 GiB, 9.50% gc time) ---------------------- y ---------- (1) ---------------------- x1 3.000*** (0.000) x2 5.000*** (0.000) x3 1.000*** (0.000) x4 1.000*** (0.000) x5 1.000*** (0.000) x6 1.000*** (0.000) x7 1.000*** (0.000) ---------------------- id1 Yes id2 Yes ---------------------- Estimator OLS ---------------------- N 10,000,000 R2 0.999 ----------------------
cor(result.augmentdf.id1,result.augmentdf.id2) # AKM sorting?
-0.00020160140915523005
# more models
result2 = reg(df, @model(y ~ x1 + x2 + x3, fe = id1 + id2))
result3 = reg(df, @model(y ~ x1 + x3, fe = id1 ))
Fixed Effect Model ============================================================== Number of obs: 10000000 Degrees of freedom: 100002 R2: 0.953 R2 within: 0.907 F-Statistic: 8.63907e7 p-value: 0.000 Iterations: 1 Converged: true Estimate Std.Error t value Pr(>|t|) Lower 95% Upper 95% x1 4.60759 0.000596781 7720.74 0.000 4.60642 4.60876 x3 1.31957 0.00179355 735.73 0.000 1.31606 1.32309
============================================================== -------------------------------------------------------------- ==============================================================
# how to look at those? -> regtable!
regtable(result, result2, result3 )
------------------------------------------------ y ------------------------------------ (1) (2) (3) ------------------------------------------------ x1 3.000*** 3.000*** 4.608*** (0.000) (0.001) (0.001) x2 5.000*** 5.001*** (0.000) (0.001) x3 1.000*** 1.000*** 1.320*** (0.000) (0.001) (0.002) x4 1.000*** (0.000) x5 1.000*** (0.000) x6 1.000*** (0.000) x7 1.000*** (0.000) ------------------------------------------------ id1 Yes Yes Yes id2 Yes Yes ------------------------------------------------ Estimator OLS OLS OLS ------------------------------------------------ N 10,000,000 10,000,000 10,000,000 R2 0.999 0.993 0.953 ------------------------------------------------
# latex?
regtable(result, result2, result3, renderSettings = latexOutput() ) # latexOutput(filename::String)
\begin{tabular}{lrrr} \toprule & \multicolumn{3}{c}{y} \\ \cmidrule(lr){2-4} & (1) & (2) & (3) \\ \midrule x1 & 3.000*** & 3.000*** & 4.608*** \\ & (0.000) & (0.001) & (0.001) \\ x2 & 5.000*** & 5.001*** & \\ & (0.000) & (0.001) & \\ x3 & 1.000*** & 1.000*** & 1.320*** \\ & (0.000) & (0.001) & (0.002) \\ x4 & 1.000*** & & \\ & (0.000) & & \\ x5 & 1.000*** & & \\ & (0.000) & & \\ x6 & 1.000*** & & \\ & (0.000) & & \\ x7 & 1.000*** & & \\ & (0.000) & & \\ \midrule id1 & Yes & Yes & Yes \\ id2 & Yes & Yes & \\ \midrule Estimator & OLS & OLS & OLS \\ \midrule $N$ & 10,000,000 & 10,000,000 & 10,000,000 \\ $R^2$ & 0.999 & 0.993 & 0.953 \\ \bottomrule \end{tabular}
# options
regtable(result, result2, result3 , below_statistic = :blank)
------------------------------------------------ y ------------------------------------ (1) (2) (3) ------------------------------------------------ x1 3.000*** 3.000*** 4.608*** x2 5.000*** 5.001*** x3 1.000*** 1.000*** 1.320*** x4 1.000*** x5 1.000*** x6 1.000*** x7 1.000*** ------------------------------------------------ id1 Yes Yes Yes id2 Yes Yes ------------------------------------------------ Estimator OLS OLS OLS ------------------------------------------------ N 10,000,000 10,000,000 10,000,000 R2 0.999 0.993 0.953 ------------------------------------------------
using GLM
ols = lm(@formula(y ~ x1 + x2),df)
StatsModels.DataFrameRegressionModel{LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}} Formula: y ~ 1 + x1 + x2 Coefficients: Estimate Std.Error t value Pr(>|t|) (Intercept) 0.497543 0.000865641 574.768 <1e-99 x1 4.01824 0.000235451 17066.2 <1e-99 x2 5.20388 0.000849626 6124.91 <1e-99
regtable(result, result2,ols,below_decoration = s -> "[$(s)]")
-------------------------------------------------- y ------------------------------------ (1) (2) (3) -------------------------------------------------- x1 3.000*** 3.000*** 4.018*** [0.000] [0.001] [0.000] x2 5.000*** 5.001*** 5.204*** [0.000] [0.001] [0.001] x3 1.000*** 1.000*** [0.000] [0.001] x4 1.000*** [0.000] x5 1.000*** [0.000] x6 1.000*** [0.000] x7 1.000*** [0.000] (Intercept) 0.498*** [0.001] -------------------------------------------------- id1 Yes Yes id2 Yes Yes -------------------------------------------------- Estimator OLS OLS OLS -------------------------------------------------- N 10,000,000 10,000,000 10,000,000 R2 0.999 0.993 0.989 --------------------------------------------------
💪 @jmboehm
To explore the data use the interactive DataVoyager and VegaLite
# cigar |> Voyager()
cigar |> @vlplot(
:point,
x=:Price,
y=:Sales,
color=:Year,
size=:NDI
)