Data and Statistics Packages

 Florian Oswald

This is lecture is a slightly modified version of https://lectures.quantecon.org/jl/data_statistical_packages.html Thank you to the amazing Quantecon.org team!

Overview

This lecture explores some of the key packages for working with data and doing statistics in Julia

In particular, we will examine the DataFrame object in detail (i.e., construction, manipulation, querying, visualization, and nuances like missing data)

While Julia is not an ideal language for pure cookie-cutter statistical analysis, it has many useful packages to provide those tools as part of a more general solution

Examples include GLM.jl and FixedEffectModels.jl, which we discuss

This list is not exhaustive, and others can be found in organizations such as JuliaStats, JuliaData, and QueryVerse

In [1]:
using LinearAlgebra, Statistics
In [2]:
using DataFrames, RDatasets, DataFramesMeta, CategoricalArrays, Query, VegaLite
┌ Info: Precompiling DataFramesMeta [1313f7d8-7da2-5740-9ea0-a2ca25f37964]
└ @ Base loading.jl:1273
┌ Info: Precompiling VegaLite [112f6efa-9a02-5b7d-90c0-432ed331239a]
└ @ Base loading.jl:1273
ERROR: LoadError: NodeJS not properly installed. Please run
Pkg.build("NodeJS")
Stacktrace:
 [1] error(::String) at .\error.jl:33
 [2] top-level scope at C:\Users\cms27\.julia\packages\NodeJS\rx0mH\src\NodeJS.jl:7
 [3] include at .\boot.jl:328 [inlined]
 [4] include_relative(::Module, ::String) at .\loading.jl:1105
 [5] include(::Module, ::String) at .\Base.jl:31
 [6] top-level scope at none:2
 [7] eval at .\boot.jl:330 [inlined]
 [8] eval(::Expr) at .\client.jl:425
 [9] top-level scope at .\none:3
in expression starting at C:\Users\cms27\.julia\packages\NodeJS\rx0mH\src\NodeJS.jl:6
ERROR: LoadError: Failed to precompile NodeJS [2bd173c7-0d6d-553b-b6af-13a54713934c] to C:\Users\cms27\.julia\compiled\v1.3\NodeJS\A9Jiy_qcK9J.ji.
Stacktrace:
 [1] error(::String) at .\error.jl:33
 [2] compilecache(::Base.PkgId, ::String) at .\loading.jl:1283
 [3] _require(::Base.PkgId) at .\loading.jl:1024
 [4] require(::Base.PkgId) at .\loading.jl:922
 [5] require(::Module, ::Symbol) at .\loading.jl:917
 [6] include at .\boot.jl:328 [inlined]
 [7] include_relative(::Module, ::String) at .\loading.jl:1105
 [8] include(::Module, ::String) at .\Base.jl:31
 [9] top-level scope at none:2
 [10] eval at .\boot.jl:330 [inlined]
 [11] eval(::Expr) at .\client.jl:425
 [12] top-level scope at .\none:3
in expression starting at C:\Users\cms27\.julia\packages\VegaLite\sHyyT\src\VegaLite.jl:3
Failed to precompile VegaLite [112f6efa-9a02-5b7d-90c0-432ed331239a] to C:\Users\cms27\.julia\compiled\v1.3\VegaLite\lrJqi_qcK9J.ji.

Stacktrace:
 [1] error(::String) at .\error.jl:33
 [2] compilecache(::Base.PkgId, ::String) at .\loading.jl:1283
 [3] _require(::Base.PkgId) at .\loading.jl:1024
 [4] require(::Base.PkgId) at .\loading.jl:922
 [5] require(::Module, ::Symbol) at .\loading.jl:917
 [6] top-level scope at In[2]:1
In [3]:
using DataVoyager, GLM, RegressionTables, FixedEffectModels
┌ Info: Precompiling DataVoyager [5721bf48-af8e-5845-8445-c9e18126e773]
└ @ Base loading.jl:1273
ERROR: LoadError: NodeJS not properly installed. Please run
Pkg.build("NodeJS")
Stacktrace:
 [1] error(::String) at .\error.jl:33
 [2] top-level scope at C:\Users\cms27\.julia\packages\NodeJS\rx0mH\src\NodeJS.jl:7
 [3] include at .\boot.jl:328 [inlined]
 [4] include_relative(::Module, ::String) at .\loading.jl:1105
 [5] include(::Module, ::String) at .\Base.jl:31
 [6] top-level scope at none:2
 [7] eval at .\boot.jl:330 [inlined]
 [8] eval(::Expr) at .\client.jl:425
 [9] top-level scope at .\none:3
in expression starting at C:\Users\cms27\.julia\packages\NodeJS\rx0mH\src\NodeJS.jl:6
ERROR: LoadError: Failed to precompile NodeJS [2bd173c7-0d6d-553b-b6af-13a54713934c] to C:\Users\cms27\.julia\compiled\v1.3\NodeJS\A9Jiy_qcK9J.ji.
Stacktrace:
 [1] error(::String) at .\error.jl:33
 [2] compilecache(::Base.PkgId, ::String) at .\loading.jl:1283
 [3] _require(::Base.PkgId) at .\loading.jl:1024
 [4] require(::Base.PkgId) at .\loading.jl:922
 [5] require(::Module, ::Symbol) at .\loading.jl:917
 [6] include at .\boot.jl:328 [inlined]
 [7] include_relative(::Module, ::String) at .\loading.jl:1105
 [8] include(::Module, ::String) at .\Base.jl:31
 [9] top-level scope at none:2
 [10] eval at .\boot.jl:330 [inlined]
 [11] eval(::Expr) at .\client.jl:425
 [12] top-level scope at .\none:3
in expression starting at C:\Users\cms27\.julia\packages\VegaLite\sHyyT\src\VegaLite.jl:3
ERROR: LoadError: Failed to precompile VegaLite [112f6efa-9a02-5b7d-90c0-432ed331239a] to C:\Users\cms27\.julia\compiled\v1.3\VegaLite\lrJqi_qcK9J.ji.
Stacktrace:
 [1] error(::String) at .\error.jl:33
 [2] compilecache(::Base.PkgId, ::String) at .\loading.jl:1283
 [3] _require(::Base.PkgId) at .\loading.jl:1024
 [4] require(::Base.PkgId) at .\loading.jl:922
 [5] require(::Module, ::Symbol) at .\loading.jl:917
 [6] include at .\boot.jl:328 [inlined]
 [7] include_relative(::Module, ::String) at .\loading.jl:1105
 [8] include(::Module, ::String) at .\Base.jl:31
 [9] top-level scope at none:2
 [10] eval at .\boot.jl:330 [inlined]
 [11] eval(::Expr) at .\client.jl:425
 [12] top-level scope at .\none:3
in expression starting at C:\Users\cms27\.julia\packages\DataVoyager\7x2Tx\src\DataVoyager.jl:5
Failed to precompile DataVoyager [5721bf48-af8e-5845-8445-c9e18126e773] to C:\Users\cms27\.julia\compiled\v1.3\DataVoyager\5nnU1_qcK9J.ji.

Stacktrace:
 [1] error(::String) at .\error.jl:33
 [2] compilecache(::Base.PkgId, ::String) at .\loading.jl:1283
 [3] _require(::Base.PkgId) at .\loading.jl:1024
 [4] require(::Base.PkgId) at .\loading.jl:922
 [5] require(::Module, ::Symbol) at .\loading.jl:917
 [6] top-level scope at In[3]:1

DataFrames

A useful package for working with data is DataFrames.jl

The most important data type provided is a DataFrame, a two dimensional array for storing heterogeneous data

Although data can be heterogeneous within a DataFrame, the contents of the columns must be homogeneous (of the same type)

This is analogous to a data.frame in R, a DataFrame in Pandas (Python) or, more loosely, a spreadsheet in Excel

There are a few different ways to create a DataFrame

Constructing a DataFrame

The first is to set up columns and construct a dataframe by assigning names

In [8]:
using DataFrames, RDatasets  # RDatasets provides good standard data examples from R

# note use of missing
commodities = ["crude", "gas", "gold", "silver"]
last_price = [4.2, 11.3, 12.1, missing]
df = DataFrame(commod = commodities, price = last_price)
Out[8]:

4 rows × 2 columns

commodprice
StringFloat64⍰
1crude4.2
2gas11.3
3gold12.1
4silvermissing

Columns of the DataFrame can be accessed by name using a symbol df[:row] or a struct-style df.row, as below

In [9]:
df[!,:price]
Out[9]:
4-element Array{Union{Missing, Float64},1}:
  4.2     
 11.3     
 12.1     
   missing
In [10]:
df.price
Out[10]:
4-element Array{Union{Missing, Float64},1}:
  4.2     
 11.3     
 12.1     
   missing

Note that the type of this array has values Union{Missing, Float64} since it was created with a missing value

In [11]:
df.commod
Out[11]:
4-element Array{String,1}:
 "crude" 
 "gas"   
 "gold"  
 "silver"

The DataFrames.jl package provides a number of methods for acting on DataFrame’s, such as describe

In [12]:
describe(df)
Out[12]:

2 rows × 8 columns

variablemeanminmedianmaxnuniquenmissingeltype
SymbolUnion…AnyUnion…AnyUnion…Union…Type
1commodcrudesilver4String
2price9.24.211.312.11Union{Missing, Float64}

While often data will be generated all at once, or read from a file, you can add to a DataFrame by providing the key parameters

In [13]:
nt = (commod = "nickel", price= 5.1)
push!(df, nt)
Out[13]:

5 rows × 2 columns

commodprice
StringFloat64⍰
1crude4.2
2gas11.3
3gold12.1
4silvermissing
5nickel5.1

Named tuples can also be used to construct a DataFrame, and have it properly deduce all types

In [14]:
ntx = (t = 1, col1 = 3.0)
df2 = DataFrame([ntx])
push!(df2, (t=2, col1 = 4.0))
Out[14]:

2 rows × 2 columns

tcol1
Int64Float64
113.0
224.0

Working with Missing

As we discussed in fundamental types, the semantics of missing are that mathematical operations will not silently ignore it

In order to allow missing in a column, you can create/load the DataFrame from a source with missing’s, or call allowmissing! on a column

In [15]:
allowmissing!(df2, :col1) # necessary to add in a for col1
push!(df2, (t=3, col1 = missing))
push!(df2, (t=4, col1 = 5.1))
Out[15]:

4 rows × 2 columns

tcol1
Int64Float64⍰
113.0
224.0
33missing
445.1
In [16]:
eltype(df2.col1)
Out[16]:
Union{Missing, Float64}

We can see the propagation of missing to caller functions, as well as a way to efficiently calculate with non-missing data

In [17]:
@show mean(df2.col1)
@show mean(skipmissing(df2.col1))
mean(df2.col1) = missing
mean(skipmissing(df2.col1)) = 4.033333333333333
Out[17]:
4.033333333333333

And to replace the missing

In [18]:
df2.col1  .= coalesce.(df2.col1, 0.0) # replace all missing with 0.0
Out[18]:
4-element Array{Union{Missing, Float64},1}:
 3.0
 4.0
 0.0
 5.1

Manipulating and Transforming DataFrames

One way to do an additional calculation with a DataFrame is to tuse the @transform macro from DataFramesMeta.jl

In [ ]:

In [19]:
using DataFramesMeta
f(x) = x^2
df2 = @transform(df2, col2 = f.(:col1), col3 = ones(size(df2,1)))
Out[19]:

4 rows × 4 columns

tcol1col2col3
Int64Float64⍰Float64Float64
113.09.01.0
224.016.01.0
330.00.01.0
445.126.011.0

Categorical Data

For data that is categorical

In [20]:
using CategoricalArrays
id = [1, 2, 3, 4]
y = ["old", "young", "young", "old"]
y = CategoricalArray(y)
df = DataFrame(id=id, y=y)
Out[20]:

4 rows × 2 columns

idy
Int64Categorical…
11old
22young
33young
44old
In [21]:
levels(df.y)
Out[21]:
2-element Array{String,1}:
 "old"  
 "young"

Visualization, Querying, and Plots

The DataFrame (and similar types that fulfill a standard generic interface) can fit into a variety of packages

One set of them is the QueryVerse

Note: The QueryVerse, in the same spirit as R’s tidyverse, makes heavy use of the pipeline syntax |>

In [22]:
x = 3.0
f(x) = x^2
g(x) = log(x)

@show g(f(x))
@show x |> f |> g; # pipes nest function calls
g(f(x)) = 2.1972245773362196
(x |> f) |> g = 2.1972245773362196

To give an example directly from the source of the LINQ inspired Query.jl

In [23]:
using Query

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])

x = @from i in df begin
    @where i.age>50
    @select {i.name, i.children}
    @collect DataFrame
end
Out[23]:

1 rows × 2 columns

namechildren
StringInt64
1Kirk2

While it is possible to just use the Plots.jl library, there may be better options for displaying tabular data – such as VegaLite.jl

In [26]:
# Error with VegaLite and NodeJS in Jupyter
#using RDatasets, VegaLite
#iris = dataset("datasets", "iris")

#iris |> @vlplot(
#    :point,
#    x=:PetalLength,
#    y=:PetalWidth,
#    color=:Species
#)

Another useful tool for exploring tabular data is DataVoyager.jl

using DataVoyager
iris |> Voyager()

Statistics and Econometrics

While Julia is not intended as a replacement for R, Stata, and similar specialty languages, it has a growing number of packages aimed at statistics and econometrics

Many of the packages live in the JuliaStats organization

A few to point out

  • StatsBase has basic statistical functions such as geometric and harmonic means, auto-correlations, robust statistics, etc.
  • StatsFuns has a variety of mathematical functions and constants such as pdf and cdf of many distributions, softmax, etc.

General Linear Models

To run linear regressions and similar statistics, use the GLM package

In [30]:
using GLM

x = randn(100)
y = 0.9 .* x + 0.5 * rand(100)
df = DataFrame(x=x, y=y)
ols = lm(@formula(y ~ x), df) # R-style notation
Out[30]:
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

y ~ 1 + x

Coefficients:
──────────────────────────────────────────────────────────────────────────
             Estimate  Std. Error  t value  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)  0.255058   0.0150475  16.9502    <1e-30   0.225196   0.284919
x            0.929994   0.0193715  48.0085    <1e-69   0.891552   0.968436
──────────────────────────────────────────────────────────────────────────

To display the results in a useful tables for LaTeX and the REPL, use RegressionTables developed at SciencesPo by Johannes Böhm for output similar to the Stata package esttab and the R package stargazer

In [31]:
# Need to install NVIDIA Cuda (?)
using RegressionTables
regtable(ols)
----------------------
                  y   
              --------
                   (1)
----------------------
(Intercept)   0.255***
               (0.015)
x             0.930***
               (0.019)
----------------------
Estimator          OLS
----------------------
N                  100
R2               0.959
----------------------


Fixed Effects

While Julia may be overkill for estimating a simple linear regression, fixed-effects estimation with dummies for multiple variables are much more computationally intensive. The FixedEffectsModel.jl is a great tool for this task.

For a 2-way fixed-effect, taking the example directly from the documentation using cigarette consumption data

In [42]:
using FixedEffectModels
cigar = dataset("plm", "Cigar")
cigar.StateCategorical =  categorical(cigar.State)   # convert to categorical data type
cigar.YearCategorical =  categorical(cigar.Year)

describe(cigar)
Out[42]:

11 rows × 8 columns

variablemeanminmedianmaxnuniquenmissingeltype
SymbolUnion…AnyUnion…AnyUnion…NothingDataType
1State26.8261126.551Int64
2Year77.56377.592Int64
3Price68.699923.452.3201.9Float64
4Pop4537.11319.03174.030703.3Float64
5Pop163366.62215.22315.322920.0Float64
6CPI73.596730.662.9140.3Float64
7NDI7525.021322.576281.223074.0Float64
8Sales123.95153.4121.2297.9Float64
9Pimin62.899323.446.4178.5Float64
10StateCategorical15146CategoricalValue{Int64,UInt32}
11YearCategorical639230CategoricalValue{Int64,UInt32}
In [38]:
# fixedeffectresults = reg(cigar, 
#   @formula(Sales ~ NDI + fe(StateCategorical) + fe(YearCategorical)), 
#    Vcov.cluster(:StateCategorical), 
#    weights = :Pop)

fixedeffectresults = reg(cigar, @model(Sales ~ NDI, fe = StateCategorical + YearCategorical,
                            weights = Pop, vcov = cluster(StateCategorical)))
regtable(fixedeffectresults)
----------------------------
                     Sales  
                   ---------
                         (1)
----------------------------
NDI                -0.005***
                     (0.001)
----------------------------
StateCategorical         Yes
YearCategorical          Yes
----------------------------
Estimator                OLS
----------------------------
N                      1,380
R2                   -70.104
----------------------------


In [23]:
?reg
search: reg Regex regtable RegexMatch RegressionModel RegressionTables aggregate

Out[23]:

Estimate a linear model with high dimensional categorical variables / instrumental variables

Arguments

  • df: a Table
  • FormulaTerm: A formula created using @formula
  • CovarianceEstimator: A method to compute the variance-covariance matrix
  • save::Union{Bool, Symbol} = false: Should residuals and eventual estimated fixed effects saved in a dataframe? Use save = :residuals to only save residuals. Use save = :fe to only save fixed effects.
  • method::Symbol = :lsmr: Method to deman regressors. :lsmr is akin to conjugate gradient descent. To use LSMR on multiple cores, use :lsmr_parallel. To use LSMR with multiple threads, use lsmr_threads. To use LSMR on GPU, use lsmr_gpu(requires CuArrays. Use the option double_precision = false to use Float32 on the GPU.).
  • contrasts::Dict = Dict() An optional Dict of contrast codings for each categorical variable in the formula. Any unspecified variables will have DummyCoding.
  • maxiter::Integer = 10000: Maximum number of iterations
  • double_precision::Bool: Should the demeaning operation use Float64 rather than Float32? Default to true.
  • tol::Real Tolerance. Default to 1e-8 if double_precision = true, 1e-6 otherwise.

Details

Models with instruments variables are estimated using 2SLS. reg tests for weak instruments by computing the Kleibergen-Paap rk Wald F statistic, a generalization of the Cragg-Donald Wald F statistic for non i.i.d. errors. The statistic is similar to the one returned by the Stata command ivreg2.

Examples

using RDatasets, FixedEffectModels
df = dataset("plm", "Cigar")
@time reg(df, @formula(Sales ~ Price))
reg(df, @formula(Sales ~ Price + fe(State) + fe(Year)))
reg(df, @formula(Sales ~ NDI + fe(State) + fe(State)&Year))
reg(df, @formula(Sales ~ NDI + fe(State)*Year))
reg(df, @formula(Sales ~ (Price ~ Pimin)))
@time reg(df, @formula(Sales ~ Price), weights = :Pop)
reg(df, @formula(Sales ~ NDI), Vcov.robust()))
reg(df, @formula(Sales ~ NDI), Vcov.cluster(:State)))
reg(df, @formula(Sales ~ NDI), Vcov.cluster(:State , :Year)))
df.Yearc = categoricay(df.Year)
reg(df, @formula(Sales ~ YearC), contrasts = Dict(:YearC => DummyCoding(base = 80)))

R LFE vs FixedEffects.jl Shootout

  • High dimensional Fixed effect estimation is compuationally intensive
  • The R packge lfe is a good option that uses multithreading, if available.
  • Let's see how FixedEffects.jl compares to it.
In [39]:
"""
    FEbenchmark(;N=10_000_000,K= 100)

A function to create a 2-way fixed effect dataset. `N` observations, `K` 
and `N/K` categories for Fixed effects respectively. We generate 7 regressors as well as a weight vector.
"""
function FEbenchmark(;N=10_000_000,K= 100)
    id1_int = Int.(rand(1:(N/K), N))
    id2_int = Int.(rand(1:K, N))
    w = cos.(id1_int)
    
    x1 = 5 * cos.(id1_int) + 5 * sin.(id2_int) + randn(N)
    x2 =  cos.(id1_int) + sin.(id2_int) + randn(N)
    x3 =  cos.(id1_int) + sin.(id2_int) + randn(N)
    x4 =  cos.(id1_int) + sin.(id2_int) + randn(N)
    x5 =  cos.(id1_int) + sin.(id2_int) + randn(N)
    x6 =  cos.(id1_int) + sin.(id2_int) + randn(N)
    x7 =  cos.(id1_int) + sin.(id2_int) + randn(N)
    y= 3 .* x1 .+ 5 .* x2 .+ x3 .+ x4 .+ x5 .+ x6 .+ x7 .+ cos.(id1_int) .+ cos.(id2_int).^2 .+ randn(N)
    df = DataFrame(id1 = categorical(id1_int),id1_int = id1_int, 
                   id2 = categorical(id2_int), id2_int = id2_int,
                   x1 = x1, 
                   x2 = x2,
                   x3 = x3,
                   x4 = x4,
                   x5 = x5,
                   x6 = x6,
                   x7 = x7,        
                   w = w, y = y)
    df
end
Out[39]:
FEbenchmark
In [41]:
# create our dataset
df = FEbenchmark()

# run a bunch of different models
# the comments show timings of a previous version of FixedEffectModels (faster!)
@time reg(df, @formula(y ~ x1 + x2))
# 0.582029 seconds (852 allocations: 535.311 MiB, 18.28% gc time)
@time reg(df, @formula(y ~ x1 + x2), Vcov.cluster(:id2))
# 0.809652 seconds (1.55 k allocations: 649.821 MiB, 14.40% gc time)
@time reg(df, @formula(y ~ x1 + x2 + fe(id1)))
# 1.655732 seconds (1.21 k allocations: 734.353 MiB, 16.88% gc time)
@time reg(df, @formula(y ~ x1 + x2 + fe(id1)), Vcov.cluster(:id1))
#  2.113248 seconds (499.66 k allocations: 811.751 MiB, 15.08% gc time)
@time reg(df, @formula(y ~ x1 + x2 + fe(id1) + fe(id2)))
# 3.553678 seconds (1.36 k allocations: 1005.101 MiB, 10.55% gc time))

# let's take the biggest model as our benchmark timing
@time result = reg(df, @formula(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + fe(id1) + fe(id2)))
  1.839808 seconds (548 allocations: 1.044 GiB, 25.86% gc time)
UndefVarError: Vcov not defined

Stacktrace:
 [1] top-level scope at util.jl:155
 [2] top-level scope at In[41]:7
In [43]:
using DataFrames, RDatasets, FixedEffectModels
df = dataset("plm", "Cigar")
reg(df, @formula(Sales ~ NDI + fe(State) + fe(Year)), Vcov.cluster(:State), weights = :Pop)
UndefVarError: fe not defined

Stacktrace:
 [1] top-level scope at In[43]:3
In [26]:
# using R lfe package
using RCall
rstring = R"""
# factorize data
# notice that we have to pass the integer data in id1_int.
r_d = $df
r_d[,"id1"] = factor(r_d[,"id1_int"])
r_d[,"id2"] = factor(r_d[,"id2_int"])

library(lfe)
r_time = system.time(lfe <- felm(y ~x1 + x2 + x3 + x4 + x5 + x6 + x7| id1 + id2, data=r_d))
print(paste0("R lfe time: ",r_time[1]))
print(summary(lfe))
r_time[1]
#
## OLS.
#ols <- lm(y ~x1 + x2 + x3 + x4 + x5 + x6 + x7 + id1 + id2 -1, data=r_d)
#summary(ols)
"""
┌ Warning: RCall.jl: Loading required package: Matrix
└ @ RCall /Users/florian.oswald/.julia/packages/RCall/g7dhB/src/io.jl:113
[1] "R lfe time: 35.968"

Call:
   felm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 | id1 + id2,      data = r_d) 

Residuals:
    Min      1Q  Median      3Q     Max 
-5.3031 -0.6708  0.0000  0.6710  5.2332 

Coefficients:
    Estimate Std. Error t value Pr(>|t|)    
x1 3.0000275  0.0003177    9443   <2e-16 ***
x2 5.0003924  0.0003178   15736   <2e-16 ***
x3 1.0004647  0.0003178    3148   <2e-16 ***
x4 1.0000344  0.0003180    3145   <2e-16 ***
x5 0.9999615  0.0003179    3146   <2e-16 ***
x6 1.0000432  0.0003179    3146   <2e-16 ***
x7 0.9997845  0.0003179    3145   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1 on 9899894 degrees of freedom
Multiple R-squared(full model): 0.9986   Adjusted R-squared: 0.9986 
Multiple R-squared(proj model): 0.975   Adjusted R-squared: 0.9748 
F-statistic(full model):6.903e+04 on 100105 and 9899894 DF, p-value: < 2.2e-16 
F-statistic(proj model): 5.518e+07 on 7 and 9899894 DF, p-value: < 2.2e-16 


Out[26]:
RObject{RealSxp}
user.self 
   35.968 
In [27]:
# use option save=true to store estimated fixed effects
@time result = reg(df, @formula(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + fe(id1) + fe(id2)), save = true)
# using RegressionTables
# regtable(result)
# current incompatibility https://github.com/jmboehm/RegressionTables.jl/issues/35
 32.108785 seconds (4.54 M allocations: 5.811 GiB, 13.20% gc time)
Out[27]:
                      Fixed Effect Model                      
==============================================================
Number of obs:        10000000  Degrees of freedom:     100107
R2:                      0.999  R2 Adjusted:             0.999
F Statistic:         5.51766e7  p-value:                 0.000
R2 within:               0.975  Iterations:                  6
Converged:                true  
==============================================================
     Estimate   Std.Error t value Pr(>|t|) Lower 95% Upper 95%
--------------------------------------------------------------
x1    3.00003 0.000317701 9442.92    0.000    2.9994   3.00065
x2    5.00039 0.000317761 15736.3    0.000   4.99977   5.00102
x3    1.00046 0.000317782 3148.28    0.000  0.999842   1.00109
x4    1.00003 0.000317983 3144.93    0.000  0.999411   1.00066
x5   0.999961 0.000317887 3145.65    0.000  0.999338   1.00058
x6    1.00004 0.000317855 3146.22    0.000   0.99942   1.00067
x7   0.999784 0.000317874 3145.23    0.000  0.999161   1.00041
==============================================================
In [28]:
cor(result.augmentdf.fe_id1,result.augmentdf.fe_id2)   # AKM sorting?
Out[28]:
8.026537147977652e-5
In [29]:
# more models
result2 = reg(df, @formula(y ~ x1 + x2 + x3 + fe(id1) + fe(id2)))
result3 = reg(df, @formula(y ~ x1      + x3 + fe(id1)  ))
Out[29]:
                      Fixed Effect Model                      
==============================================================
Number of obs:        10000000  Degrees of freedom:     100002
R2:                      0.953  R2 Adjusted:             0.952
F Statistic:         4.81708e7  p-value:                 0.000
R2 within:               0.907  Iterations:                  1
Converged:                true  
==============================================================
     Estimate   Std.Error t value Pr(>|t|) Lower 95% Upper 95%
--------------------------------------------------------------
x1    4.60739 0.000596852 7719.48    0.000   4.60622   4.60856
x3    1.32066  0.00179345  736.38    0.000   1.31715   1.32418
==============================================================
In [30]:
# how to look at those? -> regtable!
# regtable(result, result2, result3 )
In [31]:
# latex?
# regtable(result, result2, result3, renderSettings = latexOutput() )  # latexOutput(filename::String)
In [32]:
# options
# regtable(result, result2, result3 , below_statistic = :blank)
In [33]:
using GLM
ols = lm(@formula(y ~ x1 + x2),df)
Out[33]:
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

y ~ 1 + x1 + x2

Coefficients:
─────────────────────────────────────────────────────────────────────────────
             Estimate   Std. Error    t value  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────────
(Intercept)  0.498854  0.000865368    576.465    <1e-99   0.497158    0.50055
x1           4.01794   0.000235342  17072.8      <1e-99   4.01748     4.0184 
x2           5.20352   0.000849066   6128.52     <1e-99   5.20186     5.20519
─────────────────────────────────────────────────────────────────────────────
In [34]:
# regtable(result, result2,ols,below_decoration = s -> "[$(s)]")

💪 @jmboehm

To explore the data use the interactive DataVoyager and VegaLite

In [35]:
# cigar |> Voyager()

cigar |> @vlplot(
    :point,
    x=:Price,
    y=:Sales,
    color=:Year,
    size=:NDI
)
Out[35]:
708090Year5,00010,00015,00020,000NDI050100150200Price050100150200250300Sales