This lecture is a slightly modified version of https://lectures.quantecon.org/jl/data_statistical_packages.html. Thank you to the amazing QuantEcon.org team!
This lecture explores some of the key packages for working with data and doing statistics in Julia
In particular, we will examine the DataFrame
object in detail (i.e., construction, manipulation, querying, visualization, and nuances like missing data)
While Julia is not an ideal language for pure cookie-cutter statistical analysis, it has many useful packages to provide those tools as part of a more general solution
Examples include GLM.jl
and FixedEffectModels.jl
, which we discuss
This list is not exhaustive, and others can be found in organizations such as JuliaStats, JuliaData, and QueryVerse
using LinearAlgebra, Statistics
using DataFrames, RDatasets, DataFramesMeta, CategoricalArrays, Query, VegaLite
using DataVoyager, GLM, RegressionTables, FixedEffectModels
A useful package for working with data is DataFrames.jl
The most important data type provided is a DataFrame
, a two-dimensional array for storing heterogeneous data
Although data can be heterogeneous within a DataFrame
, the contents of the columns must be homogeneous
(of the same type)
This is analogous to a data.frame
in R, a DataFrame
in Pandas (Python) or, more loosely, a spreadsheet in Excel
There are a few different ways to create a DataFrame
The first is to set up columns and construct a dataframe by assigning names
using DataFrames, RDatasets # RDatasets provides good standard data examples from R
# note use of missing
commodities = ["crude", "gas", "gold", "silver"]
last_price = [4.2, 11.3, 12.1, missing]
df = DataFrame(commod = commodities, price = last_price)
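A few basic functions are handy for a first look at a new DataFrame; the following is a minimal sketch using standard DataFrames functions (size, names, first):
@show size(df)     # (rows, columns)
@show names(df)    # column names
first(df, 2)       # peek at the first two rows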
Columns of the DataFrame
can be accessed by name, either with the indexing syntax df[!, :price] or the struct-style field access df.price, as below
df[!,:price]
df.price
Note that the element type of this array is Union{Missing, Float64}, since the column was created with a missing value
df.commod
The DataFrames.jl
package provides a number of methods for acting on DataFrame
’s, such as describe
describe(df)
While data will often be generated all at once, or read from a file, you can add rows to a DataFrame by pushing values with the matching column names (e.g. a named tuple)
nt = (commod = "nickel", price= 5.1)
push!(df, nt)
Named tuples can also be used to construct a DataFrame
, and have it properly deduce all types
ntx = (t = 1, col1 = 3.0)
df2 = DataFrame([ntx])
push!(df2, (t=2, col1 = 4.0))
As we discussed in fundamental types, the semantics of missing
are that mathematical operations will not silently ignore it
In order to allow missing
in a column, you can create/load the DataFrame
from a source with missing
’s, or call allowmissing!
on a column
allowmissing!(df2, :col1) # necessary to allow missing values in col1
push!(df2, (t=3, col1 = missing))
push!(df2, (t=4, col1 = 5.1))
eltype(df2.col1)
We can see the propagation of missing
to caller functions, as well as a way to efficiently calculate with non-missing data
@show mean(df2.col1)
@show mean(skipmissing(df2.col1))
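Another option, rather than computing around the missing values, is to drop the offending rows entirely; a minimal sketch using DataFrames' dropmissing, which returns a new DataFrame and leaves df2 unchanged:
dropmissing(df2, :col1)   # rows where col1 is missing are removed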
And to replace the missing values
df2.col1 .= coalesce.(df2.col1, 0.0) # replace all missing with 0.0
One way to do an additional calculation with a DataFrame
is to use the @transform
macro from DataFramesMeta.jl
using DataFramesMeta
f(x) = x^2
df2 = @transform(df2, col2 = f.(:col1), col3 = ones(size(df2,1)))
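Other everyday manipulations include selecting a subset of columns and sorting rows; a short sketch using the standard DataFrames functions select and sort:
select(df2, :col1, :col2)      # keep only these columns (returns a new DataFrame)
sort(df2, :col2, rev = true)   # sort rows by col2 in descending order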
For data that is categorical, use the CategoricalArrays.jl package
using CategoricalArrays
id = [1, 2, 3, 4]
y = ["old", "young", "young", "old"]
y = CategoricalArray(y)
df = DataFrame(id=id, y=y)
levels(df.y)
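Categorical columns also work smoothly with split-apply-combine operations; as a minimal sketch, using DataFrames' groupby and combine (the mean of id per group is purely for illustration):
using Statistics
combine(groupby(df, :y), :id => mean)   # mean of id within each level of y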
The DataFrame
(and similar types that fulfill a standard generic interface) can fit into a variety of packages
One set of them is the QueryVerse
Note: The QueryVerse, in the same spirit as R’s tidyverse, makes heavy use of the pipeline syntax |>
x = 3.0
f(x) = x^2
g(x) = log(x)
@show g(f(x))
@show x |> f |> g; # pipes nest function calls
To give an example taken directly from the documentation of the LINQ-inspired Query.jl
using Query
df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])
x = @from i in df begin
    @where i.age > 50
    @select {i.name, i.children}
    @collect DataFrame
end
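Query.jl also ships standalone pipeline commands which compose with |>; a sketch of the same query written in that style (assuming the df from above):
df |> @filter(_.age > 50) |> @map({_.name, _.children}) |> DataFrame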
While it is possible to just use the Plots.jl
library, there may be better options for displaying tabular data – such as VegaLite.jl
# Error with VegaLite and NodeJS in Jupyter
#using RDatasets, VegaLite
#iris = dataset("datasets", "iris")
#iris |> @vlplot(
# :point,
# x=:PetalLength,
# y=:PetalWidth,
# color=:Species
#)
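Since the VegaLite cell above is commented out (due to the NodeJS issue in Jupyter), here is a rough Plots.jl equivalent as a sketch; note that Plots.jl is not in this lecture's package list, so this assumes it has been added separately:
# using Plots
# iris = dataset("datasets", "iris")
# scatter(iris.PetalLength, iris.PetalWidth, group = iris.Species,
#         xlabel = "PetalLength", ylabel = "PetalWidth")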
Another useful tool for exploring tabular data is DataVoyager.jl
using DataVoyager
iris = dataset("datasets", "iris")
iris |> Voyager()
While Julia is not intended as a replacement for R, Stata, and similar specialty languages, it has a growing number of packages aimed at statistics and econometrics
Many of the packages live in the JuliaStats organization
A few to point out
using GLM
x = randn(100)
y = 0.9 .* x + 0.5 * rand(100)
df = DataFrame(x=x, y=y)
ols = lm(@formula(y ~ x), df) # R-style notation
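The fitted model supports the standard accessor functions from StatsBase (re-exported by GLM); for example, as a minimal sketch:
@show coef(ols)       # estimated intercept and slope
@show stderror(ols)   # standard errors of the coefficients
@show r2(ols);        # R² of the regression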
To display the results in useful tables for LaTeX and the REPL, use RegressionTables.jl, developed at SciencesPo by Johannes Böhm, which produces output similar to the Stata package esttab and the R package stargazer
using RegressionTables
regtable(ols)
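RegressionTables can also write the same table to a LaTeX file via the renderSettings option used later in this lecture; a minimal sketch (the filename is arbitrary):
# regtable(ols, renderSettings = latexOutput("ols_table.tex"))   # writes LaTeX to ols_table.tex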
While Julia may be overkill for estimating a simple linear regression, fixed-effects estimation with dummies for multiple categorical variables is much more computationally intensive, and FixedEffectModels.jl is a great tool for this task.
For a two-way fixed-effects model, we take the example directly from the package documentation, which uses cigarette consumption data
using FixedEffectModels
cigar = dataset("plm", "Cigar")
cigar.StateCategorical = categorical(cigar.State) # convert to categorical data type
cigar.YearCategorical = categorical(cigar.Year)
describe(cigar)
fixedeffectresults = reg(cigar,
                         @formula(Sales ~ NDI + fe(StateCategorical) + fe(YearCategorical)),
                         Vcov.cluster(:StateCategorical),
                         weights = :Pop)
regtable(fixedeffectresults)
?reg
"""
FEbenchmark(;N=10_000_000,K= 100)
A function to create a 2-way fixed effect dataset. `N` observations, `K`
and `N/K` categories for Fixed effects respectively. We generate 7 regressors as well as a weight vector.
"""
function FEbenchmark(; N = 10_000_000, K = 100)
    id1_int = Int.(rand(1:(N/K), N))
    id2_int = Int.(rand(1:K, N))
    w = cos.(id1_int)
    x1 = 5 * cos.(id1_int) + 5 * sin.(id2_int) + randn(N)
    x2 = cos.(id1_int) + sin.(id2_int) + randn(N)
    x3 = cos.(id1_int) + sin.(id2_int) + randn(N)
    x4 = cos.(id1_int) + sin.(id2_int) + randn(N)
    x5 = cos.(id1_int) + sin.(id2_int) + randn(N)
    x6 = cos.(id1_int) + sin.(id2_int) + randn(N)
    x7 = cos.(id1_int) + sin.(id2_int) + randn(N)
    y = 3 .* x1 .+ 5 .* x2 .+ x3 .+ x4 .+ x5 .+ x6 .+ x7 .+ cos.(id1_int) .+ cos.(id2_int).^2 .+ randn(N)
    df = DataFrame(id1 = categorical(id1_int), id1_int = id1_int,
                   id2 = categorical(id2_int), id2_int = id2_int,
                   x1 = x1, x2 = x2, x3 = x3, x4 = x4, x5 = x5, x6 = x6, x7 = x7,
                   w = w, y = y)
    df
end
# create our dataset
df = FEbenchmark()
# run a bunch of different models
# the comments show timings of a previous version of FixedEffectModels (faster!)
@time reg(df, @formula(y ~ x1 + x2))
# 0.582029 seconds (852 allocations: 535.311 MiB, 18.28% gc time)
@time reg(df, @formula(y ~ x1 + x2), Vcov.cluster(:id2))
# 0.809652 seconds (1.55 k allocations: 649.821 MiB, 14.40% gc time)
@time reg(df, @formula(y ~ x1 + x2 + fe(id1)))
# 1.655732 seconds (1.21 k allocations: 734.353 MiB, 16.88% gc time)
@time reg(df, @formula(y ~ x1 + x2 + fe(id1)), Vcov.cluster(:id1))
# 2.113248 seconds (499.66 k allocations: 811.751 MiB, 15.08% gc time)
@time reg(df, @formula(y ~ x1 + x2 + fe(id1) + fe(id2)))
# 3.553678 seconds (1.36 k allocations: 1005.101 MiB, 10.55% gc time))
# let's take the biggest model as our benchmark timing
@time result = reg(df, @formula(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + fe(id1) + fe(id2)))
# using R lfe package
using RCall
rstring = R"""
# factorize data
# notice that we have to pass the integer data in id1_int.
r_d = $df
r_d[,"id1"] = factor(r_d[,"id1_int"])
r_d[,"id2"] = factor(r_d[,"id2_int"])
library(lfe)
r_time = system.time(lfe <- felm(y ~x1 + x2 + x3 + x4 + x5 + x6 + x7| id1 + id2, data=r_d))
print(paste0("R lfe time: ",r_time[1]))
print(summary(lfe))
r_time[1]
#
## OLS.
#ols <- lm(y ~x1 + x2 + x3 + x4 + x5 + x6 + x7 + id1 + id2 -1, data=r_d)
#summary(ols)
"""
# use option save=true to store estimated fixed effects
@time result = reg(df, @formula(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + fe(id1) + fe(id2)), save = true)
# using RegressionTables
# regtable(result)
# current incompatibility https://github.com/jmboehm/RegressionTables.jl/issues/35
cor(result.augmentdf.fe_id1,result.augmentdf.fe_id2) # AKM sorting?
# more models
result2 = reg(df, @formula(y ~ x1 + x2 + x3 + fe(id1) + fe(id2)))
result3 = reg(df, @formula(y ~ x1 + x3 + fe(id1) ))
# how to look at those? -> regtable!
# regtable(result, result2, result3 )
# latex?
# regtable(result, result2, result3, renderSettings = latexOutput() ) # latexOutput(filename::String)
# options
# regtable(result, result2, result3 , below_statistic = :blank)
using GLM
ols = lm(@formula(y ~ x1 + x2),df)
# regtable(result, result2,ols,below_decoration = s -> "[$(s)]")
💪 Thanks to @jmboehm for RegressionTables.jl!
To explore the data, use the interactive DataVoyager and VegaLite
# cigar |> Voyager()
cigar |> @vlplot(
:point,
x=:Price,
y=:Sales,
color=:Year,
size=:NDI
)