SciencesPo Computational Economics 2018

  • Welcome!
  • Why do Economists (and Social Scientists in general) have to talk about Computation?
  • This practitioner's guide by Gentzkow and Shapiro is still relevant, even if, in computing time, 2014 is a long time ago. I highly recommend you read it.

Gentzkow and Shapiro

Here is a good rule of thumb: If you are trying to solve a problem, and there are multi-billion dollar firms whose entire business model depends on solving the same problem, and there are whole courses at your university devoted to how to solve that problem, you might want to figure out what the experts do and see if you can’t learn something from it.

Gentzkow and Shapiro

  • This is the spirit of this course.
  • We want to learn computation as practised by the experts.
  • We don't want to - ever - reinvent the wheel.

Computation Basics

  • It is important that we understand some basics about computers.
  • Even though software (and hardware) keeps getting more sophisticated, there is still considerable margin for "human error". This doesn't necessarily mean that something is wrong, but certain ways of doing things may have severe performance implications.
  • Whatever else happens, you write the code, and some ways of writing code perform very differently from others.


Julia? Why Julia?

  • There is no single best software. It all depends on:
    1. The problem at hand.
      • You are fine with Stata if you need to run a probit.
      • Languages have different comparative advantages with regards to different tasks.
    2. Preferences of the analyst. Some people just love their software.
  • That said, there are some general themes we should keep in mind when choosing a software.
  • Stephen Johnson at MIT has a good pitch.

High versus Low Level Languages

  • High-level languages for technical computing: Matlab, Python, R, ...
    • you get going immediately
    • very important for exploratory coding or data analysis
    • You don't want to worry about type declarations and compilers at the exploratory stage
  • High-level languages are slow.
    • The traditional solution to this: rewrite the performance-critical parts in a low-level language.
    • Using Rcpp or Cython etc. is a bit like Stargate: you lose control the moment you pass the barrier to C++ for a little bit. (Even though those are great solutions.) If the C++ part of your code becomes large, testing this code becomes increasingly difficult.
    • You end up spending your time coding C++. But that has its own drawbacks.

Julia is Fast

  • Julia is fast.
    • But julia is also a high-level dynamic language. How come?
    • The JIT compiler.
    • The LLVM project.
  • Julia is open source (and it's for free)
    • It's for free. Did I say that it's for free?
    • You will never again worry about licenses. Want to run 1000 instances of julia? Do it.
    • The entire standard library of julia is written in julia (and not in C, as is the case for much of R, matlab or python, for example). It's easy to look at the source and understand how things work.
  • Julia is a very modern language, combining the best features of many other languages.
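Because the standard library is itself julia code, you can inspect it straight from the REPL. A minimal sketch (works on recent julia versions): `@which` reports the method a call dispatches to, including the source file where it is defined, and `@edit` would open that file in your editor.

```julia
# Ask julia which method implements a given call.
# The returned Method object carries the source file and line number,
# so you can jump straight to the (julia!) implementation.
m = @which sum([1, 2, 3])
println(m)        # e.g. sum(a::AbstractArray; ...) in Base at reducedim.jl:...
println(m.file)   # path to the julia source file defining this method
```

This is a good habit whenever a library function behaves unexpectedly: read its actual source instead of guessing.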

What does Julia want to achieve?

  • There is a wall built into the scientific software stack
  • Julia cofounder Stefan Karpinski talks about the two-language problem
  • key: the wall creates a social barrier. Developer and User are different people.

The Wall in the scientific software stack


Economists and Their Software

  • In A Comparison of Programming Languages in Economics, the authors compare some widely used languages on a close to identical piece of code.
  • It can be quite contentious to talk about Software to Economists.
  • Takeaway from that paper:
    • There are some very good alternatives to fortran
    • fortran is not faster than C++
    • It seems pointless to invest either money or time in matlab, given the many good options that are available for free.

The Fundamental Tradeoff

Developer Time (Your Time) is Much More Expensive than Computing Time

  • It may well be that the runtime of a fortran program is one third of the time it takes to run the program in julia, or anything else for that matter.
  • However, the time it takes to develop that program is very likely to be (much) longer in fortran.
  • Particularly if you want to hold your program to the same quality standards.

Takeaway

  • Given my personal experience with many of the above languages, I think julia is a very good tool for economists with non-trivial computational tasks.
  • This is why I am using it for demonstrations in this course.

A Second Fundamental Tradeoff

  • Regardless of the software you use, there is one main problem with computation.
  • It concerns speed vs accuracy.
  • You may be able to compute something very fast, but only with low accuracy (i.e. with a large numerical margin of error).
  • On the other hand, you may be able to get a very accurate solution, but it may take an unrealistic amount of time to get there.
  • You have to face that tradeoff and decide for yourself what's best.

A Warning about Optimizing your Code!

In Donald Knuth's paper "Structured Programming With GoTo Statements", he wrote:

"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."

Julia Workflow

  • We will use the latest stable version of julia. Today that is v0.6.2.
  • For today, stay within this notebook.
  • In general, you will want to install julia on your computer
  • the most stable workflow involves a text editor and a julia terminal next to it
  • You would develop your code in a text file (say "develop.jl"), and then just do include("develop.jl") in the terminal
  • I use sublime text as an editor
  • Atom is also very good

Some Numerical Concepts and Julia

  • Machine epsilon: The gap between 1.0 and the next larger representable floating-point number, type eps(). (It is not the smallest representable number.)
  • Infinity: A number greater than all representable numbers on your computer. It obeys some arithmetic rules.
    • Overflow: If you perform an operation where the result is greater than the largest representable number.
    • Underflow: You take two (very small) representable numbers, but the result is smaller than the smallest positive representable number.
    • In Julia, fixed-size integer arithmetic wraps around at the end of the representable range:
      x = typemax(Int64)
      x + 1
      
  • Integers and Floating Point Numbers.
  • Single and Double Precision.
  • In Julia, all of these are different numeric primitive types (head over to julia manual for a second).
  • Julia also supports Arbitrary Precision Arithmetic (BigInt, BigFloat). If you use it, overflow is no longer an issue.
  • See min and max for different types:
In [1]:
for T in [Int8,Int16,Int32,Int64,Int128,UInt8,UInt16,UInt32,
		  UInt64,UInt128,Float32,Float64]
         println("$(lpad(T,7)): [$(typemin(T)),$(typemax(T))]")
end
   Int8: [-128,127]
  Int16: [-32768,32767]
  Int32: [-2147483648,2147483647]
  Int64: [-9223372036854775808,9223372036854775807]
 Int128: [-170141183460469231731687303715884105728,170141183460469231731687303715884105727]
  UInt8: [0,255]
 UInt16: [0,65535]
 UInt32: [0,4294967295]
 UInt64: [0,18446744073709551615]
UInt128: [0,340282366920938463463374607431768211455]
Float32: [-Inf,Inf]
Float64: [-Inf,Inf]
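The three concepts above can be checked directly at the REPL. A quick sketch (the exact eps value assumes the default 64-bit floats):

```julia
# machine epsilon: gap between 1.0 and the next larger Float64
println(eps())                     # 2.220446049250313e-16

# fixed-size integer arithmetic wraps around (overflow):
x = typemax(Int64)
println(x + 1 == typemin(Int64))   # true

# arbitrary precision (BigInt) does not overflow:
println(big(typemax(Int64)) + 1)   # 9223372036854775808
```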

Interacting with the Julia REPL

  • REPL?
  • different modes: command, help, search, shell
  • incremental search with CTRL r
  • documented in the manual

Julia Primer: Types

  • Types are at the core of what makes julia a great language.
  • Everything in julia is represented as a datatype.
  • Remember the different numeric types from before? Those are types.
  • The manual, as usual, is very informative on this.
  • From the wikibook on julia, here is a representation of the numeric type graph:
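You can walk that type graph yourself with typeof and supertype. A small sketch:

```julia
# every value has a concrete type:
println(typeof(1.0))               # Float64
println(typeof(1))                 # Int64 (on a 64-bit machine)

# climb the abstract part of the hierarchy:
println(supertype(Float64))        # AbstractFloat
println(supertype(AbstractFloat))  # Real

# <: tests the subtype relation
println(Int64 <: Integer)          # true
println(Float64 <: Integer)        # false
```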

Julia Primer: Custom Types

  • The great thing is that you can create your own types.
  • Going with the example from the wikibook, we could have Jaguar and Cat as subtypes of Feline. Note that only abstract types can have subtypes in Julia, so Feline itself is declared with abstract type Feline end rather than as a concrete struct:
In [1]:
abstract type Feline end

struct Jaguar <: Feline
	weight::Float64
	sound::String
end
struct Cat <: Feline
	weight::Float64
	sound::String
end

# is a cat a Feline?
Cat <: Feline
Out[1]:
true
In [2]:
# create a cat and a jaguar
c = Cat(15.2,"miauu")
j = Jaguar(95.1,"ROARRRRRR!!!!!")

# is c an instance of type Cat?
isa(c,Cat)

# methods
function behave(c::Cat)
    println(c.sound)
    println("my weight is $(c.weight) kg! should go on a diet")
end
function behave(j::Jaguar)
    println(j.sound)
    println("Step back! I'm a $(j.weight) kg jaguar.")
end
Out[2]:
behave (generic function with 2 methods)
In [3]:
# make a cat behave:
behave(c)

# and a jaguar
behave(j)
miauu
my weight is 15.2 kg! should go on a diet
ROARRRRRR!!!!!
Step back! I'm a 95.1 kg jaguar.

Julia Primer: Multiple Dispatch

  • You have just seen multiple dispatch in action: the same function name dispatches to different methods, depending on the types of the input arguments.

Julia Primer: Important performance lesson - Type Stability

  • If you don't declare types, julia will try to infer them for you.
  • DANGER: don't change types along the way.
    • julia optimizes your code for a specific type configuration.
    • adding two Ints is not the same CPU operation as adding two Floats. The difference matters.
  • Example
In [4]:
function t1(n)
    s  = 0  # typeof(s) = Int
    for i in 1:n
        s += s/i
    end
end
function t2(n)
    s  = 0.0   # typeof(s) = Float64
    for i in 1:n
        s += s/i
    end
end
@time t1(10000000)
@time t2(10000000)
  0.037028 seconds (22.55 k allocations: 1.205 MiB)
  0.011815 seconds (24.43 k allocations: 1.327 MiB)
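To make the difference concrete, here is a variant of the same pair of functions that actually returns the accumulator (a sketch; the function names are made up, and you could further inspect both versions with @code_warntype):

```julia
# type-UNSTABLE: s starts as an Int and becomes a Float64 inside the loop,
# so the compiler cannot fix s to a single machine representation
function unstable(n)
    s = 0               # Int
    for i in 1:n
        s += s / i      # s / i is a Float64, so s changes type
    end
    return s
end

# type-STABLE: s is a Float64 from the start and stays one
function stable(n)
    s = 0.0             # Float64
    for i in 1:n
        s += s / i
    end
    return s
end

# both compute the same value, but the compiler can generate
# much tighter machine code for the stable version
println(unstable(10) == stable(10))   # true
```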

Julia Modules

  • A module is a new workspace - a new global scope
  • A module defines a separate namespace
  • There is an illustrative example available at the julia manual, let's look at it.

An example Module

module MyModule
    # which other modules to use: imports
    using Lib
    using BigLib: thing1, thing2
    import Base.show
    importall OtherLib

    # what to export from this module
    export MyType, foo

    # type defs
    struct MyType
        x
    end

    # methods
    bar(x) = 2x
    foo(a::MyType) = bar(a.x) + 1

    show(io::IO, a::MyType) = print(io, "MyType $(a.x)")
end

Modules and files

  • you can easily have more files inside a module to organize your code.
  • For example, you could include other files like this
module Foo

include("file1.jl")
include("file2.jl")

end

Working with Modules

  • Look at the example at the manual!
  • Location of Modules: Julia stores packages in a hidden folder ~/.julia/v0.6 (system-dependent)
  • You can develop your own modules in a different location if you want.
  • Julia reads the file ~/.juliarc.jl on each startup. Modify the LOAD_PATH variable:
# add this to ~/.juliarc.jl
push!(LOAD_PATH, "/Path/To/My/Module/")

Unit Testing and Code Quality

What is Unit Testing? Why should you test your code?

Even with the best validation, it’s very hard to achieve perfect quality in software. Here are some typical residual defect rates (bugs left over after the software has shipped) per kloc (one thousand lines of source code):

  • 1 - 10 defects/kloc: Typical industry software.
  • 0.1 - 1 defects/kloc: High-quality validation. The Java libraries might achieve this level of correctness.
  • 0.01 - 0.1 defects/kloc: The very best, safety-critical validation. NASA and companies like Praxis can achieve this level. This can be discouraging for large systems. For example, if you have shipped a million lines of typical industry source code (1 defect/kloc), it means you missed 1000 bugs!

Unit Testing in Science

  • One widely-used way to prevent your code from having too many errors is to test it continuously.
  • This issue is widely neglected in Economics as well as other sciences.
    • If the resulting graph looks right, the code should be alright, shouldn't it?
    • Well, should it?
  • It is regrettable that so little effort is put into verifying the proper functioning of scientific code.
  • Referees in general don't have access to the computing code for a paper that is submitted to a journal for publication.
  • How should they be able to tell whether what they see in black and white on paper is the result of the actual computation that was proposed, rather than the result of chance (a.k.a. a bug)?
    • Increasingly papers do post the source code after publication.
    • The scientific method is based on the principle of reproducibility of results.
      • Notice that having something reproducible is only a first step: buggy code will reproduce the same nice (but wrong) graph every time.
      • But from where we are right now, it's an important first step.
    • This is an issue that is detrimental to credibility of Economics, and Science, as a whole.
  • Extensively testing your code will guard you against this.

Best Practice

  • You want to be in maximum control over your code at all times:
    • You want to be as sure as possible that a certain piece of code is doing what it is actually meant to do.
    • This sounds trivial (and it is), yet very few people engage in unit testing.
  • Things are slowly changing. See http://www.runmycode.org for example.
  • You are the generation that is going to change this. Do it.
  • Let's look at some real world Examples.

Ariane 5 blows up because of a bug

It took the European Space Agency 10 years and $7 billion to produce Ariane 5, a giant rocket capable of hurling a pair of three-ton satellites into orbit with each launch and intended to give Europe overwhelming supremacy in the commercial space business. All it took to explode that rocket less than a minute into its maiden voyage last June, scattering fiery rubble across the mangrove swamps of French Guiana, was a small computer program trying to stuff a 64-bit number into a 16-bit space. This shutdown occurred 36.7 seconds after launch, when the guidance system's own computer tried to convert one piece of data -- the sideways velocity of the rocket -- from a 64-bit format to a 16-bit format. The number was too big, and an overflow error resulted. When the guidance system shut down, it passed control to an identical, redundant unit, which was there to provide backup in case of just such a failure. But the second unit had failed in the identical manner a few milliseconds before. And why not? It was running the same software.

NASA Mars Orbiter crashes because of a bug

For nine months, the Mars Climate Orbiter was speeding through space and speaking to NASA in metric. But the engineers on the ground were replying in non-metric English. It was a mathematical mismatch that was not caught until after the $125-million spacecraft, a key part of NASA's Mars exploration program, was sent crashing too low and too fast into the Martian atmosphere. The craft has not been heard from since. Noel Henners of Lockheed Martin Astronautics, the prime contractor for the Mars craft, said at a news conference it was up to his company's engineers to assure the metric systems used in one computer program were compatible with the English system used in another program. The simple conversion check was not done, he said.

LA Airport Air Traffic Control shuts down because of a bug

(IEEE Spectrum) -- It was an air traffic controller's worst nightmare. Without warning, on Tuesday, 14 September, at about 5 p.m. Pacific daylight time, air traffic controllers lost voice contact with 400 airplanes they were tracking over the southwestern United States. Planes started to head toward one another, something that occurs routinely under careful control of the air traffic controllers, who keep airplanes safely apart. But now the controllers had no way to redirect the planes' courses. The controllers lost contact with the planes when the main voice communications system (VCS) shut down unexpectedly. To make matters worse, a backup system that was supposed to take over in such an event crashed within a minute after it was turned on. The outage disrupted about 800 flights across the country. Inside the control system unit (VCSU) is a countdown timer that ticks off time in milliseconds. The VCSU uses the timer as a pulse to send out periodic queries to the VSCS. It starts out at the highest possible number that the system's server and its software can handle — 2^32. It's a number just over 4 billion milliseconds. When the counter reaches zero, the system runs out of ticks and can no longer time itself. So it shuts down. Counting down from 2^32 to zero in milliseconds takes just under 50 days. The FAA procedure of having a technician reboot the VSCS every 30 days resets the timer to 2^32 almost three weeks before it runs out of digits.

Automated Testing

  • You should try to minimize the effort of writing tests.
  • Using an automated test suite is very helpful here.
  • In Julia, the standard library provides the Base.Test module
  • Julia unit testing is described here
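Related tests can be grouped with @testset, which reports a summary and keeps running after a failure. A minimal sketch (on julia v0.6 the module is Base.Test; from v1.0 on it is just Test):

```julia
using Test   # `using Base.Test` on julia v0.6

# a named group of tests; the values tested here are arbitrary examples
@testset "basic arithmetic" begin
    @test 1 + 1 == 2
    @test isapprox(pi, 3.14159, atol = 1e-4)
    @test sqrt(4.0) == 2.0
end
```

If any @test inside the set fails, the summary tells you which one, instead of aborting at the first failure as a bare @test does.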

Automated Testing on Travis

  • https://travis-ci.org is a continuous integration service.
  • It runs your test on their machines and notifies you of the result.
  • Every time you push a commit to github.
  • If the repository is public on github, the service is for free.
  • Many julia packages are tested on Travis.
  • You should look out for this green badge: Example
  • You can run the tests for a package with Pkg.test("Package_name")
  • You can run the tests for julia itself with Base.runtests()
In [6]:
# let's do some simple testing
using Base.Test

@test 1==1
@test pi ≈ 3.14159 atol=1e-4
Out[6]:
Test Passed
In [7]:
@test 2>3
Test Failed
  Expression: 2 > 3
   Evaluated: 2 > 3
There was an error during testing

Stacktrace:
 [1] record(::Base.Test.FallbackTestSet, ::Base.Test.Fail) at ./test.jl:533
 [2] do_test(::Base.Test.Returned, ::Expr) at ./test.jl:352
 [3] include_string(::String, ::String) at ./loading.jl:515

Debugging Julia

There are at least 2 ways to chase down a bug.

  1. Use logging facilities.
    • poor man's logger: write println statements at various points in your code.
    • better: use Base.logging
      • this will really become useful in v1.0, when you can choose the logging level in a better way.
  2. Use the debugger:
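A sketch of the poor man's logger: temporary println statements that trace values through a function. The function and values here are made up for illustration only.

```julia
# hypothetical function we are chasing a bug in:
# print intermediate values to see where things go wrong
function solve_step(x)
    println("solve_step: got x = ", x)        # temporary trace output
    y = 2x + 1
    println("solve_step: returning y = ", y)  # remove once the bug is found
    return y
end

solve_step(3)   # prints the two trace lines and returns 7
```

The obvious drawback is that you have to remember to remove these statements again, which is exactly why a real logging facility with configurable levels is preferable.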

Julia and Data

What's special about Data?

  • There are several issues when working with data:
    • A typical dataset might be delivered to you in tabular form, for example as a comma-separated file: essentially a spreadsheet.
    • R, julia and python share the concept of a DataFrame. A tabular dataset with column names.
    • that means in particular that each column could have a different datatype.
    • For a language that optimizes on efficiently computing with different datatypes, that is a challenge.
    • Importantly: data can be missing, i.e. for several reasons there is a record that was not, well, recorded.
    • Julia has made a lot of progress here. We now have the Missing data type, provided in Missings.jl
In [5]:
using DataFrames  # DataFrames re-exports Missings.jl
m = missings(Float64,3)  # you choose a datatype, and the dims for an array
Out[5]:
3-element Array{Union{Missing, Float64},1}:
 missing
 missing
 missing
In [6]:
m[1:2] = ones(2)
sum(m)  # => missing. because Float64 + missing = missing
prod(m)  # => missing. because Float64 * missing = missing
sum(skipmissing(m)) # => 2
Out[6]:
2.0
In [7]:
# you can replace missing values
println(typeof(Missings.replace(m,3)))
collect(Missings.replace(m,3))
Missings.EachReplaceMissing{Array{Union{Missing, Float64},1},Float64}
Out[7]:
3-element Array{Float64,1}:
 1.0
 1.0
 3.0

DataFrames

  • A dataframe is a tabular dataset: a spreadsheet
  • columns can be of different data type. very convenient, very hard to optimize.
In [8]:
df = DataFrame(nums = rand(3),words=["little","brown","dog"])
Out[8]:

3 rows × 2 columns

│ Row │ nums     │ words  │
│     │ Float64  │ String │
├─────┼──────────┼────────┤
│ 1   │ 0.677634 │ little │
│ 2   │ 0.306855 │ brown  │
│ 3   │ 0.427814 │ dog    │
In [9]:
# there is a lot of functionality. please consult the manual.
df[:nums]
df[:words][1]  # "little"
df[:nums][2:3] = [1,1]
df
┌ Warning: `getindex(df::DataFrame, col_ind::ColumnIndex)` is deprecated, use `df[!, col_ind]` instead.
│   caller = top-level scope at In[9]:1
└ @ Core In[9]:1
┌ Warning: `getindex(df::DataFrame, col_ind::ColumnIndex)` is deprecated, use `df[!, col_ind]` instead.
│   caller = top-level scope at In[9]:3
└ @ Core In[9]:3
┌ Warning: `getindex(df::DataFrame, col_ind::ColumnIndex)` is deprecated, use `df[!, col_ind]` instead.
│   caller = top-level scope at In[9]:4
└ @ Core In[9]:4
Out[9]:

3 rows × 2 columns

│ Row │ nums     │ words  │
│     │ Float64  │ String │
├─────┼──────────┼────────┤
│ 1   │ 0.677634 │ little │
│ 2   │ 1.0      │ brown  │
│ 3   │ 1.0      │ dog    │
In [ ]:
using RDatasets   # popular Datasets from R
iris = dataset("datasets", "iris")
head(iris)  # get the first 6 rows
In [2]:
println(iris[:SepalLength][1:6])  # get a column

show(iris[2,:])   # get a row

describe(iris);   # get a description
[5.1, 4.9, 4.7, 4.6, 5.0, 5.4]
1×5 DataFrames.DataFrame
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
├─────┼─────────────┼────────────┼─────────────┼────────────┼─────────┤
│ 1   │ 4.9         │ 3.0        │ 1.4         │ 0.2        │ setosa  │

SepalLength
Summary Stats:
Mean:           5.843333
Minimum:        4.300000
1st Quartile:   5.100000
Median:         5.800000
3rd Quartile:   6.400000
Maximum:        7.900000
Length:         150
Type:           Float64

SepalWidth
Summary Stats:
Mean:           3.057333
Minimum:        2.000000
1st Quartile:   2.800000
Median:         3.000000
3rd Quartile:   3.300000
Maximum:        4.400000
Length:         150
Type:           Float64

PetalLength
Summary Stats:
Mean:           3.758000
Minimum:        1.000000
1st Quartile:   1.600000
Median:         4.350000
3rd Quartile:   5.100000
Maximum:        6.900000
Length:         150
Type:           Float64

PetalWidth
Summary Stats:
Mean:           1.199333
Minimum:        0.100000
1st Quartile:   0.300000
Median:         1.300000
3rd Quartile:   1.800000
Maximum:        2.500000
Length:         150
Type:           Float64

Species
Summary Stats:
Length:         150
Type:           CategoricalArrays.CategoricalString{UInt8}
Number Unique:  3

Working with DataFrames

  • RTFM :-)
  • We can sort, join (i.e. merge), split-apply-combine and reshape dataframes. We can do most things one can do in base R with a data.frame.
In [15]:
d = DataFrame(A = rand(10),B=rand([1,2],10),C=["word $i" for i in 1:10])
by(d,:B,x->mean(x[:A]))
Out[15]:
│ Row │ B │ x1       │
├─────┼───┼──────────┤
│ 1   │ 1 │ 0.605719 │
│ 2   │ 2 │ 0.382219 │
In [16]:
# subsetting a dataframe can become cumbersome
d = DataFrame(A = rand(10),B=rand([1,2],10),C=["word $i" for i in 1:10])
d[(d[:A].>0.1) .& (d[:B].==2),:]  # approach 1
using DataFramesMeta
@where(d,(:A .>0.1) .& (:B.==2))  # approach 2
Out[16]:
│ Row │ A        │ B │ C      │
├─────┼──────────┼───┼────────┤
│ 1   │ 0.134315 │ 2 │ word 2 │
│ 2   │ 0.929985 │ 2 │ word 3 │
│ 3   │ 0.84077  │ 2 │ word 5 │
│ 4   │ 0.831327 │ 2 │ word 8 │
│ 5   │ 0.607283 │ 2 │ word 9 │


DataFramesMeta.jl

  • DataFramesMeta.jl makes this much easier.
  • It makes heavy use of macros.
  • The same thing from before is now
using DataFramesMeta
@where(d,(:A .>0.1) .& (:B.==2))
  • We can access column names as symbols inside an expression.
In [17]:
# the previous operation would become
@by(d,:B,m=mean(:A))
Out[17]:
│ Row │ B │ m        │
├─────┼───┼──────────┤
│ 1   │ 1 │ 0.418524 │
│ 2   │ 2 │ 0.668736 │

Chaining operations

  • Very often we have to do a chain of operations on a set of data.
  • Readability is a big concern here.
  • Here we can use the @linq macro together with a pipe operator.
  • This is inspired by LINQ (language-integrated query) from Microsoft .NET
In [18]:
df = DataFrame(a = 1:20,b = rand([2,5,10],20), x = rand(20))
x_thread = @linq df |>
    transform(y = 10 * :x) |>
    where(:a .> 2) |>
    by(:b, meanX = mean(:x), meanY = mean(:y)) |>
    orderby(:meanX) |>
    select(meanx=:meanX,meany= :meanY, var = :b)

# or
# using Lazy
# x_thread = @> begin
#     df
#     @transform(y = 10 * :x)
#     @where(:a .> 2)
#     @by(:b, meanX = mean(:x), meanY = mean(:y))
#     @orderby(:meanX)
#     @select(:meanX, :meanY, var = :b)
# end
WARNING: calling vector of vectors constructor without passing column names is deprecated
Stacktrace:
 [1] depwarn(::String, ::Symbol) at ./deprecated.jl:70
 [2] Type at /Users/florian.oswald/.julia/v0.6/DataFrames/src/deprecated.jl:4 [inlined]
 [3] #35 at /Users/florian.oswald/.julia/v0.6/DataFramesMeta/src/DataFramesMeta.jl:316 [inlined]
 [4] ##696 at /Users/florian.oswald/.julia/v0.6/DataFramesMeta/src/DataFramesMeta.jl:70 [inlined]
 [5] (::##21#29{DataFrames.DataFrame})(::DataFrames.DataFrame) at /Users/florian.oswald/.julia/v0.6/DataFramesMeta/src/DataFramesMeta.jl:72
 [6] orderby(::DataFrames.DataFrame, ::##21#29{DataFrames.DataFrame}) at /Users/florian.oswald/.julia/v0.6/DataFramesMeta/src/DataFramesMeta.jl:313
 [7] include_string(::String, ::String) at ./loading.jl:515
 [8] include_string(::Module, ::String, ::String) at /Users/florian.oswald/.julia/v0.6/Compat/src/Compat.jl:174
 [9] execute_request(::ZMQ.Socket, ::IJulia.Msg) at /Users/florian.oswald/.julia/v0.6/IJulia/src/execute_request.jl:154
 [10] (::Compat.#inner#17{Array{Any,1},IJulia.#execute_request,Tuple{ZMQ.Socket,IJulia.Msg}})() at /Users/florian.oswald/.julia/v0.6/Compat/src/Compat.jl:488
 [11] eventloop(::ZMQ.Socket) at /Users/florian.oswald/.julia/v0.6/IJulia/src/eventloop.jl:8
 [12] (::IJulia.##14#17)() at ./task.jl:335
while loading In[18], in expression starting on line 323
Out[18]:
│ Row │ meanx    │ meany   │ var │
├─────┼──────────┼─────────┼─────┤
│ 1   │ 0.400537 │ 4.00537 │ 5   │
│ 2   │ 0.516743 │ 5.16743 │ 10  │
│ 3   │ 0.70122  │ 7.0122  │ 2   │

an alternative

Query.jl

  • Much in the same spirit. But can query almost any data source, not only dataframes.
    • Data Sources: DataFrames, Dicts, Arrays, TypedTables, DataStreams,...
    • Data Sinks: DataFrames, Dicts, Csv
  • It is not as convenient to summarise data, however.
  • It is one of the best documented packages I know.
  • In general, a query looks like this:
    q = @from <range variable> in <source> begin
      <query statements>
    end
    
  • Here is an example:
In [21]:
using Query

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])

x = @from i in df begin
    @where i.age>50
    @select {i.name, i.children}
    @collect DataFrame
end
println(df)
println(x)
3×3 DataFrames.DataFrame
│ Row │ name  │ age  │ children │
├─────┼───────┼──────┼──────────┤
│ 1   │ John  │ 23.0 │ 3        │
│ 2   │ Sally │ 42.0 │ 5        │
│ 3   │ Kirk  │ 59.0 │ 2        │
1×2 DataFrames.DataFrame
│ Row │ name │ children │
├─────┼──────┼──────────┤
│ 1   │ Kirk │ 2        │

you can do several things in a query:

  • sort
  • filter
  • project
  • flatten
  • join
  • split-apply-combine (dplyr)
In [22]:
df = DataFrame(name=repeat(["John", "Sally", "Kirk"],inner=[1],outer=[2]), 
     age=vcat([10., 20., 30.],[10., 20., 30.].+3), 
     children=repeat([3,2,2],inner=[1],outer=[2]),state=[:a,:a,:a,:b,:b,:b])

x = @from i in df begin
    @group i by i.state into g
    @select {group=g.key,mage=mean(g..age), oldest=maximum(g..age), youngest=minimum(g..age)}
    @collect DataFrame
end

println(x)
2×4 DataFrames.DataFrame
│ Row │ group │ mage │ oldest │ youngest │
├─────┼───────┼──────┼────────┼──────────┤
│ 1   │ a     │ 20.0 │ 30.0   │ 10.0     │
│ 2   │ b     │ 23.0 │ 33.0   │ 13.0     │