This lecture is a slightly modified version of https://lectures.quantecon.org/jl/need_for_speed.html. Thank you to the amazing QuantEcon.org team!
Computer scientists often classify programming languages according to the following two categories:
High level languages aim to maximize productivity by being easy to read, write and debug, and by automating standard tasks such as memory management.
Low level languages aim for speed and control, which they achieve by staying close to the machine and requiring detailed information from the programmer, such as explicit data types.
Traditionally we understand this as a trade-off.
One of the great strengths of Julia is that it pushes out the curve, achieving both high productivity and high performance with relatively little fuss.
The word “relatively” is important here, however…
In simple programs, excellent performance is often trivial to achieve
For longer, more sophisticated programs, you need to be aware of potential stumbling blocks.
This lecture covers the key points: the main potential pitfalls, and how to write Julia code that the compiler can turn into fast machine code
You should read our earlier lecture on types, methods and multiple dispatch before this one
using InstantiateFromURL
# activate the QuantEcon environment
activate_github("QuantEcon/QuantEconLecturePackages", tag = "v0.9.5");
using LinearAlgebra, Statistics, Compat
This section provides more background on how methods, functions, and types are connected
In Julia, a function can have several different implementations, each specialized to particular argument types; these individual specialized versions are called methods
When an operation like addition is requested, the Julia compiler inspects the type of data to be acted on and hands it off to the appropriate method
This process is called multiple dispatch
Like all “infix” operators, 1 + 1 has the alternative syntax +(1, 1)
+(1, 1)
2
This operator + is itself a function with multiple methods
We can investigate them using the @which macro, which shows the method to which a given call is dispatched
x, y = 1.0, 1.0
@which +(x, y)
We see that the operation is sent to the + method that specializes in adding floating point numbers
Here’s the integer case
x, y = 1, 1
@which +(x, y)
This output says that the call has been dispatched to the + method responsible for handling integer values
(We’ll learn more about the details of this syntax below)
Here’s another example, with complex numbers
x, y = 1.0 + 1.0im, 1.0 + 1.0im
@which +(x, y)
Again, the call has been dispatched to a + method specifically designed for handling the given data type
It’s straightforward to add methods to existing functions.
For example, we can’t at present add an integer and a string in Julia (i.e. evaluating 100 + "100" throws a MethodError, since no method for this combination exists).
This is sensible behavior, but if you want to change it there’s nothing to stop you!
import Base: + # enables adding methods to the + function
+(x::Integer, y::String) = x + parse(Int, y)
@show +(100, "100")
@show 100 + "100"; # equivalent
100 + "100" = 200
100 + "100" = 200
We can now be a little bit clearer about what happens when you call a function on given types
Suppose we execute the function call f(a, b) where a and b are of concrete types S and T respectively
The Julia interpreter first queries the types of a and b to obtain the tuple (S, T)
It then parses the list of methods belonging to f, searching for a match
If it finds a method matching (S, T) it calls that method
If not, it looks to see whether the pair (S, T) matches any method defined for immediate parent types
For example, if S is Float64 and T is ComplexF32, then the immediate parents are AbstractFloat and Number respectively
@show supertype(Float64)
@show supertype(ComplexF32)
supertype(Float64) = AbstractFloat
supertype(ComplexF32) = Number
Number
Hence the interpreter looks next for a method of the form f(x::AbstractFloat, y::Number)
If the interpreter can’t find a match in immediate parents (supertypes) it proceeds up the tree, looking at the parents of the last type it checked at each iteration
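To see the tree that dispatch climbs, we can walk it ourselves using supertype. The helper below is a small sketch of our own (show_supertypes is not a Base function):

function show_supertypes(T)    # hypothetical helper: print a type's chain of parents up to Any
    while T != Any
        print(T, " <: ")
        T = supertype(T)
    end
    println(Any)
end

show_supertypes(Float64)      # Float64 <: AbstractFloat <: Real <: Number <: Any
show_supertypes(ComplexF32)   # Complex{Float32} <: Number <: Any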
This is the process that leads to the following error (since we only added the + method for adding Integer and String above)
@show (typeof(100.0) <: Integer) == false # 100.0 is not an integer
100.0 + "100" # hence our "+" method from above will not work.
(typeof(100.0) <: Integer) == false = true
MethodError: no method matching +(::Float64, ::String) Closest candidates are: +(::Any, ::Any, !Matched::Any, !Matched::Any...) at operators.jl:502 +(::Float64, !Matched::Float64) at float.jl:395 +(::AbstractFloat, !Matched::Bool) at bool.jl:114 ... Stacktrace: [1] top-level scope at In[8]:2
Because the dispatch procedure starts from concrete types and works upwards, dispatch always invokes the most specific method available
For example, if you have methods for function f that handle (Float64, Int64) pairs and (Number, Number) pairs, and you call f with f(0.5, 1), then the first method will be invoked
This makes sense because (hopefully) the first method is optimized for exactly this kind of data
The second method is probably more of a “catch all” method that handles other data in a less optimal way
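As a quick illustration of this point, here is a hedged sketch using a made-up function h with exactly the two method signatures described above:

h(x::Float64, y::Int64) = "specialized (Float64, Int64) method"
h(x::Number, y::Number) = "catch-all (Number, Number) method"

@show h(0.5, 1)       # dispatches to the specialized (Float64, Int64) method
@show h(1, 2 + 0im)   # no exact match, so the (Number, Number) method is used
@which h(0.5, 1)      # confirms which method the call is dispatched to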
Here’s another simple example, involving a user-defined function
function q(x) # or q(x::Any)
println("Default (Any) method invoked")
end
function q(x::Number)
println("Number method invoked")
end
function q(x::Integer)
println("Integer method invoked")
end
q (generic function with 3 methods)
Let’s now run this and see how it relates to our discussion of method dispatch above
q(3)
Integer method invoked
q(3.0)
Number method invoked
q("foo")
Default (Any) method invoked
Since typeof(3) <: Int64 <: Integer <: Number, the call q(3) proceeds up the tree to Integer and invokes q(x::Integer)
On the other hand, 3.0 is a Float64, which is not a subtype of Integer
Hence the call q(3.0) continues up to q(x::Number)
Finally, q("foo") is handled by the function operating on Any, since String is not a subtype of Number or Integer
For the most part, time spent “optimizing” Julia code to run faster is about ensuring the compiler can correctly deduce types for all functions
The macro @code_warntype gives us a hint
x = [1, 2, 3]
f(x) = 2x
@code_warntype f(x)
Body::Array{Int64,1} 1 ─ %1 = invoke Base.broadcast(Base.:*::typeof(*), 2::Int64, _2::Array{Int64,1})::Array{Int64,1} └── return %1
The @code_warntype macro compiles f(x) using the type of x as an example – i.e., the [1, 2, 3] is used as a prototype for analyzing the compilation, rather than simply calculating the value
Here, the Body::Array{Int64,1} tells us that the type of the return value of the function, when called with types like [1, 2, 3], is always a vector of integers
In contrast, consider a function potentially returning nothing, as in this lecture
f(x) = x > 0.0 ? x : nothing
@code_warntype f(1)
Body::Union{Nothing, Int64} 1 ─ %1 = (Base.sitofp)(Float64, x)::Float64 │ %2 = (Base.lt_float)(0.0, %1)::Bool │ %3 = (Base.eq_float)(0.0, %1)::Bool │ %4 = (Base.lt_float)(%1, 9.223372036854776e18)::Bool │ %5 = (Base.and_int)(%3, %4)::Bool │ %6 = (Base.fptosi)(Int64, %1)::Int64 │ %7 = (Base.slt_int)(%6, x)::Bool │ %8 = (Base.and_int)(%5, %7)::Bool │ %9 = (Base.or_int)(%2, %8)::Bool └── goto #3 if not %9 2 ─ return x 3 ─ return Main.nothing
This states that, when the function is called with an integer (like 1), the compiler determines that the return type could be one of two different types, Body::Union{Nothing, Int64}
A final example is a variation on the above, which returns the maximum of x and 0
f(x) = x > 0.0 ? x : 0.0
@code_warntype f(1)
Body::Union{Float64, Int64} 1 ─ %1 = (Base.sitofp)(Float64, x)::Float64 │ %2 = (Base.lt_float)(0.0, %1)::Bool │ %3 = (Base.eq_float)(0.0, %1)::Bool │ %4 = (Base.lt_float)(%1, 9.223372036854776e18)::Bool │ %5 = (Base.and_int)(%3, %4)::Bool │ %6 = (Base.fptosi)(Int64, %1)::Int64 │ %7 = (Base.slt_int)(%6, x)::Bool │ %8 = (Base.and_int)(%5, %7)::Bool │ %9 = (Base.or_int)(%2, %8)::Bool └── goto #3 if not %9 2 ─ return x 3 ─ return 0.0
This shows that, when called with an integer, the return type could be that integer or the floating point 0.0
On the other hand, if we change the function to return 0 if x <= 0, it is type-unstable when called with a floating point argument
f(x) = x > 0.0 ? x : 0
@code_warntype f(1.0)
Body::Union{Float64, Int64} 1 ─ %1 = (Base.lt_float)(0.0, x)::Bool └── goto #3 if not %1 2 ─ return x 3 ─ return 0
The solution is to use the zero(x) function, which returns the additive identity element for the type of x
@show zero(2.3)
@show zero(4)
@show zero(2.0 + 3im)
f(x) = x > 0.0 ? x : zero(x)
@code_warntype f(1.0)
zero(2.3) = 0.0
zero(4) = 0
zero(2.0 + 3im) = 0.0 + 0.0im
Body::Float64 1 ─ %1 = (Base.lt_float)(0.0, x)::Bool └── goto #3 if not %1 2 ─ return x 3 ─ return 0.0
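As an aside, an equivalent type-stable way to write the "maximum of x and 0" function is to combine max with zero; the name f_alt below is our own suggestion rather than part of the original lecture:

f_alt(x) = max(x, zero(x))   # same values as x > 0 ? x : 0, but type-stable
@code_warntype f_alt(1.0)    # should report Body::Float64
@code_warntype f_alt(1)      # should report Body::Int64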
Let’s think about how quickly code runs, taking as given the hardware we run it on and the algorithm being implemented
We’ll start by discussing the kinds of instructions that machines understand. All programs are ultimately executed as machine code; for example, on one particular x86 machine, the instructions for computing something like 2a + 8b look as follows
pushq %rbp
movq %rsp, %rbp
addq %rdi, %rdi
leaq (%rdi,%rsi,8), %rax
popq %rbp
retq
nopl (%rax)
Note that this code is specific to one particular piece of hardware that we use — different machines require different machine code
If you ever feel tempted to start rewriting your economic model in assembly, please restrain yourself
It’s far more sensible to give these instructions in a language like Julia, where they can be easily written and understood
function f(a, b)
y = 2a + 8b
return y
end
f (generic function with 2 methods)
or Python
def f(a, b):
y = 2 * a + 8 * b
return y
or even C
int f(int a, int b) {
int y = 2 * a + 8 * b;
return y;
}
In any of these languages we end up with code that is much easier for humans to write, read, share and debug
We leave it up to the machine itself to turn our code into machine code
How exactly does this happen?
The process for turning high level code into machine code differs across languages
Let’s look at some of the options and how they differ from one another
Traditional compiled languages like Fortran, C and C++ are a reasonable option for writing fast code.
Indeed, the standard benchmark for performance is still well-written C or Fortran.
These languages compile down to efficient machine code because users are forced to provide a lot of detail on data types and how the code will execute.
The compiler therefore has ample information for building the corresponding machine code ahead of time (AOT) in a way that organizes the data optimally in memory and implements efficient operations for the task at hand.
At the same time, the syntax and semantics of C and Fortran are verbose and unwieldy when compared to something like Julia.
Moreover, these low level languages lack the interactivity that’s so crucial for scientific work.
Interpreted languages like Python generate machine code “on the fly”, during program execution
This allows them to be flexible and interactive
Moreover, programmers can leave many tedious details to the runtime environment, such as specifying variable types and allocating and freeing memory
But all this convenience and flexibility comes at a cost: it’s hard to turn instructions written in these languages into efficient machine code
For example, consider what happens when Python adds a long list of numbers together
Typically the runtime environment has to check the type of these objects one by one before it figures out how to add them
This involves substantial overheads.
There are also significant overheads associated with accessing the data values themselves, which might not be stored contiguously in memory
The resulting machine code is often complex and slow.
Just-in-time (JIT) compilation is an alternative approach that marries some of the advantages of AOT compilation and interpreted languages
The basic idea is that functions for specific tasks are compiled as requested
As long as the compiler has enough information about what the function does, it can in principle generate efficient machine code
In some instances, all the information is supplied by the programmer
In other cases, the compiler will attempt to infer missing information on the fly based on usage
Through this approach, computing environments built around JIT compilers aim to combine the flexibility and interactivity of high level languages with the execution speed of compiled code
JIT compilation is the approach used by Julia
In an ideal setting, all information necessary to generate efficient native machine code is supplied or inferred
In such a setting, Julia will be on par with machine code from low level languages
Consider the function
function f(a, b)
y = (a + 8b)^2
return 7y
end
f (generic function with 2 methods)
Suppose we call f with integer arguments (e.g., z = f(1, 2))
The JIT compiler now knows the types of a and b
Moreover, it can infer types for other variables inside the function; e.g., y will also be an integer
It then compiles a specialized version of the function to handle integers and stores it in memory
We can view the corresponding machine code using the @code_native macro
@code_native f(1, 2)
.section __TEXT,__text,regular,pure_instructions ; ┌ @ In[19]:2 within `f' ; │┌ @ In[19]:2 within `+' decl %eax leal (%edi,%esi,8), %ecx ; │└ ; │┌ @ intfuncs.jl:243 within `literal_pow' ; ││┌ @ int.jl:54 within `*' decl %eax imull %ecx, %ecx ; │└└ ; │ @ In[19]:3 within `f' ; │┌ @ int.jl:54 within `*' decl %eax leal (,%ecx,8), %eax decl %eax subl %ecx, %eax ; │└ retl nopw %cs:(%eax,%eax) ; └
If we now call f again, but this time with floating point arguments, the JIT compiler will once more infer types for the other variables inside the function; e.g., y will also be a float
It then compiles a new version to handle this type of argument
@code_native f(1.0, 2.0)
.section __TEXT,__text,regular,pure_instructions ; ┌ @ In[19]:2 within `f' decl %eax movl $558071048, %eax ## imm = 0x21437D08 addl %eax, (%eax) addb %al, (%eax) ; │┌ @ promotion.jl:314 within `*' @ float.jl:399 vmulsd (%eax), %xmm1, %xmm1 ; │└ ; │┌ @ float.jl:395 within `+' vaddsd %xmm0, %xmm1, %xmm0 ; │└ ; │┌ @ intfuncs.jl:243 within `literal_pow' ; ││┌ @ float.jl:399 within `*' vmulsd %xmm0, %xmm0, %xmm0 decl %eax movl $558071056, %eax ## imm = 0x21437D10 addl %eax, (%eax) addb %al, (%eax) ; │└└ ; │ @ In[19]:3 within `f' ; │┌ @ promotion.jl:314 within `*' @ float.jl:399 vmulsd (%eax), %xmm0, %xmm0 ; │└ retl nopw %cs:(%eax,%eax) ; └
Subsequent calls using either floats or integers are now routed to the appropriate compiled code
In some senses, what we saw above was a best case scenario
Sometimes the JIT compiler produces messy, slow machine code
This happens when type inference fails or the compiler has insufficient information to optimize effectively
The next section looks at situations where these problems arise and how to get around them
To summarize what we’ve learned so far, Julia provides a platform for generating highly efficient machine code with relatively little effort by combining JIT compilation, optional type declarations and type inference, and multiple dispatch
But the process is not flawless, and hiccups can occur!
The purpose of this section is to highlight potential issues and show you how to circumvent them.
The main Julia package for benchmarking is BenchmarkTools.jl
Below, we’ll use the @btime macro it exports to evaluate the performance of Julia code
As mentioned in an earlier lecture, we can also save benchmark results to a file and guard against performance regressions in code
For more, see the package docs
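For example, a minimal regression-guard workflow might look like the following sketch; the variable xs, the file name and the use of sum are our own illustrative choices:

using BenchmarkTools

xs = rand(1_000)
trial = @benchmark sum($xs)                      # run the benchmark and keep the full results
BenchmarkTools.save("sum_baseline.json", trial)  # store the results for later comparison
baseline = BenchmarkTools.load("sum_baseline.json")[1]
judge(minimum(trial), minimum(baseline))         # reports :invariant, :improvement or :regression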
Global variables are names assigned to values outside of any function or type definition
They are convenient and novice programmers typically use them with abandon
But global variables are also dangerous, especially in medium to large size programs, since they can be read and changed anywhere in the program, by any function
This makes it much harder to be certain about what some small part of a given piece of code actually does
Here’s a useful discussion on the topic
When it comes to JIT compilation, global variables create further problems
The reason is that the compiler can never be sure of the type of the global variable, or even that the type will stay constant while a given function runs
To illustrate, consider this code, where b is global
b = 1.0
function g(a)
global b
for i ∈ 1:1_000_000
tmp = a + b
end
end
g (generic function with 1 method)
The code executes relatively slowly and uses a huge amount of memory
using BenchmarkTools
@btime g(1.0)
25.444 ms (2000000 allocations: 30.52 MiB)
If you look at the corresponding machine code you will see that it’s a mess
@code_native g(1.0)
.section __TEXT,__text,regular,pure_instructions ; ┌ @ In[22]:3 within `g' pushl %ebp decl %eax movl %esp, %ebp incl %ecx pushl %edi incl %ecx pushl %esi incl %ecx pushl %ebp incl %ecx pushl %esp pushl %ebx decl %eax andl $-32, %esp decl %eax subl $128, %esp vmovsd %xmm0, 24(%esp) decl %eax movl $10544800, %eax ## imm = 0xA0E6A0 addl %eax, (%eax) addb %al, (%eax) vxorps %xmm0, %xmm0, %xmm0 vmovaps %ymm0, 32(%esp) vzeroupper calll *%eax decl %ecx movl %eax, %esp decl %eax movl $4, 32(%esp) decl %ecx movl (%esp), %eax decl %eax movl %eax, 40(%esp) decl %eax leal 32(%esp), %eax decl %ecx movl %eax, (%esp) movl $1000000, %ebx ## imm = 0xF4240 decl %ecx movl $10296272, %edi ## imm = 0x9D1BD0 addl %eax, (%eax) addb %al, (%eax) decl %esp leal 80(%esp), %esi nopl (%eax) ; │ @ In[22]:5 within `g' L112: decl %eax movl $10544800, %eax ## imm = 0xA0E6A0 addl %eax, (%eax) addb %al, (%eax) decl %esp movl 207488824(%eax), %ebp decl %esp movl %ebp, 48(%esp) movl $1616, %esi ## imm = 0x650 movl $16, %edx decl %esp movl %esp, %edi decl %eax movl $10556384, %eax ## imm = 0xA113E0 addl %eax, (%eax) addb %al, (%eax) calll *%eax decl %eax movl $75664016, %ecx ## imm = 0x4828A90 addl %eax, (%eax) addb %al, (%eax) decl %eax movl %ecx, -8(%eax) vmovsd 24(%esp), %xmm0 ## xmm0 = mem[0],zero vmovsd %xmm0, (%eax) decl %eax movl %eax, 56(%esp) decl %eax movl $99037904, %ecx ## imm = 0x5E732D0 addl %eax, (%eax) addb %al, (%eax) decl %eax movl %ecx, 80(%esp) decl %eax movl %eax, 88(%esp) decl %esp movl %ebp, 96(%esp) movl $3, %esi decl %esp movl %esi, %edi incl %ecx calll *%edi ; │┌ @ range.jl:594 within `iterate' ; ││┌ @ promotion.jl:403 within `==' decl %eax addl $-1, %ebx ; │└└ jne L112 decl %eax movl 40(%esp), %eax decl %ecx movl %eax, (%esp) decl %eax leal -40(%ebp), %esp popl %ebx incl %ecx popl %esp incl %ecx popl %ebp incl %ecx popl %esi incl %ecx popl %edi popl %ebp retl nop ; └
If we eliminate the global variable like so
function g(a, b)
for i ∈ 1:1_000_000
tmp = a + b
end
end
g (generic function with 2 methods)
then execution speed improves dramatically
@btime g(1.0, 1.0)
1.642 ns (0 allocations: 0 bytes)
Note that the first call to a function is dramatically slower than subsequent calls, since it includes the time for JIT compilation; the @btime macro runs the function many times, so the figure above reflects post-compilation performance
Notice also how small the memory footprint of the execution is
Also, the machine code is simple and clean
@code_native g(1.0, 1.0)
.section __TEXT,__text,regular,pure_instructions ; ┌ @ In[25]:2 within `g' retl nopw %cs:(%eax,%eax) ; └
Now the compiler is certain of types throughout execution of the function and hence can optimize accordingly
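A quick way to confirm this, using the @code_warntype macro introduced above, is:

@code_warntype g(1.0, 1.0)   # should report Body::Nothing, with every variable concretely inferred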
The const keyword
Another way to stabilize the code above is to maintain the global variable but prepend its definition with const
const b_const = 1.0
function g(a)
global b_const
for i ∈ 1:1_000_000
tmp = a + b_const
end
end
g (generic function with 2 methods)
Now the compiler can again generate efficient machine code
We’ll leave you to experiment with it
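If you do experiment, a timing check along the lines of the earlier ones might look like this:

@btime g(1.0)   # with the const global, this should again be fast and allocation-free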
Another scenario that trips up the JIT compiler is when composite types have fields with abstract types
We met this issue earlier, when we discussed AR(1) models
Let’s experiment, using, respectively, an untyped field, an abstractly typed field, and a concretely (parametrically) typed field
As we’ll see, the last of these options gives us the best performance, while still maintaining significant flexibility
Here’s the untyped case
struct Foo_generic
a
end
Here’s the case of an abstract type on the field a
struct Foo_abstract
a::Real
end
Finally, here’s the parametrically typed case
struct Foo_concrete{T <: Real}
a::T
end
Now we generate instances
fg = Foo_generic(1.0)
fa = Foo_abstract(1.0)
fc = Foo_concrete(1.0)
Foo_concrete{Float64}(1.0)
In the last case, concrete type information for the fields is embedded in the object
typeof(fc)
Foo_concrete{Float64}
This is significant because such information is detected by the compiler
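One way to see what the compiler knows in each case is Base's fieldtype function; the comparison below is our own illustration:

@show fieldtype(typeof(fg), :a)   # Any: the compiler has no information about the field
@show fieldtype(typeof(fa), :a)   # Real: only an abstract type is known
@show fieldtype(typeof(fc), :a)   # Float64: a concrete type the compiler can exploit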
Here’s a function that uses the field a of our objects
function f(foo)
for i ∈ 1:1_000_000
tmp = i + foo.a
end
end
f (generic function with 2 methods)
Let’s try timing our code, starting with the generic case:
@btime f($fg)
30.384 ms (1999489 allocations: 30.51 MiB)
The timing is not very impressive
Here’s the nasty looking machine code
@code_native f(fg)
.section __TEXT,__text,regular,pure_instructions ; ┌ @ In[34]:2 within `f' pushl %ebp incl %ecx pushl %edi incl %ecx pushl %esi incl %ecx pushl %ebp incl %ecx pushl %esp pushl %ebx decl %eax subl $72, %esp vxorps %xmm0, %xmm0, %xmm0 vmovaps %xmm0, (%esp) decl %eax movl %esi, %ebx decl %eax movl $0, 16(%esp) decl %eax movl %ebx, 64(%esp) decl %eax movl $10544800, %eax ## imm = 0xA0E6A0 addl %eax, (%eax) addb %al, (%eax) calll *%eax decl %eax movl $2, (%esp) decl %eax movl (%eax), %ecx decl %eax movl %ecx, 8(%esp) decl %eax movl %esp, %ecx decl %eax movl %eax, 32(%esp) decl %eax movl %ecx, (%eax) decl %esp movl (%ebx), %esp movl $1, %ebx decl %eax movl $99037904, %ebp ## imm = 0x5E732D0 addl %eax, (%eax) addb %al, (%eax) decl %ecx movl $10296272, %esi ## imm = 0x9D1BD0 addl %eax, (%eax) addb %al, (%eax) decl %esp leal 40(%esp), %edi ; │ @ In[34]:3 within `f' ; │┌ @ sysimg.jl:18 within `getproperty' L112: decl %ebp movl (%esp), %ebp ; │└ decl %eax movl %ebx, %edi decl %eax movl $10485968, %eax ## imm = 0xA000D0 addl %eax, (%eax) addb %al, (%eax) calll *%eax decl %eax movl %eax, 16(%esp) decl %eax movl %ebp, 40(%esp) decl %eax movl %eax, 48(%esp) decl %esp movl %ebp, 56(%esp) movl $3, %esi decl %esp movl %edi, %edi incl %ecx calll *%esi ; │┌ @ range.jl:595 within `iterate' ; ││┌ @ int.jl:53 within `+' decl %eax addl $1, %ebx ; ││└ ; ││ @ range.jl:594 within `iterate' ; ││┌ @ promotion.jl:403 within `==' decl %eax cmpl $1000001, %ebx ## imm = 0xF4241 ; │└└ jne L112 decl %eax movl 8(%esp), %eax decl %eax movl 32(%esp), %ecx decl %eax movl %eax, (%ecx) decl %eax movl $275091464, %eax ## imm = 0x10659008 addl %eax, (%eax) addb %al, (%eax) decl %eax addl $72, %esp popl %ebx incl %ecx popl %esp incl %ecx popl %ebp incl %ecx popl %esi incl %ecx popl %edi popl %ebp retl nopw %cs:(%eax,%eax) ; └
The abstract case is similar
@btime f($fa)
29.230 ms (1999489 allocations: 30.51 MiB)
Note the large memory footprint
The machine code is also long and complex, although we omit details
Finally, let’s look at the parametrically typed version
@btime f($fc)
1.642 ns (0 allocations: 0 bytes)
Some of the time on the very first call is JIT compilation; after that, execution is essentially instantaneous, as the timing above shows
Here’s the corresponding machine code
@code_native f(fc)
.section __TEXT,__text,regular,pure_instructions ; ┌ @ In[34]:2 within `f' retl nopw %cs:(%eax,%eax) ; └
Much nicer…
Another way we can run into trouble is with abstract container types
Consider the following function, which essentially does the same job as Julia’s sum() function but acts only on numeric data
function sum_float_array(x::AbstractVector{<:Number})
sum = 0.0
for i ∈ eachindex(x)
sum += x[i]
end
return sum
end
sum_float_array (generic function with 1 method)
Calls to this function run very quickly
x = range(0, 1, length = Int(1e6))
x = collect(x)
typeof(x)
Array{Float64,1}
@btime sum_float_array($x)
994.252 μs (0 allocations: 0 bytes)
499999.9999999796
When Julia compiles this function, it knows that the data passed in as x will be an array of 64 bit floats
Hence it’s known to the compiler that the relevant method for + is always addition of floating point numbers
Moreover, the data can be arranged into contiguous 64 bit blocks of memory to simplify memory access
Finally, data types are stable — for example, the local variable sum starts off as a float and remains a float throughout
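If you want to verify the type-stability claim yourself, @code_warntype can be applied here as well; under the setup above it should report a concrete return type:

@code_warntype sum_float_array(x)   # should report Body::Float64, with sum inferred as Float64 throughout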
Here’s the same function minus the type annotation in the function signature
function sum_array(x)
sum = 0.0
for i ∈ eachindex(x)
sum += x[i]
end
return sum
end
sum_array (generic function with 1 method)
When we run it with the same array of floating point numbers it executes at a similar speed as the function with type information
@btime sum_array($x)
992.630 μs (0 allocations: 0 bytes)
499999.9999999796
The reason is that when sum_array() is first called on a vector of a given data type, a newly compiled version of the function is produced to handle that type
In this case, since we’re calling the function on a vector of floats, we get a compiled version of the function with essentially the same internal representation as sum_float_array()
Things get tougher for the interpreter when the data type within the array is imprecise
For example, the following snippet creates an array where the element type is Any
x = Any[ 1/i for i ∈ 1:1e6 ];
eltype(x)
Any
Now summation is much slower and memory management is less efficient
@btime sum_array($x)
21.258 ms (1000000 allocations: 15.26 MiB)
14.392726722864989
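One straightforward remedy, assuming the data really are floats, is to copy them into a concretely typed array before doing the work; this sketch is our own addition rather than part of the original lecture:

x_concrete = Float64.(x)        # copy the data into a Vector{Float64}
@show eltype(x_concrete)
@btime sum_array($x_concrete)   # should be back to the fast, allocation-free path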
Here are some final comments on performance
Writing fast Julia code amounts to writing Julia from which the compiler can generate efficient machine code.
For this, Julia needs to know about the type of data it’s processing as early as possible.
We could hard code the type of all variables and function arguments but this comes at a cost.
Our code becomes more cumbersome and less generic.
We are starting to lose the advantages that drew us to Julia in the first place.
Moreover, explicitly typing everything is not necessary for optimal performance.
The Julia compiler is smart and can often infer types perfectly well, without any performance cost.
What we really want to do is
Use functions to segregate operations into logically distinct blocks
Data types will be determined at function boundaries
If types are not supplied then they will be inferred
If types are stable and can be inferred effectively your functions will run fast, as in the sketch below
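Here is a small sketch of our own (the names process and inner_sum are hypothetical) showing these principles in action: the element type is pinned down at a function boundary, and the inner function is then compiled for the concrete type it receives:

# data might arrive in a loosely typed container
raw = Any[1/i for i ∈ 1:1_000]

function process(x)
    data = Float64.(x)      # the element type is pinned down here, at the function boundary
    return inner_sum(data)  # inner_sum is compiled specifically for Vector{Float64}
end

inner_sum(v) = sum(v)       # no type annotations needed; the types are inferred

@show process(raw)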
A good next stop for further reading is the relevant part of the Julia documentation
# julia version
versioninfo()
Julia Version 1.1.0 Commit 80516ca202 (2019-01-21 21:24 UTC) Platform Info: OS: macOS (x86_64-apple-darwin14.5.0) CPU: Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz WORD_SIZE: 64 LIBM: libopenlibm LLVM: libLLVM-6.0.1 (ORCJIT, broadwell)