Multithreading in Julia
Before we have a look at how Julia deals with the concept of multithreading, let us clarify what we are talking about.
What is multithreading?
In the terminology of computer science, a thread is the smallest sequence of instructions that can be managed by the scheduler of the operating system. It is often also called a lightweight process and is usually considered to exist inside the context of a process. Consequently, multithreading is the ability to manage multiple concurrently executing threads. Multiple threads share their resources, which makes this a quite powerful tool. The threads run on a single CPU or on multiple CPUs and give us the opportunity to leverage the full power of our computer (or cell phone, for that matter).
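To make the shared-resources aspect concrete, here is a minimal sketch in Julia (a hypothetical toy example, assuming Julia was started with more than one thread; @spawn schedules a task on any available thread):
using Base.Threads

# two concurrently running tasks write into the same array:
# memory is shared, no copying or message passing is needed
ids = zeros(Int, 2)
t = @spawn (ids[1] = threadid())   # may run on another thread
ids[2] = threadid()                # runs on the current thread
wait(t)
ids                                # both entries are visible to both tasks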
Back to Julia
By default, Julia will start with a single computational thread of execution:
julia> Threads.nthreads()
1
This does not mean that Julia is only using one thread. We mentioned the Basic Linear Algebra Subprograms (BLAS) before. Calls to this library (e.g. matrix-matrix multiplication) are multithreaded by default, so technically you have been doing multithreading all along ;). Let us illustrate this with a small example.
The BLAS module is a wrapper to the BLAS libraries used by Julia.
julia> using BenchmarkTools, LinearAlgebra
julia> A = rand(2000, 2000);
julia> B = rand(2000, 2000);
julia> BLAS.get_num_threads()
8
julia> @btime $A * $B;
141.984 ms (2 allocations: 30.52 MiB)
julia> BLAS.set_num_threads(1)
julia> @btime $A * $B;
1.009 s (2 allocations: 30.52 MiB)
In order to have multiple threads available you need to start Julia with the --threads <Int|auto> option or define the environment variable JULIA_NUM_THREADS=<Int>. With the command threadid() from the Threads module you can find out which thread you are currently running on.
> julia --threads auto
julia> Threads.nthreads()
16
julia> Threads.threadid()
1
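Alternatively, the same can be achieved with the environment variable (shown here for a POSIX shell):
> JULIA_NUM_THREADS=4 julia
julia> Threads.nthreads()
4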
By default, the Julia REPL, or the main Julia process for that matter, will always run on the thread with id 1.
We do not have the time for a deep dive into all the dirty details of proper multithreaded programming (race conditions, locks, atomic operations, thread-safe programming, ...), therefore we keep it light and simple with the @threads macro and introduce the needed concepts along the way.
Like all the other macros, it gives us the possibility to bring something rather complex into our code while still staying very readable, as the following example (run with 4 threads) shows.
julia> using Base.Threads
julia> nthreads()
4
julia> a = zeros(10);
julia> @threads for i in 1:10
           a[i] = threadid()
       end
julia> a
10-element Vector{Float64}:
1.0
1.0
1.0
2.0
2.0
2.0
3.0
3.0
4.0
4.0
Let us try to apply this concept to our example.
Multithreaded
The obvious first impulse is to just write a @threads in front of the loop in our in_unit_circle routine, so let's follow this impulse.
For reference, the results in this section are computed with 4 threads, and the original code has the following performance:
julia> @btime estimate_pi(in_unit_circle, N)
235.643 ms (0 allocations: 0 bytes)
3.14141152
Exercise: Implement in_unit_circle_threaded1 with the @threads macro and test the result as well as the timing.
using BenchmarkTools

function in_unit_circle_threaded1(N::Int64)
    M = 0
    @threads for i in 1:N
        if (rand()^2 + rand()^2) < 1
            M += 1
        end
    end
    return M
end
and we test it
julia> get_accuracy(in_unit_circle_threaded1, N)
2.2828585264557084
julia> @btime estimate_pi(in_unit_circle_threaded1, N);
21.899 s (843326330 allocations: 12.57 GiB)
Well, that is underwhelming. The result is wrong and it is slower. So what happened?
Atomic Operations
As we could see in the above example from the docs, the loop is automatically split up per index across the available threads. This means each of the threads performs the same loop body and, since context and memory are shared, also accesses the same storage. This is problematic for our variable M: each thread reads and writes the same variable, and therefore the result is not correct. A thread might overwrite the results of other threads, or several threads read the value at the same time but only one result is written back in the end. In short, the counter is totally wrong. We call this a race condition.
To solve this issue Julia supports atomic values. These allow us to access a variable in a thread-safe way and avoid race conditions. All primitive types can be wrapped with M = Atomic{Int64}(0) so that they can only be accessed in a thread-safe way. In order to do the atomic add we use the function atomic_add!(M, 1), and we can access the value with M[].
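As a small, self-contained sketch of these three operations (independent of our example; the final result is guaranteed regardless of the thread count):
julia> using Base.Threads

julia> counter = Atomic{Int64}(0);

julia> @threads for i in 1:1000
           atomic_add!(counter, 1)   # thread-safe increment
       end

julia> counter[]
1000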
Exercise: Implement in_unit_circle_threaded2 with the @threads macro and an atomic M, and test the result as well as the timing.
function in_unit_circle_threaded2(N::Int64)
    M = Atomic{Int64}(0)
    @threads for i in 1:N
        if (rand()^2 + rand()^2) < 1
            atomic_add!(M, 1)
        end
    end
    return M[]
end
and we test it
julia> get_accuracy(in_unit_circle_threaded2, N)
2.729346091356888e-5
julia> @btime estimate_pi(in_unit_circle_threaded2, N);
10.487 s (24 allocations: 2.03 KiB)
Now our result is correct, but the runtime is still worse than the serial version.
Actually distribute the work
We are still not fast because every atomic_add! forces the threads to synchronize on the shared counter before the new value can be added, so the increments are effectively serialized. To avoid this we need to eliminate the atomic variable again. We can actually split up the work quite neatly if we remember the example from the docs. It is possible to access the current thread id with threadid() and the number of threads with nthreads(). So why not define M as an array of length nthreads() and let each thread write to a separate entry of the array by using threadid() as index.
Exercise: Implement in_unit_circle_threaded3 with the @threads macro and M as an array, and test the result as well as the timing.
function in_unit_circle_threaded3(N::Int64)
    M = zeros(Int64, nthreads())
    @threads for i in 1:N
        if (rand()^2 + rand()^2) < 1
            @inbounds M[threadid()] += 1
        end
    end
    return sum(M)
end
and we test it
julia> get_accuracy(in_unit_circle_threaded3, N)
1.7652409621149445e-5
julia> @btime estimate_pi(in_unit_circle_threaded3, N);
2.857 s (23 allocations: 2.08 KiB)
Now our result is correct and we are faster than before, but still not faster than the serial version. Why is it still not working?
Global states
Without going into too much detail, rand() is not thread safe. It manipulates and reads from a global state, and that causes our slowdown. In fact, since the random numbers are no longer correctly distributed, the accuracy decays as well.
To avoid this we need to exchange the random number generator and make the call to rand
thread safe. This solution is inspired by the section Multithreading of the Parallel Computing Class on JuliaAcademy.com and slightly adapted for the setup we have.
The first step is to define a separate random number generator per thread:
using Random
const ThreadRNG = Vector{Random.MersenneTwister}(undef, nthreads())
@threads for i in 1:nthreads()
    ThreadRNG[threadid()] = Random.MersenneTwister()
end
What we do with the const statement is define a global variable whose type will not change. In fact, we define a Vector of size nthreads() and fill it with distinct Random.MersenneTwister instances. This allows us to have a different random number generator for each thread by using
rng = ThreadRNG[threadid()]
rand(rng)
in each thread.
Now for our final version of the code. The basic idea is to not have a threaded loop over the integer N, but over the number of threads. In order for this to work we need to figure out how many iterations each thread needs to perform. For that, we use len, rem = divrem(N, nthreads()) to divide N into the quotient and remainder of the Euclidean division.
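For example, splitting 10 iterations over 4 threads:
julia> len, rem = divrem(10, 4)
(2, 2)
Each thread then performs len = 2 iterations, and rem = 2 iterations are left over.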
Exercise: Implement in_unit_circle_threaded4 with the @threads macro, M as an array, and the above code snippets, and test the result as well as the timing. Extra points if you check that we do not lose any iterations due to the split.
using Random
const ThreadRNG = Vector{Random.MersenneTwister}(undef, nthreads())
@threads for i in 1:nthreads()
    ThreadRNG[threadid()] = Random.MersenneTwister()
end

function in_unit_circle_threaded4(N::Int64)
    M = zeros(Int64, nthreads())
    len, rem = divrem(N, nthreads())   # the rem leftover iterations are not executed below
    @threads for i in 1:nthreads()
        rng = ThreadRNG[threadid()]    # per-thread random number generator
        m = 0                          # thread-local counter, merged at the end
        for j in 1:len
            if (rand(rng)^2 + rand(rng)^2) < 1
                m += 1
            end
        end
        M[threadid()] = m
    end
    return sum(M)
end
and we test it
julia> get_accuracy(in_unit_circle_threaded4, N)
3.955314820203171e-5
julia> @btime estimate_pi(in_unit_circle_threaded4, N);
504.697 ms (22 allocations: 2.06 KiB)
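For the extra points: by the definition of divrem we always have len * nthreads() + rem == N, so the threaded loop above misses exactly rem iterations (at most nthreads() - 1), a negligible bias for large N. We can make the check explicit:
julia> len, rem = divrem(N, nthreads());

julia> len * nthreads() + rem == N
true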
Final results
For comparison, here are our final results for 4 computational threads:
in_unit_circle:           accuracy 9.43186408797203e-5    2.531 s     (0 allocations: 0 bytes)
in_unit_circle_threaded1: accuracy 2.2828585264557084     21.899 s    (843326330 allocations: 12.57 GiB)
in_unit_circle_threaded2: accuracy 2.729346091356888e-5   10.487 s    (24 allocations: 2.03 KiB)
in_unit_circle_threaded3: accuracy 1.7652409621149445e-5  2.857 s     (23 allocations: 2.08 KiB)
in_unit_circle_threaded4: accuracy 3.955314820203171e-5   504.697 ms  (22 allocations: 2.06 KiB)
Other pitfalls
There are several other pitfalls that might occur with multithreading, here is an incomplete list:
Oversubscription: We can overdo it with threading. For example, if we multithread an algorithm that calls a BLAS routine, every one of our threads may invoke a subroutine that itself tries to run on multiple threads. The threads then partially block each other and the overall performance drops, depending on the hardware we are working on (see the sketches after this list).
False sharing: The latency of the different layers of memory inside a CPU varies, and so does the way a core can access them. Usually the L3 cache is shared by all cores, but L2 and L1 are not. False sharing occurs when threads on different cores write to distinct variables that happen to live on the same cache line; the line then bounces between the cores' caches and performance suffers.
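As a hedged sketch of how to avoid oversubscription when mixing our own threads with BLAS (the matrix size 500 is only a placeholder):
using LinearAlgebra, Base.Threads

BLAS.set_num_threads(1)   # one BLAS thread per Julia thread avoids oversubscription
@threads for i in 1:nthreads()
    C = rand(500, 500) * rand(500, 500)   # each product now uses single-threaded BLAS
end
And a sketch of one common mitigation for false sharing: pad per-thread counters so that each one lands on its own cache line (the 64-byte cache line size is an assumption about the hardware):
using Base.Threads

const PAD = 8                            # 8 * sizeof(Int64) = 64 bytes, one assumed cache line
counts = zeros(Int64, PAD * nthreads())  # per-thread slots, one cache line apart
@threads for i in 1:1_000_000
    @inbounds counts[PAD * (threadid() - 1) + 1] += 1
end
sum(counts)                              # merge the padded per-thread counts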