GPU Parallelization
KomaMRI uses a vendor-agnostic approach to GPU parallelization in order to support multiple GPU backends. The following backends are currently supported:
- CUDA.jl (Nvidia)
- Metal.jl (Apple)
- AMDGPU.jl (AMD)
- oneAPI.jl (Intel)
Choosing a GPU Backend
To determine which backend to use, KomaMRI relies on package extensions (introduced in Julia 1.9), which avoids listing the package for each GPU backend as an explicit dependency. This means the user is responsible for loading the backend package (e.g. using CUDA) at the beginning of their code, or before calling KomaUI(); otherwise, Koma will fall back to the CPU:
using KomaMRI
using CUDA # loading CUDA will load KomaMRICoreCUDAExt, selecting the backend
Once this is done, no further action is needed! The simulation objects will automatically be moved to the GPU, and moved back once the simulation is finished. When the simulation runs, a message is shown with either the GPU device being used or, if running on the CPU, the number of CPU threads.
Of course, it is still possible to move objects to the GPU manually, and control precision using the f32 and f64 functions:
x = rand(100)
x |> f32 |> gpu # Float32 CuArray
To change the precision used for the entire simulation, set the sim_params["precision"] parameter to either "f32" or "f64" (note that on most GPUs, Float32 operations are considerably faster than Float64 operations). In addition, the sim_params["gpu"] option can be set to true or false to enable or disable GPU functionality (if set to true, the backend package still needs to be loaded beforehand):
using KomaMRI
using CUDA
sys = Scanner()
obj = brain_phantom2D()
seq = PulseDesigner.EPI_example()
# Simulate on the GPU using 32-bit floating point values
sim_params = Dict{String,Any}(
    "Nblocks" => 20,
    "gpu" => true,
    "precision" => "f32",
    "sim_method" => Bloch(),
)
simulate(obj, seq, sys; sim_params)
How Objects Are Moved to the GPU
Koma's gpu function calls a separate gpu method that takes a backend parameter of type <:KernelAbstractions.GPU for the backend in use. This method then calls the fmap function from Functors.jl to recursively call adapt from Adapt.jl on each field of the object being transferred. This is similar to how many other Julia packages, such as Flux.jl, transfer data to the GPU. An important difference, however, is that KomaMRI adapts directly to the KernelAbstractions.Backend type in order to use the adapt_storage functions defined in each backend package, rather than defining custom adapters, which results in an implementation with fewer lines of code.
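The recursive transfer described above can be illustrated with a minimal sketch. It assumes Functors.jl, Adapt.jl and KernelAbstractions.jl are installed. The ToyPhantom struct and the to_backend helper are hypothetical names used only for illustration; they are not KomaMRI types or functions. The CPU backend is used so that the sketch runs without a GPU.

```julia
using Adapt, Functors, KernelAbstractions

# Hypothetical container standing in for a simulation object (not a KomaMRI type)
struct ToyPhantom{V<:AbstractVector}
    x::V
    T1::V
end
Functors.@functor ToyPhantom   # let fmap walk the fields of the struct

# Hypothetical helper: recursively adapt every leaf of `obj` to `backend`
to_backend(backend::KernelAbstractions.Backend, obj) =
    fmap(leaf -> adapt(backend, leaf), obj)

obj = ToyPhantom(rand(4), rand(4))
moved = to_backend(CPU(), obj)   # with CPU() the fields remain ordinary Arrays
```

With a GPU package loaded, passing that package's backend object (for example CUDABackend() from CUDA.jl) instead of CPU() would be expected to return the same struct with device arrays in its fields.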
Inside the Simulation
KomaMRI has three different simulation methods, all of which can run on the GPU:
- BlochSimple: BlochSimple.jl
- BlochDict: BlochDict.jl
- Bloch: BlochCPU.jl / BlochGPU.jl
BlochSimple is the simplest method and prioritizes readability.
BlochDict can be understood as an extension of BlochSimple that outputs a more detailed signal.
Bloch is equivalent to BlochSimple in the operations it performs, but is much faster, since it has been optimized for both the CPU and the GPU. The CPU implementation prioritizes conserving memory and makes extensive use of preallocation for the simulation arrays. Unlike the GPU implementation, it does not allocate a matrix of size (Number of Spins) x (Number of Time Points) in each block; instead, it uses a for loop to step through time.
In contrast, the GPU implementation divides work among as many threads as possible at the beginning of the run_spin_precession! and run_spin_excitation! functions. For the CPU implementation, this would not be beneficial, since far fewer threads are available on the CPU than on the GPU. Preallocation is also used, via the same prealloc function used in BlochCPU.jl, where a struct of arrays is allocated at the beginning of the simulation and reused in each simulation block. In addition, a precalc function is called before moving the simulation objects to the GPU, so that certain calculations that are faster on the CPU are done beforehand.
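The trade-off between the two strategies can be sketched in plain Julia on a toy free-precession problem. This is only an illustration under simplifying assumptions (off-resonance only, no relaxation, no gradients); the function names signal_matrix and signal_loop! are hypothetical and do not correspond to KomaMRI's internals.

```julia
# Toy problem: Nspins isochromats precessing freely over Nt time steps
Nspins, Nt = 1_000, 200
Δt   = 1e-5                           # time step [s]
Δω   = 2π .* 100 .* randn(Nspins)     # off-resonance of each spin [rad/s]
Mxy0 = ones(ComplexF64, Nspins)       # initial transverse magnetization

# Matrix strategy: build the full Nspins x Nt phase matrix at once.
# Easy to parallelize over many threads, but memory grows with Nspins * Nt.
function signal_matrix(Mxy0, Δω, Δt, Nt)
    t = Δt .* (1:Nt)'                 # 1 x Nt row of time points
    ϕ = Δω .* t                       # Nspins x Nt matrix of phases
    return vec(sum(Mxy0 .* cis.(-ϕ); dims=1))
end

# Loop strategy: step through time, reusing one preallocated phase vector.
function signal_loop!(sig, ϕ, Mxy0, Δω, Δt)
    fill!(ϕ, 0)
    for k in eachindex(sig)
        ϕ .+= Δω .* Δt                # in-place phase update for this step
        sig[k] = sum(Mxy0[i] * cis(-ϕ[i]) for i in eachindex(ϕ))
    end
    return sig
end

sig_buf = zeros(ComplexF64, Nt)       # buffers allocated once, reusable per block
ϕ_buf   = zeros(Float64, Nspins)

s_matrix = signal_matrix(Mxy0, Δω, Δt, Nt)
s_loop   = signal_loop!(sig_buf, ϕ_buf, Mxy0, Δω, Δt)
```

Both functions compute the same signal. The loop version only ever holds vectors of length Nspins and Nt, which is the memory-conserving behavior described for the CPU implementation.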
Compared with BlochSimple, which only uses array broadcasting for parallelization, Bloch also uses kernel-based methods in its run_spin_excitation! function for operations that need to be done sequentially. The kernel implementation stores the arrays needed to apply the spin excitation in shared memory for fast access, and separates the complex arrays into real and imaginary components to avoid bank conflicts.
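The general shape of such a kernel can be sketched with KernelAbstractions.jl. This is a simplified sketch, not the KomaMRI kernel: the name precess_kernel! is hypothetical, the physics is reduced to a fixed per-step rotation, and shared memory is left out for brevity. It only shows one thread per spin stepping sequentially through time, with the complex magnetization stored as separate real and imaginary arrays. It runs on the CPU backend so no GPU is required.

```julia
using KernelAbstractions

# Hypothetical kernel: each work-item handles one spin and loops over time
@kernel function precess_kernel!(M_re, M_im, @Const(Δω), Δt, Nt)
    i = @index(Global, Linear)
    re, im_ = M_re[i], M_im[i]
    s, c = sincos(-Δω[i] * Δt)        # rotation by -Δω*Δt per time step
    for _ in 1:Nt                     # sequential in time within a thread
        re, im_ = c * re - s * im_, s * re + c * im_
    end
    M_re[i] = re
    M_im[i] = im_
end

Nspins, Nt = 1_024, 100
Δt   = 1f-5
Δω   = Float32.(2π .* 100 .* randn(Nspins))
M_re = ones(Float32, Nspins)          # real part of Mxy
M_im = zeros(Float32, Nspins)         # imaginary part of Mxy

backend = get_backend(M_re)           # CPU() for plain Arrays
kernel! = precess_kernel!(backend, 64)            # workgroup size of 64
kernel!(M_re, M_im, Δω, Δt, Nt; ndrange=Nspins)
KernelAbstractions.synchronize(backend)
```

If the arrays were allocated on a GPU backend instead, get_backend would return that backend and the same kernel definition would be compiled for it.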
The performance differences between Bloch and BlochSimple can be seen on the KomaMRI benchmarks page. The first data point dates from when Bloch was implemented as what is now BlochSimple, before a more optimized implementation was created. The following three pull requests are primarily responsible for the performance differences between Bloch and BlochSimple: