GPU Acceleration
SatelliteGridding.jl supports accelerator execution through KernelAbstractions.jl (KA). The same gridding kernels can run on the KA CPU backend, on NVIDIA GPUs through CUDA.jl, and on Apple GPUs through Metal.jl.
Setup
CUDA.jl and Metal.jl are weak dependencies. They are loaded only when you request that backend or explicitly import the package:
using SatelliteGridding
cuda_backend = resolve_backend("cuda") # requires CUDA.jl
metal_backend = resolve_backend("metal") # requires Metal.jl on macOS/Apple GPU
You can also use vendor packages directly:
using CUDA
using Metal
CUDA.devices()
Metal.devices()
Usage
Julia API
grid_l2(config, grid_spec, time_spec;
        backend = resolve_backend("cuda"),
        outfile = "output.nc")
grid_l2(config, grid_spec, time_spec;
        backend = resolve_backend("metal"),
        outfile = "output_metal.nc")
CLI
julia --project=. bin/grid.jl l2 \
    --config examples/tropomi_sif.toml \
    --backend cuda \
    -o output.nc
julia --project=. bin/grid.jl l2 \
    --config examples/tropomi_sif.toml \
    --backend metal \
    -o output_metal.nc
GPU backends support quadrilateral footprints (--footprint quad) and circular footprints (--footprint circle, CircularFootprintGridding). The circular path uses a separate index kernel that masks samples outside the inferred circle or ellipse before scatter accumulation.
Architecture
The GPU pipeline consists of three KernelAbstractions kernels that run entirely on the GPU. Quadrilateral and circular footprints share the same scatter pattern, but use different index-generation kernels.
Kernel 1: Corner Sorting
sort_corners_ccw_ka!(backend, lat_corners, lon_corners)
Sorts all footprint corners into counter-clockwise (CCW) order in parallel using a 5-comparator sorting network. Each thread handles one footprint.
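
To illustrate the idea behind the corner sort, here is a minimal CPU sketch for a single footprint. It is not the package's kernel: the helper names `compare_swap!` and `sort_corners_ccw_sketch!` are hypothetical, and sorting by angle around the centroid is an assumption about how the CCW order is derived. It also treats longitude and latitude as planar coordinates and ignores date-line wrapping.

```julia
# Hypothetical helpers for illustration only, not the package's kernel.

# Swap entries i and j when the angle at i is larger than the angle at j.
function compare_swap!(angles, lat, lon, i, j)
    if angles[i] > angles[j]
        angles[i], angles[j] = angles[j], angles[i]
        lat[i], lat[j] = lat[j], lat[i]
        lon[i], lon[j] = lon[j], lon[i]
    end
    return nothing
end

function sort_corners_ccw_sketch!(lat::AbstractVector, lon::AbstractVector)
    @assert length(lat) == 4 && length(lon) == 4
    clat = sum(lat) / 4
    clon = sum(lon) / 4
    # Angle of each corner around the centroid (assumed sort key).
    # Ascending angle is CCW when longitude is x and latitude is y.
    angles = [atan(lat[k] - clat, lon[k] - clon) for k in 1:4]
    # Standard 4-input sorting network: 5 comparators, no data-dependent loops.
    compare_swap!(angles, lat, lon, 1, 2)
    compare_swap!(angles, lat, lon, 3, 4)
    compare_swap!(angles, lat, lon, 1, 3)
    compare_swap!(angles, lat, lon, 2, 4)
    compare_swap!(angles, lat, lon, 2, 3)
    return lat, lon
end

# Unit square with its corners given in a scrambled order.
lat = [1.0, 0.0, 1.0, 0.0]
lon = [0.0, 0.0, 1.0, 1.0]
sort_corners_ccw_sketch!(lat, lon)
```

A fixed comparator sequence suits GPU threads because every footprint executes the same five compare-and-swap steps regardless of its input order.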
Kernel 2: Subpixel Index Computation
compute_footprint_indices_ka!(backend, ix, iy, skip_flag,
                              lat_corners, lon_corners, n)
Computes n×n subpixel grid cell indices per footprint. Each thread handles one footprint and produces n² index pairs. The skip_flag indicates:
- 0 = oversample (n×n subpixels computed)
- 1 = fast path (all corners in same cell, single index)
- 2 = skip (footprint too wide for meaningful oversampling)
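
The three cases can be sketched on the CPU for one footprint as follows. Everything here is an illustrative assumption rather than the package's implementation: the `GridSketch` type, the `max_cells` threshold used to decide that a footprint is too wide, the bilinear interpolation between corners, and the return layout are all hypothetical.

```julia
# Hypothetical regular grid description, for illustration only.
struct GridSketch
    lon_min::Float64
    lat_min::Float64
    dlon::Float64
    dlat::Float64
end

# 1-based cell index of a coordinate along one axis.
cell_index(x, x_min, dx) = floor(Int32, (x - x_min) / dx) + Int32(1)

# Corners are expected in CCW order. `max_cells` is an assumed threshold.
function footprint_indices_sketch(grid::GridSketch, lat, lon, n; max_cells = 50)
    cx = [cell_index(lon[k], grid.lon_min, grid.dlon) for k in 1:4]
    cy = [cell_index(lat[k], grid.lat_min, grid.dlat) for k in 1:4]
    # Flag 1: all corners share one cell, so a single index is enough.
    if all(==(cx[1]), cx) && all(==(cy[1]), cy)
        return Int32(1), [cx[1]], [cy[1]]
    end
    # Flag 2: the footprint spans too many cells to oversample meaningfully.
    if maximum(cx) - minimum(cx) > max_cells || maximum(cy) - minimum(cy) > max_cells
        return Int32(2), Int32[], Int32[]
    end
    # Flag 0: bilinear interpolation between corners at n×n subpixel centres.
    ix = Vector{Int32}(undef, n * n)
    iy = Vector{Int32}(undef, n * n)
    for j in 1:n, i in 1:n
        u = (i - 0.5) / n
        v = (j - 0.5) / n
        w1, w2, w3, w4 = (1 - u) * (1 - v), u * (1 - v), u * v, (1 - u) * v
        x = w1 * lon[1] + w2 * lon[2] + w3 * lon[3] + w4 * lon[4]
        y = w1 * lat[1] + w2 * lat[2] + w3 * lat[3] + w4 * lat[4]
        p = (j - 1) * n + i
        ix[p] = cell_index(x, grid.lon_min, grid.dlon)
        iy[p] = cell_index(y, grid.lat_min, grid.dlat)
    end
    return Int32(0), ix, iy
end

grid = GridSketch(-180.0, -90.0, 1.0, 1.0)
# A 2°×2° footprint on a 1° grid, oversampled with n = 2.
flag, ix, iy = footprint_indices_sketch(grid, [20.0, 20.0, 22.0, 22.0],
                                        [10.0, 12.0, 12.0, 10.0], 2)
```

In this example each of the four subpixel centres lands in a different grid cell, so the footprint contributes a quarter of its weight to each.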
For circular footprints:
compute_circular_footprint_indices_ka!(backend, ix, iy, inside_count,
                                       skip_flag, center_lat, center_lon,
                                       lat_corners, lon_corners, n)
This kernel samples the footprint bounding box, keeps only samples inside the inferred circle/ellipse, and records inside_count so scatter weights are normalized as 1 / inside_count.
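
The masking and normalization step can be pictured with a small CPU sketch. It assumes a plain circle with a known radius in planar lat/lon units; the function name `circular_samples_sketch` is hypothetical, and how the package infers the circle or ellipse from the corners is not shown here.

```julia
# Hypothetical illustration of bounding-box sampling with a circular mask.
function circular_samples_sketch(center_lat, center_lon, radius, n)
    inside = falses(n, n)
    for j in 1:n, i in 1:n
        # Sample centres spread evenly over the circle's bounding box.
        x = center_lon - radius + (i - 0.5) * 2 * radius / n
        y = center_lat - radius + (j - 0.5) * 2 * radius / n
        inside[i, j] = (x - center_lon)^2 + (y - center_lat)^2 <= radius^2
    end
    inside_count = count(inside)
    # Each kept sample carries 1 / inside_count, so the footprint's total
    # weight stays 1 no matter how many samples were masked out.
    weight = inside_count > 0 ? 1 / inside_count : 0.0
    return inside, inside_count, weight
end

inside, inside_count, weight = circular_samples_sketch(0.0, 0.0, 1.0, 4)
```

With n = 4 only the four bounding-box corner samples fall outside the unit circle, so 12 of the 16 samples are kept and each receives a weight of 1/12.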
Kernel 3: Scatter-Accumulate
scatter_accumulate_ka!(backend, grid_sum, grid_weights,
                       ix, iy, skip_flag, values, n, n_vars)
Atomically adds weighted values to the grid cells using @atomic. On GPU backends this maps to backend-supported atomic operations. Each thread handles one footprint and scatters its n² subpixel contributions.
Circular footprints use scatter_accumulate_circular_ka!, which skips masked samples and uses per-footprint normalization.
After processing all files for a time step, finalize_mean! divides sums by weights to compute the mean.
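
As a rough picture of how scatter accumulation and the final mean fit together, the following sequential sketch accumulates weighted sums and then divides by the weights. The function names, the array layouts (`ix` and `iy` as n²×n_fp matrices, `values` as n_fp×n_vars) and the 1/n² subpixel weight are assumptions for illustration. On a GPU the `+=` updates would have to be atomic, because several threads can hit the same cell.

```julia
# Hypothetical sequential stand-in for the scatter step.
function scatter_accumulate_sketch!(grid_sum, grid_weights, ix, iy, skip_flag, values, n)
    n_fp, n_vars = size(values)
    for f in 1:n_fp
        flag = skip_flag[f]
        flag == 2 && continue            # footprint was marked as skipped
        n_sub = flag == 1 ? 1 : n * n    # fast path writes a single cell
        w = 1.0f0 / n_sub                # assumed equal weight per subpixel
        for p in 1:n_sub
            i, j = ix[p, f], iy[p, f]
            grid_weights[i, j] += w
            for v in 1:n_vars
                grid_sum[i, j, v] += w * values[f, v]
            end
        end
    end
    return nothing
end

# Hypothetical stand-in for the finalization step: mean = sum / weight.
function finalize_mean_sketch!(grid_mean, grid_sum, grid_weights)
    for v in axes(grid_sum, 3), j in axes(grid_sum, 2), i in axes(grid_sum, 1)
        w = grid_weights[i, j]
        grid_mean[i, j, v] = w > 0 ? grid_sum[i, j, v] / w : NaN32
    end
    return grid_mean
end

n = 2
ix = Int32[1 1; 1 0; 2 0; 2 0]     # column f holds the indices of footprint f
iy = Int32[1 1; 1 0; 1 0; 1 0]
skip_flag = Int32[0, 1]            # footprint 1 oversampled, footprint 2 fast path
values = reshape(Float32[10, 20], 2, 1)
grid_sum = zeros(Float32, 2, 2, 1)
grid_weights = zeros(Float32, 2, 2)
scatter_accumulate_sketch!(grid_sum, grid_weights, ix, iy, skip_flag, values, n)
grid_mean = finalize_mean_sketch!(similar(grid_sum), grid_sum, grid_weights)
```

Cells that received no contribution keep a weight of zero; this sketch writes NaN there, which is a choice made for the example rather than documented package behavior.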
Performance Considerations
Data Transfer
Grid accumulators are allocated on the GPU and stay there across files. Each file's input data is transferred to the GPU for processing. The bottleneck is typically I/O (reading NetCDF files) rather than computation.
Memory
GPU memory usage scales with:
- n_fp × n² × sizeof(Int32) × 2 for index arrays (ix, iy)
- n_fp × n_vars × sizeof(Float32) for values
- n_lon × n_lat × (n_vars + 1) × sizeof(Float32) for grid accumulators
For a typical TROPOMI file (~500k soundings, n=10, 6 variables):
- Index arrays: ~400 MB
- Values: ~12 MB
- Grid (360×180): ~1.8 MB (six variable layers plus one weight layer)
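
The estimates above follow directly from the three scaling terms. The helper below, whose name `gpu_memory_estimate` is hypothetical, evaluates them for the TROPOMI example. Note that the grid term counts n_vars + 1 layers, which gives about 1.8 MB; the six variable layers alone account for about 1.5 MB.

```julia
# Hypothetical helper that evaluates the three memory terms in bytes.
function gpu_memory_estimate(n_fp, n, n_vars, n_lon, n_lat)
    index_bytes = n_fp * n^2 * sizeof(Int32) * 2                 # ix and iy
    value_bytes = n_fp * n_vars * sizeof(Float32)                # input values
    grid_bytes  = n_lon * n_lat * (n_vars + 1) * sizeof(Float32) # sums + weights
    return (index = index_bytes, values = value_bytes, grid = grid_bytes)
end

est = gpu_memory_estimate(500_000, 10, 6, 360, 180)
for (name, bytes) in pairs(est)
    println(rpad(string(name), 8), round(bytes / 1e6; digits = 1), " MB")
end
```

The index arrays dominate because they grow with n², so halving n cuts that term by a factor of four.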
Batch Size
Large files may need to be processed in batches to fit in GPU memory. The accumulate_batch! function handles one batch at a time.
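
One way to think about batching is to pick a batch size from a memory budget and then walk over footprint ranges. The sketch below is an assumption-laden illustration: `batch_size_for` and `foreach_batch` are hypothetical names, and `process_batch` is a stand-in callback, not the signature of accumulate_batch!.

```julia
# Largest number of footprints whose index and value arrays fit in the budget.
function batch_size_for(budget_bytes, n, n_vars)
    per_footprint = n^2 * sizeof(Int32) * 2 + n_vars * sizeof(Float32)
    return max(1, budget_bytes ÷ per_footprint)
end

# `process_batch` is a hypothetical stand-in for the per-batch gridding step.
function foreach_batch(process_batch, n_fp, batch_size)
    for rows in Iterators.partition(1:n_fp, batch_size)
        process_batch(rows)
    end
    return nothing
end

# 100 MB budget for per-footprint arrays, n = 10, 6 variables.
batch_size = batch_size_for(100 * 10^6, 10, 6)
batch_lengths = Int[]
foreach_batch(rows -> push!(batch_lengths, length(rows)), 500_000, batch_size)
```

Because the grid accumulators stay on the device across batches, only the per-footprint arrays need to fit within the budget for each batch.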
Backend Comparison
| Aspect | Sequential | KA CPU | KA CUDA | KA Metal |
|---|---|---|---|---|
| Algorithm | Welford (incremental mean) | Sum-based | Sum-based | Sum-based |
| Parallelism | None | Multi-threaded | GPU-parallel | GPU-parallel |
| STD support | Single-pass | Two-pass | Two-pass | Two-pass |
| Atomic ops | Not needed | Not needed (sequential scatter) | Backend atomics | Backend atomics |
| Typical hardware | Any CPU | Any CPU | NVIDIA GPU | Apple GPU |
The KA CPU backend uses sequential scatter (no atomics) for the accumulation step, avoiding the overhead of software atomics on CPU. The sorting and subpixel kernels still run in parallel via KernelAbstractions threading.
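
The difference between the two algorithm rows in the table can be shown with a small comparison. This is a generic sketch of a weighted Welford update versus a sum-based two-pass calculation, not code taken from the package; the function names are hypothetical, and the weighted population form of the standard deviation is an assumption.

```julia
# Welford-style update: mean and spread are maintained in a single pass.
function welford_sketch(xs, ws)
    wsum = 0.0
    mu = 0.0
    m2 = 0.0
    for (x, w) in zip(xs, ws)
        wsum += w
        delta = x - mu
        mu += (w / wsum) * delta
        m2 += w * delta * (x - mu)
    end
    return mu, sqrt(m2 / wsum)
end

# Sum-based: accumulate sums for the mean, then a second pass for the spread.
function sum_based_sketch(xs, ws)
    wsum = sum(ws)
    mu = sum(ws .* xs) / wsum
    sigma = sqrt(sum(ws .* (xs .- mu) .^ 2) / wsum)
    return mu, sigma
end

xs = [1.0, 2.0, 3.0, 4.0]
ws = [1.0, 1.0, 1.0, 1.0]
mean_w, std_w = welford_sketch(xs, ws)
mean_s, std_s = sum_based_sketch(xs, ws)
```

The Welford update depends on the running mean, which makes it inherently sequential. Plain sums are order-independent and can be accumulated with atomic adds, which is why they fit the parallel backends, at the cost of a second pass for the standard deviation.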