GPU Acceleration

SatelliteGridding.jl supports accelerator execution through KernelAbstractions.jl. The same gridding kernels can run on the KA CPU backend, NVIDIA GPUs through CUDA.jl, and Apple GPUs through Metal.jl.

Setup

CUDA.jl and Metal.jl are weak dependencies. They are loaded only when you request that backend or explicitly import the package:

using SatelliteGridding

cuda_backend = resolve_backend("cuda")    # requires CUDA.jl
metal_backend = resolve_backend("metal")  # requires Metal.jl on macOS/Apple GPU

You can also use vendor packages directly:

using CUDA
using Metal

CUDA.devices()
Metal.devices()

Usage

Julia API

grid_l2(config, grid_spec, time_spec;
        backend = resolve_backend("cuda"),
        outfile = "output.nc")

grid_l2(config, grid_spec, time_spec;
        backend = resolve_backend("metal"),
        outfile = "output_metal.nc")

CLI

julia --project=. bin/grid.jl l2 \
    --config examples/tropomi_sif.toml \
    --backend cuda \
    -o output.nc

julia --project=. bin/grid.jl l2 \
    --config examples/tropomi_sif.toml \
    --backend metal \
    -o output_metal.nc

GPU backends support quadrilateral footprints (--footprint quad) and circular footprints (--footprint circle, CircularFootprintGridding). The circular path uses a separate index kernel that masks samples outside the inferred circle or ellipse before scatter accumulation.

Architecture

The GPU pipeline consists of three KernelAbstractions kernels that run entirely on the GPU. Quadrilateral and circular footprints share the same scatter pattern, but use different index-generation kernels.

Kernel 1: Corner Sorting

sort_corners_ccw_ka!(backend, lat_corners, lon_corners)

Sorts all footprint corners into CCW order in parallel using a 5-comparator sorting network. Each thread handles one footprint.

Kernel 2: Subpixel Index Computation

compute_footprint_indices_ka!(backend, ix, iy, skip_flag,
                               lat_corners, lon_corners, n)

Computes n×n subpixel grid cell indices per footprint. Each thread handles one footprint and produces n² index pairs. The skip_flag indicates:

  • 0 = oversample (n×n subpixels computed)
  • 1 = fast path (all corners in same cell, single index)
  • 2 = skip (footprint too wide for meaningful oversampling)

For circular footprints:

compute_circular_footprint_indices_ka!(backend, ix, iy, inside_count,
                                       skip_flag, center_lat, center_lon,
                                       lat_corners, lon_corners, n)

This kernel samples the footprint bounding box, keeps only samples inside the inferred circle/ellipse, and records inside_count so scatter weights are normalized as 1 / inside_count.

Kernel 3: Scatter-Accumulate

scatter_accumulate_ka!(backend, grid_sum, grid_weights,
                        ix, iy, skip_flag, values, n, n_vars)

Atomically adds weighted values to the grid cells using @atomic. On GPU backends this maps to backend-supported atomic operations. Each thread handles one footprint and scatters its n² subpixel contributions.

Circular footprints use scatter_accumulate_circular_ka!, which skips masked samples and uses per-footprint normalization.

After processing all files for a time step, finalize_mean! divides sums by weights to compute the mean.

Performance Considerations

Data Transfer

Grid accumulators are allocated on the GPU and stay there across files. Each file's input data is transferred to the GPU for processing. The bottleneck is typically I/O (reading NetCDF files) rather than computation.

Memory

GPU memory usage scales with:

  • n_fp × n² × sizeof(Int32) × 2 for index arrays (ix, iy)
  • n_fp × n_vars × sizeof(Float32) for values
  • n_lon × n_lat × (n_vars + 1) × sizeof(Float32) for grid accumulators

For a typical TROPOMI file (~500k soundings, n=10, 6 variables):

  • Index arrays: ~400 MB
  • Values: ~12 MB
  • Grid (360×180): ~1.5 MB

Batch Size

Large files may need to be processed in batches to fit in GPU memory. The accumulate_batch! function handles one batch at a time.

Backend Comparison

AspectSequentialKA CPUKA CUDAKA Metal
AlgorithmWelford (incremental mean)Sum-basedSum-basedSum-based
ParallelismNoneMulti-threadedGPU-parallelGPU-parallel
STD supportSingle-passTwo-passTwo-passTwo-pass
Atomic opsNot neededNot needed (sequential scatter)Backend atomicsBackend atomics
Typical hardwareAny CPUAny CPUNVIDIA GPUApple GPU

The KA CPU backend uses sequential scatter (no atomics) for the accumulation step, avoiding the overhead of software atomics on CPU. The sorting and subpixel kernels still run in parallel via KernelAbstractions threading.