Descend: A Safe GPU Systems Programming Language
Authors: Bastian Köpcke, Sergei Gorlatch, Michel Steuwer — PLDI 2024 (arXiv)
Summary
A Rust-inspired systems language that brings memory safety to GPU programming via:
- Ownership + lifetimes extended to track CPU and GPU memory across host/device boundaries
- Views — abstractions describing safe parallel access patterns over memory regions; compiled into index computations (Lift/DPIA approach)
- Hierarchical scheduling — grid → blocks → threads encoded in the type system
- Compiles to CUDA C (not LLVM IR or MLIR)
- Zero overhead vs hand-written CUDA on reduction, transpose, scan, matmul benchmarks
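The ownership discipline Descend extends to GPU memory can be sketched with plain CPU-side Rust (an illustrative analogy, not Descend's actual syntax): `split_at_mut` hands out disjoint mutable borrows, so two threads can write in parallel with no possibility of a data race, and any overlapping borrow would be rejected at compile time — the same guarantee Descend's views provide for thread/block partitions of device memory.

```rust
// Sketch of the ownership guarantee Descend carries over to GPU code:
// split_at_mut produces two non-overlapping mutable "views" of one
// buffer, so each scoped thread owns its half exclusively.
fn main() {
    let mut data = vec![0u32; 8];
    let (lo, hi) = data.split_at_mut(4); // disjoint views, checked statically
    std::thread::scope(|s| {
        s.spawn(move || lo.iter_mut().for_each(|x| *x = 1));
        s.spawn(move || hi.iter_mut().for_each(|x| *x = 2));
    });
    // both threads joined at scope exit; no race was possible
    assert_eq!(data, [1, 1, 1, 1, 2, 2, 2, 2]);
    println!("ok");
}
```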
Relevance to sammine-lang Phase 7
| Descend concept | sammine-lang / MLIR equivalent |
| --- | --- |
| Views (safe parallel access patterns) | linalg.generic indexing maps |
| Hierarchical GPU scheduling | MLIR gpu dialect + scf.parallel |
| Ownership/lifetimes for memory safety | Not yet implemented (manual alloc/free) |
| Compiles to CUDA text | MLIR → LLVM IR → native; could add gpu dialect for GPU targets |
Key Insights
- MLIR is better positioned than Descend's direct-to-CUDA approach: the built-in gpu and linalg dialects plus tensor bufferization provide GPU codegen through existing lowering pipelines (gpu-to-nvvm, gpu-to-rocdl)
- Views ≈ linalg indexing maps: Descend's statically-verified access patterns map directly to linalg.generic's affine indexing maps; no need to invent a new type — tensor<NxMxf64> + linalg ops encode these patterns
- Ownership is the big missing piece: preventing GPU data races requires Rust-style borrow checking in the type system — major undertaking, but Descend shows it's feasible with zero overhead
- Phase 8 (user MLIR blocks) could let users write gpu.launch kernels directly, sidestepping the need for a full scheduling hierarchy in sammine's type system
Lift: A Functional Data-Parallel IR for High-Performance GPU Code Generation
Authors: Michel Steuwer, Toomas Remmelg, Christophe Dubach — CGO 2017 (PDF)
Summary
Functional IR where GPU programs are compositions of small data-parallel patterns. The compiler uses algebraic rewrite rules to transform high-level patterns into low-level GPU code, automatically exploring thousands of valid implementations. Performance on par with hand-optimized OpenCL.
Core Patterns
- High-level: map, reduce, zip, split, join, iterate, reorder, transpose
- Low-level (OpenCL-mapped):
  - mapGlobal/mapWorkgroup/mapLocal/mapSeq — map onto the GPU thread hierarchy
  - toGlobal/toLocal/toPrivate — map onto OpenCL address spaces
  - asVector/asScalar/Vectorize — SIMD vectorization
- Rewrite rules: algebraic transforms, e.g. map(f) ∘ map(g) → map(f ∘ g) (fusion)
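The fusion rule can be checked directly with iterator closures standing in for the patterns (an illustrative model, not Lift's IR): both sides of the rewrite compute the same result, but the fused form makes a single pass and needs no intermediate array.

```rust
// Rewrite-rule sketch: map(f) ∘ map(g) → map(f ∘ g) (fusion).
// Semantics are preserved; the fused form traverses the data once.
fn main() {
    let f = |x: i32| x + 1;
    let g = |x: i32| x * 2;
    let xs = [1, 2, 3, 4];
    // left-hand side: map(g) then map(f) — conceptually two traversals
    let unfused: Vec<i32> = xs.iter().map(|&x| g(x)).map(f).collect();
    // right-hand side after fusion: one traversal of map(f ∘ g)
    let fused: Vec<i32> = xs.iter().map(|&x| f(g(x))).collect();
    assert_eq!(unfused, fused); // the rewrite is semantics-preserving
    println!("{:?}", fused); // [3, 5, 7, 9]
}
```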
5-Stage Compilation Pipeline
1. High-level expression — user writes reduce(+, 0) ∘ map(f) over arrays
2. Rewrite rules — transform into low-level OpenCL-mapped patterns (mapGlobal, toLocal, etc.)
3. Type analysis — array types carry size information, from which the compiler infers thread counts
4. Memory allocation + views — compiler builds views (internal data structures describing how to index into memory); array accesses, tiling, and padding are handled without actual memory movement
5. Barrier elimination + OpenCL codegen — barriers inserted only where data dependencies require them; emit the OpenCL kernel
Key Insight: Views
Performance-sensitive details (memory layout, indexing, synchronization) are NOT explicit in the IR. The compiler exploits pattern semantics to:
- Infer memory allocation sizes
- Compute multi-dimensional array indices
- Insert barriers only where data dependencies require them
- Eliminate redundant copies
Performance
- On par with hand-optimized OpenCL
- Compiler can explore 50,000 valid OpenCL kernel variants from a single expression, each provably correct
Rise: The MLIR Successor
- Rise is Lift's spiritual successor, implemented as an MLIR dialect
- Lowers functional patterns through linalg, scf, and std dialects
- Rise's existence directly influenced MLIR's linalg dialect design
- Compiler: Shine — produces C, OpenMP, OpenCL, or CUDA
- MLIR integration: rise-lang/mlir
Relevance to sammine-lang (most directly compatible paper)
| Lift/Rise concept | sammine-lang connection |
| --- | --- |
| Functional pattern IR (map, reduce, zip) | Could become built-in higher-order functions |
| Rewrite-rule optimization | MLIR transform dialect / pass infrastructure |
| Views (implicit array indexing) | linalg.generic affine indexing maps |
| mapGlobal/mapLocal/mapSeq | scf.parallel → gpu.launch lowering |
| Rise as MLIR dialect | Could import Rise patterns via Phase 8 plugin system |
| 5-stage pipeline | Stages 1-3 exist; views + barriers come with linalg |
Implications for Roadmap
- Phase 7: linalg.generic encodes the same pattern-based approach Lift pioneered; matmul(a,b) → linalg.matmul is Lift's approach realized via MLIR. The view system corresponds to linalg indexing maps.
- Phase 8: --mlir-plugin=path.dylib could load Rise dialect, enabling Lift-style functional patterns through existing MLIR lowering.
- Rewrite rules vs direct codegen: the current MLIR backend translates the AST directly to MLIR. Lift's insight is that composable rewrite rules (via the MLIR transform dialect) yield better GPU code than hardcoded lowering — which argues for emitting into linalg rather than going straight to scf/memref.
- No custom dialect needed: Rise validated that functional patterns lower entirely through existing dialects (linalg → scf → memref → llvm).
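How linalg.generic-style indexing maps drive a computation can be modeled in a few lines (names and structure here are illustrative, not the MLIR API): the op iterates an (i, j, k) space, and a per-operand affine map selects which element each operand reads or writes — matmul's maps being `(i,j,k) -> (i,k)`, `(i,j,k) -> (k,j)`, and `(i,j,k) -> (i,j)`.

```rust
// Model of linalg.generic matmul: indexing maps as plain functions
// over the iteration space, applied to pick each operand's element.
fn main() {
    let (m, n, k) = (2, 2, 2);
    let a = [[1, 2], [3, 4]];
    let b = [[5, 6], [7, 8]];
    let mut c = [[0i32; 2]; 2];
    let map_a = |i: usize, _j: usize, p: usize| (i, p); // (i,j,k) -> (i,k)
    let map_b = |_i: usize, j: usize, p: usize| (p, j); // (i,j,k) -> (k,j)
    let map_c = |i: usize, j: usize, _p: usize| (i, j); // (i,j,k) -> (i,j)
    for i in 0..m {
        for j in 0..n {
            for p in 0..k {
                let (ai, ak) = map_a(i, j, p);
                let (bk, bj) = map_b(i, j, p);
                let (ci, cj) = map_c(i, j, p);
                c[ci][cj] += a[ai][ak] * b[bk][bj];
            }
        }
    }
    assert_eq!(c, [[19, 22], [43, 50]]);
    println!("{:?}", c); // [[19, 22], [43, 50]]
}
```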
Phase 7 Design Notes
Planned dialect usage
- tensor dialect for ranked tensor types (tensor<NxMxf64>)
- linalg dialect for structured ops (linalg.matmul, linalg.generic)
- bufferization to convert tensor → memref before lowering
- New passes: one-shot-bufferize, convert-linalg-to-loops, lower-affine
GPU extension (future)
- MLIR gpu dialect for kernel launch / device memory
- Lowering: gpu-to-nvvm (NVIDIA) or gpu-to-rocdl (AMD)
- scf.parallel → gpu.launch transformation available in MLIR