@lee101
Created February 25, 2026 21:22

This file has been truncated; the full file is available on the gist.
CUDA Programming Guide
Release 13.1
NVIDIA Corporation
Dec 12, 2025
Contents

1 Introduction to CUDA
1.1 Introduction
1.1.1 The Graphics Processing Unit
1.1.2 The Benefits of Using GPUs
1.1.3 Getting Started Quickly
1.2 Programming Model
1.2.1 Heterogeneous Systems
1.2.2 GPU Hardware Model
1.2.2.1 Thread Blocks and Grids
1.2.2.2 Warps and SIMT
1.2.3 GPU Memory
1.2.3.1 DRAM Memory in Heterogeneous Systems
1.2.3.2 On-Chip Memory in GPUs
1.2.3.3 Unified Memory
1.3 The CUDA Platform
1.3.1 Compute Capability and Streaming Multiprocessor Versions
1.3.2 CUDA Toolkit and NVIDIA Driver
1.3.2.1 CUDA Runtime API and CUDA Driver API
1.3.3 Parallel Thread Execution (PTX)
1.3.4 Cubins and Fatbins
1.3.4.1 Binary Compatibility
1.3.4.2 PTX Compatibility
1.3.4.3 Just-in-Time Compilation
2 Programming GPUs in CUDA
2.1 Intro to CUDA C++
2.1.1 Compilation with NVCC
2.1.2 Kernels
2.1.2.1 Specifying Kernels
2.1.2.2 Launching Kernels
2.1.2.3 Thread and Grid Index Intrinsics
2.1.3 Memory in GPU Computing
2.1.3.1 Unified Memory
2.1.3.2 Explicit Memory Management
2.1.3.3 Memory Management and Application Performance
2.1.4 Synchronizing CPU and GPU
2.1.5 Putting it All Together
2.1.6 Runtime Initialization
2.1.7 Error Checking in CUDA
2.1.7.1 Error State
2.1.7.2 Asynchronous Errors
2.1.7.3 CUDA_LOG_FILE
2.1.8 Device and Host Functions
2.1.9 Variable Specifiers
2.1.9.1 Detecting Device Compilation
2.1.10 Thread Block Clusters
2.1.10.1 Launching with Clusters in Triple Chevron Notation
2.2 Writing CUDA SIMT Kernels
2.2.1 Basics of SIMT
2.2.2 Thread Hierarchy
2.2.3 GPU Device Memory Spaces
2.2.3.1 Global Memory
2.2.3.2 Shared Memory
2.2.3.3 Registers
2.2.3.4 Local Memory
2.2.3.5 Constant Memory
2.2.3.6 Caches
2.2.3.7 Texture and Surface Memory
2.2.3.8 Distributed Shared Memory
2.2.4 Memory Performance
2.2.4.1 Coalesced Global Memory Access
2.2.4.2 Shared Memory Access Patterns
2.2.5 Atomics
2.2.6 Cooperative Groups
2.2.7 Kernel Launch and Occupancy
2.3 Asynchronous Execution
2.3.1 What is Asynchronous Concurrent Execution?
2.3.2 CUDA Streams
2.3.2.1 Creating and Destroying CUDA Streams
2.3.2.2 Launching Kernels in CUDA Streams
2.3.2.3 Launching Memory Transfers in CUDA Streams
2.3.2.4 Stream Synchronization
2.3.3 CUDA Events
2.3.3.1 Creating and Destroying CUDA Events
2.3.3.2 Inserting Events into CUDA Streams
2.3.3.3 Timing Operations in CUDA Streams
2.3.3.4 Checking the Status of CUDA Events
2.3.4 Callback Functions from Streams
2.3.4.1 Using cudaStreamAddCallback()
2.3.4.2 Asynchronous Error Handling
2.3.5 CUDA Stream Ordering
2.3.6 Blocking and Non-Blocking Streams and the Default Stream
2.3.6.1 Legacy Default Stream
2.3.6.2 Per-Thread Default Stream
2.3.7 Explicit Synchronization
2.3.8 Implicit Synchronization
2.3.9 Miscellaneous and Advanced Topics
2.3.9.1 Stream Prioritization
2.3.9.2 Introduction to CUDA Graphs with Stream Capture
2.3.10 Summary of Asynchronous Execution
2.4 Unified and System Memory
2.4.1 Unified Virtual Address Space
2.4.2 Unified Memory
2.4.2.1 Unified Memory Paradigms
2.4.2.2 Full Unified Memory Feature Support
2.4.2.3 Limited Unified Memory Support
2.4.2.4 Memory Advise and Prefetch
2.4.3 Page-Locked Host Memory
2.4.3.1 Mapped Memory
2.4.4 Summary
2.5 NVCC: The NVIDIA CUDA Compiler
2.5.1 CUDA Source Files and Headers
2.5.2 NVCC Compilation Workflow
2.5.3 NVCC Basic Usage
2.5.3.1 NVCC PTX and Cubin Generation
2.5.3.2 Host Code Compilation Notes
2.5.3.3 Separate Compilation of GPU Code
2.5.4 Common Compiler Options
2.5.4.1 Language Features
2.5.4.2 Debugging Options
2.5.4.3 Optimization Options
2.5.4.4 Link-Time Optimization (LTO)
2.5.4.5 Profiling Options
2.5.4.6 Fatbin Compression
2.5.4.7 Compiler Performance Controls
3 Advanced CUDA
3.1 Advanced CUDA APIs and Features
3.1.1 cudaLaunchKernelEx
3.1.2 Launching Clusters
3.1.2.1 Launching with Clusters using cudaLaunchKernelEx
3.1.2.2 Blocks as Clusters
3.1.3 More on Streams and Events
3.1.3.1 Stream Priorities
3.1.3.2 Explicit Synchronization
3.1.3.3 Implicit Synchronization
3.1.4 Programmatic Dependent Kernel Launch
3.1.5 Batched Memory Transfers
3.1.6 Environment Variables
3.2 Advanced Kernel Programming
3.2.1 Using PTX
3.2.2 Hardware Implementation
3.2.2.1 SIMT Execution Model
3.2.2.2 Hardware Multithreading
3.2.2.3 Asynchronous Execution Features
3.2.3 Thread Scopes
3.2.4 Advanced Synchronization Primitives
3.2.4.1 Scoped Atomics
3.2.4.2 Asynchronous Barriers
3.2.4.3 Pipelines
3.2.5 Asynchronous Data Copies
3.2.6 Configuring L1/Shared Memory Balance
3.3 The CUDA Driver API
3.3.1 Context
3.3.2 Module
3.3.3 Kernel Execution
3.3.4 Interoperability between Runtime and Driver APIs
3.4 Programming Systems with Multiple GPUs
3.4.1 Multi-Device Context and Execution Management
3.4.1.1 Device Enumeration
3.4.1.2 Device Selection
3.4.1.3 Multi-Device Stream, Event, and Memory Copy Behavior
3.4.2 Multi-Device Peer-to-Peer Transfers and Memory Access
3.4.2.1 Peer-to-Peer Memory Transfers
3.4.2.2 Peer-to-Peer Memory Access
3.4.2.3 Peer-to-Peer Memory Consistency
3.4.2.4 Multi-Device Managed Memory
3.4.2.5 Host IOMMU Hardware, PCI Access Control Services, and VMs
3.5 A Tour of CUDA Features
3.5.1 Improving Kernel Performance
3.5.1.1 Asynchronous Barriers
3.5.1.2 Asynchronous Data Copies and the Tensor Memory Accelerator (TMA)
3.5.1.3 Pipelines
3.5.1.4 Work Stealing with Cluster Launch Control
3.5.2 Improving Latencies
3.5.2.1 Green Contexts
3.5.2.2 Stream-Ordered Memory Allocation
3.5.2.3 CUDA Graphs
3.5.2.4 Programmatic Dependent Launch
3.5.2.5 Lazy Loading
3.5.3 Functionality Features
3.5.3.1 Extended GPU Memory
3.5.3.2 Dynamic Parallelism
3.5.4 CUDA Interoperability
3.5.4.1 CUDA Interoperability with Other APIs
3.5.4.2 Interprocess Communication
3.5.5 Fine-Grained Control
3.5.5.1 Virtual Memory Management
3.5.5.2 Driver Entry Point Access
3.5.5.3 Error Log Management
4 CUDAFeatures 135
4.1 UnifiedMemory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
4.1.1 UnifiedMemoryonDeviceswithFullCUDAUnifiedMemorySupport. . . . . . . . . . 135
4.1.1.1 UnifiedMemory: In-DepthExamples . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
4.1.1.2 PerformanceTuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
4.1.2 UnifiedMemoryonDeviceswithonlyCUDAManagedMemorySupport . . . . . . . . 149
4.1.3 UnifiedMemoryonWindows,WSL,andTegra . . . . . . . . . . . . . . . . . . . . . . . . 150
4.1.3.1 Multi-GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
4.1.3.2 CoherencyandConcurrency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
4.1.3.3 StreamAssociatedUnifiedMemory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
4.1.4 PerformanceHints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
4.1.4.1 DataPrefetching. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
4.1.4.2 DataUsageHints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
4.1.4.3 QueryingDataUsageAttributesonManagedMemory . . . . . . . . . . . . . . . . 160
4.1.4.4 GPUMemoryOversubscription . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
4.2 CUDAGraphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
4.2.1 GraphStructure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
4.2.1.1 NodeTypes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
4.2.1.2 EdgeData . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
4.2.2 BuildingandRunningGraphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
4.2.2.1 GraphCreation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
4.2.2.2 GraphInstantiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
4.2.2.3 GraphExecution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
4.2.3 UpdatingInstantiatedGraphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
iv
4.2.3.1 WholeGraphUpdate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
4.2.3.2 IndividualNodeUpdate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
4.2.3.3 IndividualNodeEnable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
4.2.3.4 GraphUpdateLimitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
4.2.4 ConditionalGraphNodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
4.2.4.1 ConditionalHandles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
4.2.4.2 ConditionalNodeBodyGraphRequirements . . . . . . . . . . . . . . . . . . . . . . 177
4.2.4.3 ConditionalIFNodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
4.2.4.4 ConditionalWHILENodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
4.2.4.5 ConditionalSWITCHNodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
4.2.5 GraphMemoryNodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
4.2.5.1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
4.2.5.2 APIFundamentals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
4.2.5.3 OptimizedMemoryReuse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
4.2.5.4 PerformanceConsiderations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
4.2.5.5 PhysicalMemoryFootprint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
4.2.5.6 PeerAccess . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
4.2.6 DeviceGraphLaunch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
4.2.6.1 DeviceGraphCreation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
4.2.6.2 DeviceLaunch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
4.2.7 UsingGraphAPIs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
4.2.8 CUDAUserObjects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
4.3 Stream-OrderedMemoryAllocator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
4.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
4.3.2 MemoryManagement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
4.3.2.1 AllocatingMemory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
4.3.2.2 FreeingMemory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
4.3.3 MemoryPools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
4.3.3.1 Default/ImplicitPools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
4.3.3.2 ExplicitPools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
4.3.3.3 DeviceAccessibilityforMulti-GPUSupport . . . . . . . . . . . . . . . . . . . . . . . 213
4.3.3.4 EnablingMemoryPoolsforIPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
4.3.4 BestPracticesandTuning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
4.3.4.1 QueryforSupport . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
4.3.4.2 PhysicalPageCachingBehavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
4.3.4.3 ResourceUsageStatistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
4.3.4.4 MemoryReusePolicies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
4.3.4.5 SynchronizationAPIActions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
4.3.5 Addendums . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
4.3.5.1 cudaMemcpyAsyncCurrentContext/DeviceSensitivity . . . . . . . . . . . . . . . 221
4.3.5.2 cudaPointerGetAttributesQuery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
4.3.5.3 cudaGraphAddMemsetNode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
4.3.5.4 PointerAttributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
4.3.5.5 CPUVirtualMemory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
4.4 CooperativeGroups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
4.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
4.4.2 CooperativeGroupHandle&MemberFunctions . . . . . . . . . . . . . . . . . . . . . . . 222
4.4.3 DefaultBehavior/GrouplessExecution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
4.4.3.1 CreateImplicitGroupHandlesAsEarlyAsPossible . . . . . . . . . . . . . . . . . . 223
4.4.3.2 OnlyPassGroupHandlesbyReference . . . . . . . . . . . . . . . . . . . . . . . . . . 223
4.4.4 CreatingCooperativeGroups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
4.4.4.1 AvoidingGroupCreationHazards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
4.4.5 Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
4.4.5.1 Sync . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
v
4.4.5.2 Barriers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
4.4.6 CollectiveOperations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
4.4.6.1 Reduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
4.4.6.2 Scans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
4.4.6.3 InvokeOne . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
4.4.7 AsynchronousDataMovement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
4.4.7.1 MemcpyAsyncAlignmentRequirements . . . . . . . . . . . . . . . . . . . . . . . . 227
4.4.8 LargeScaleGroups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
4.4.8.1 WhentousecudaLaunchCooperativeKernel . . . . . . . . . . . . . . . . . . . . 228
4.5 ProgrammaticDependentLaunchandSynchronization . . . . . . . . . . . . . . . . . . . . 228
4.5.1 Background. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
4.5.2 APIDescription . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
4.5.3 UseinCUDAGraphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
4.6 GreenContexts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
4.6.1 Motivation/WhentoUse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
4.6.2 GreenContexts: Easeofuse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
4.6.3 GreenContexts: DeviceResourceandResourceDescriptor . . . . . . . . . . . . . . . . 236
4.6.4 GreenContextCreationExample . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
4.6.4.1 Step1: GetavailableGPUresources. . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
4.6.4.2 Step2: PartitionSMresources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
4.6.4.3 Step2(continued): Addworkqueueresources . . . . . . . . . . . . . . . . . . . . . 246
4.6.4.4 Step3: CreateaResourceDescriptor . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
4.6.4.5 Step4: CreateaGreenContext . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
4.6.5 GreenContexts-Launchingwork . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
4.6.6 AdditionalExecutionContextsAPIs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
4.6.7 GreenContextsExample . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
4.7 LazyLoading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
4.7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
4.7.2 ChangeHistory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
4.7.3 RequirementsforLazyLoading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
4.7.3.1 CUDARuntimeVersionRequirement . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
4.7.3.2 CUDADriverVersionRequirement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
4.7.3.3 CompilerRequirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
4.7.3.4 Kernel Requirements . . . . . . . . . . 253
4.7.4 Usage . . . . . . . . . . 253
4.7.4.1 Enabling & Disabling . . . . . . . . . . 253
4.7.4.2 Checking if Lazy Loading is Enabled at Runtime . . . . . . . . . . 253
4.7.4.3 Forcing a Module to Load Eagerly at Runtime . . . . . . . . . . 254
4.7.5 Potential Hazards . . . . . . . . . . 254
4.7.5.1 Impact on Concurrent Kernel Execution . . . . . . . . . . 254
4.7.5.2 Large Memory Allocations . . . . . . . . . . 254
4.7.5.3 Impact on Performance Measurements . . . . . . . . . . 254
4.8 Error Log Management . . . . . . . . . . 254
4.8.1 Background . . . . . . . . . . 255
4.8.2 Activation . . . . . . . . . . 255
4.8.3 Output . . . . . . . . . . 255
4.8.4 API Description . . . . . . . . . . 255
4.8.5 Limitations and Known Issues . . . . . . . . . . 256
4.9 Asynchronous Barriers . . . . . . . . . . 256
4.9.1 Initialization . . . . . . . . . . 257
4.9.2 A Barrier's Phase: Arrival, Countdown, Completion, and Reset . . . . . . . . . . 258
4.9.2.1 Warp Entanglement . . . . . . . . . . 259
4.9.3 Explicit Phase Tracking . . . . . . . . . . 259
4.9.4 Early Exit . . . . . . . . . . 262
4.9.5 Completion Function . . . . . . . . . . 264
4.9.6 Tracking Asynchronous Memory Operations . . . . . . . . . . 267
4.9.7 Producer-Consumer Pattern Using Barriers . . . . . . . . . . 268
4.10 Pipelines . . . . . . . . . . 275
4.10.1 Initialization . . . . . . . . . . 275
4.10.2 Submitting Work . . . . . . . . . . 276
4.10.3 Consuming Work . . . . . . . . . . 276
4.10.4 Warp Entanglement . . . . . . . . . . 276
4.10.5 Early Exit . . . . . . . . . . 277
4.10.6 Tracking Asynchronous Memory Operations . . . . . . . . . . 278
4.10.7 Producer-Consumer Pattern using Pipelines . . . . . . . . . . 281
4.11 Asynchronous Data Copies . . . . . . . . . . 283
4.11.1 Using LDGSTS . . . . . . . . . . 283
4.11.1.1 Batching Loads in Conditional Code . . . . . . . . . . 284
4.11.1.2 Prefetching Data . . . . . . . . . . 289
4.11.1.3 Producer-Consumer Pattern Through Warp Specialization . . . . . . . . . . 295
4.11.2 Using the Tensor Memory Accelerator (TMA) . . . . . . . . . . 299
4.11.2.1 Using TMA to transfer one-dimensional arrays . . . . . . . . . . 300
4.11.2.2 Using TMA to transfer multi-dimensional arrays . . . . . . . . . . 307
4.11.3 Using STAS . . . . . . . . . . 324
4.12 Work Stealing with Cluster Launch Control . . . . . . . . . . 327
4.12.1 API Details . . . . . . . . . . 329
4.12.1.1 Thread Block Cancellation . . . . . . . . . . 329
4.12.1.2 Constraints on Thread Block Cancellation . . . . . . . . . . 330
4.12.2 Example: Vector-Scalar Multiplication . . . . . . . . . . 331
4.12.2.1 Use-case: Thread Blocks . . . . . . . . . . 331
4.12.2.2 Use-case: Thread Block Clusters . . . . . . . . . . 333
4.13 L2 Cache Control . . . . . . . . . . 335
4.13.1 L2 Cache Set-Aside for Persisting Accesses . . . . . . . . . . 335
4.13.2 L2 Policy for Persisting Accesses . . . . . . . . . . 336
4.13.3 L2 Access Properties . . . . . . . . . . 337
4.13.4 L2 Persistence Example . . . . . . . . . . 337
4.13.5 Reset L2 Access to Normal . . . . . . . . . . 339
4.13.6 Manage Utilization of L2 Set-Aside Cache . . . . . . . . . . 339
4.13.7 Query L2 Cache Properties . . . . . . . . . . 339
4.13.8 Control L2 Cache Set-Aside Size for Persisting Memory Access . . . . . . . . . . 339
4.14 Memory Synchronization Domains . . . . . . . . . . 340
4.14.1 Memory Fence Interference . . . . . . . . . . 340
4.14.2 Isolating Traffic with Domains . . . . . . . . . . 341
4.14.3 Using Domains in CUDA . . . . . . . . . . 341
4.15 Interprocess Communication . . . . . . . . . . 342
4.15.1 IPC using the Legacy Interprocess Communication API . . . . . . . . . . 343
4.15.2 IPC using the Virtual Memory Management API . . . . . . . . . . 344
4.16 Virtual Memory Management . . . . . . . . . . 344
4.16.1 Preliminaries . . . . . . . . . . 345
4.16.1.1 Definitions . . . . . . . . . . 345
4.16.1.2 Query for Support . . . . . . . . . . 346
4.16.2 API Overview . . . . . . . . . . 347
4.16.3 Unicast Memory Sharing . . . . . . . . . . 349
4.16.3.1 Allocate and Export . . . . . . . . . . 349
4.16.3.2 Share and Import . . . . . . . . . . 351
4.16.3.3 Reserve and Map . . . . . . . . . . 355
4.16.3.4 Access Rights . . . . . . . . . . 356
4.16.3.5 Releasing the Memory . . . . . . . . . . 356
4.16.4 Multicast Memory Sharing . . . . . . . . . . 357
4.16.4.1 Allocating Multicast Objects . . . . . . . . . . 357
4.16.4.2 Add Devices to Multicast Objects . . . . . . . . . . 358
4.16.4.3 Bind Memory to Multicast Objects . . . . . . . . . . 358
4.16.4.4 Use Multicast Mappings . . . . . . . . . . 358
4.16.5 Advanced Configuration . . . . . . . . . . 359
4.16.5.1 Memory Type . . . . . . . . . . 359
4.16.5.2 Compressible Memory . . . . . . . . . . 359
4.16.5.3 Virtual Aliasing Support . . . . . . . . . . 360
4.16.5.4 OS-Specific Handle Details for IPC . . . . . . . . . . 361
4.17 Extended GPU Memory . . . . . . . . . . 362
4.17.1 Preliminaries . . . . . . . . . . 362
4.17.1.1 EGM Platforms: System Topology . . . . . . . . . . 362
4.17.1.2 Socket Identifiers: What Are They? How to Access Them? . . . . . . . . . . 363
4.17.1.3 Allocators and EGM Support . . . . . . . . . . 363
4.17.1.4 Memory Management Extensions to Current APIs . . . . . . . . . . 363
4.17.2 Using the EGM Interface . . . . . . . . . . 364
4.17.2.1 Single-Node, Single-GPU . . . . . . . . . . 364
4.17.2.2 Single-Node, Multi-GPU . . . . . . . . . . 364
4.17.2.3 Multi-Node, Multi-GPU . . . . . . . . . . 366
4.18 CUDA Dynamic Parallelism . . . . . . . . . . 367
4.18.1 Introduction . . . . . . . . . . 367
4.18.1.1 Overview . . . . . . . . . . 367
4.18.2 Execution Environment . . . . . . . . . . 367
4.18.2.1 Parent and Child Grids . . . . . . . . . . 367
4.18.2.2 Scope of CUDA Primitives . . . . . . . . . . 368
4.18.2.3 Streams and Events . . . . . . . . . . 368
4.18.2.4 Ordering and Concurrency . . . . . . . . . . 369
4.18.3 Memory Coherence and Consistency . . . . . . . . . . 369
4.18.3.1 Global Memory . . . . . . . . . . 369
4.18.3.2 Mapped Memory . . . . . . . . . . 370
4.18.3.3 Shared and Local Memory . . . . . . . . . . 370
4.18.3.4 Local Memory . . . . . . . . . . 371
4.18.4 Programming Interface . . . . . . . . . . 371
4.18.4.1 Basics . . . . . . . . . . 371
4.18.4.2 C++ Language Interface for CDP . . . . . . . . . . 372
4.18.5 Programming Guidelines . . . . . . . . . . 374
4.18.5.1 Performance . . . . . . . . . . 374
4.18.5.2 Implementation Restrictions and Limitations . . . . . . . . . . 374
4.18.5.3 Compatibility and Interoperability . . . . . . . . . . 375
4.18.6 Device-side Launch from PTX . . . . . . . . . . 375
4.18.6.1 Kernel Launch APIs . . . . . . . . . . 375
4.18.6.2 Parameter Buffer Layout . . . . . . . . . . 377
4.19 CUDA Interoperability with APIs . . . . . . . . . . 377
4.19.1 Graphics Interoperability . . . . . . . . . . 377
4.19.1.1 OpenGL Interoperability . . . . . . . . . . 378
4.19.1.2 Direct3D Interoperability . . . . . . . . . . 381
4.19.1.3 Interoperability in a Scalable Link Interface (SLI) Configuration . . . . . . . . . . 384
4.19.2 External Resource Interoperability . . . . . . . . . . 384
4.19.2.1 Vulkan Interoperability . . . . . . . . . . 385
4.19.2.2 Direct3D Interoperability . . . . . . . . . . 397
4.19.2.3 NVIDIA Software Communication Interface Interoperability (NVSCI) . . . . . . . . . . 404
4.20 Driver Entry Point Access . . . . . . . . . . 410
4.20.1 Introduction . . . . . . . . . . 410
4.20.2 Driver Function Typedefs . . . . . . . . . . 410
4.20.3 Driver Function Retrieval . . . . . . . . . . 411
4.20.3.1 Using the Driver API . . . . . . . . . . 412
4.20.3.2 Using the Runtime API . . . . . . . . . . 413
4.20.3.3 Retrieve Per-thread Default Stream Versions . . . . . . . . . . 413
4.20.3.4 Access New CUDA Features . . . . . . . . . . 414
4.20.4 Potential Implications with cuGetProcAddress . . . . . . . . . . 414
4.20.4.1 Implications with cuGetProcAddress vs Implicit Linking . . . . . . . . . . 414
4.20.4.2 Compile Time vs Runtime Version Usage in cuGetProcAddress . . . . . . . . . . 415
4.20.4.3 API Version Bumps with Explicit Version Checks . . . . . . . . . . 416
4.20.4.4 Issues with Runtime API Usage . . . . . . . . . . 417
4.20.4.5 Issues with Runtime API and Dynamic Versioning . . . . . . . . . . 418
4.20.4.6 Issues with Runtime API Allowing CUDA Version . . . . . . . . . . 419
4.20.4.7 Implications to API/ABI . . . . . . . . . . 419
4.20.5 Determining cuGetProcAddress Failure Reasons . . . . . . . . . . 420
5 Technical Appendices 423
5.1 Compute Capabilities . . . . . . . . . . 423
5.1.1 Obtain the GPU Compute Capability . . . . . . . . . . 423
5.1.2 Feature Availability . . . . . . . . . . 424
5.1.2.1 Architecture-Specific Features . . . . . . . . . . 424
5.1.2.2 Family-Specific Features . . . . . . . . . . 424
5.1.2.3 Feature Set Compiler Targets . . . . . . . . . . 424
5.1.3 Features and Technical Specifications . . . . . . . . . . 425
5.2 CUDA Environment Variables . . . . . . . . . . 431
5.2.1 Device Enumeration and Properties . . . . . . . . . . 431
5.2.1.1 CUDA_VISIBLE_DEVICES . . . . . . . . . . 431
5.2.1.2 CUDA_DEVICE_ORDER . . . . . . . . . . 432
5.2.1.3 CUDA_MANAGED_FORCE_DEVICE_ALLOC . . . . . . . . . . 432
5.2.2 JIT Compilation . . . . . . . . . . 432
5.2.2.1 CUDA_CACHE_DISABLE . . . . . . . . . . 432
5.2.2.2 CUDA_CACHE_PATH . . . . . . . . . . 433
5.2.2.3 CUDA_CACHE_MAXSIZE . . . . . . . . . . 433
5.2.2.4 CUDA_FORCE_PTX_JIT and CUDA_FORCE_JIT . . . . . . . . . . 433
5.2.2.5 CUDA_DISABLE_PTX_JIT and CUDA_DISABLE_JIT . . . . . . . . . . 434
5.2.2.6 CUDA_FORCE_PRELOAD_LIBRARIES . . . . . . . . . . 434
5.2.3 Execution . . . . . . . . . . 434
5.2.3.1 CUDA_LAUNCH_BLOCKING . . . . . . . . . . 434
5.2.3.2 CUDA_DEVICE_MAX_CONNECTIONS . . . . . . . . . . 435
5.2.3.3 CUDA_DEVICE_MAX_COPY_CONNECTIONS . . . . . . . . . . 435
5.2.3.4 CUDA_SCALE_LAUNCH_QUEUES . . . . . . . . . . 435
5.2.3.5 CUDA_GRAPHS_USE_NODE_PRIORITY . . . . . . . . . . 436
5.2.3.6 CUDA_DEVICE_WAITS_ON_EXCEPTION . . . . . . . . . . 436
5.2.3.7 CUDA_DEVICE_DEFAULT_PERSISTING_L2_CACHE_PERCENTAGE_LIMIT . . . . . . . . . . 436
5.2.3.8 CUDA_AUTO_BOOST [[deprecated]] . . . . . . . . . . 437
5.2.4 Module Loading . . . . . . . . . . 437
5.2.4.1 CUDA_MODULE_LOADING . . . . . . . . . . 437
5.2.4.2 CUDA_MODULE_DATA_LOADING . . . . . . . . . . 438
5.2.4.3 CUDA_BINARY_LOADER_THREAD_COUNT . . . . . . . . . . 438
5.2.5 CUDA Error Log Management . . . . . . . . . . 438
5.2.5.1 CUDA_LOG_FILE . . . . . . . . . . 438
5.3 C++ Language Support . . . . . . . . . . 439
5.3.1 C++11 Language Features . . . . . . . . . . 439
5.3.2 C++14 Language Features . . . . . . . . . . 442
5.3.3 C++17 Language Features . . . . . . . . . . 442
5.3.4 C++20 Language Features . . . . . . . . . . 444
5.3.5 CUDA C++ Standard Library . . . . . . . . . . 446
5.3.6 C Standard Library Functions . . . . . . . . . . 447
5.3.6.1 clock() and clock64() . . . . . . . . . . 447
5.3.6.2 printf() . . . . . . . . . . 447
5.3.6.3 memcpy() and memset() . . . . . . . . . . 450
5.3.6.4 malloc() and free() . . . . . . . . . . 450
5.3.6.5 alloca() . . . . . . . . . . 454
5.3.7 Lambda Expressions . . . . . . . . . . 454
5.3.7.1 Lambda Expressions and __global__ Function Parameters . . . . . . . . . . 455
5.3.7.2 Extended Lambdas . . . . . . . . . . 455
5.3.7.3 Extended Lambda Type Traits . . . . . . . . . . 456
5.3.7.4 Extended Lambda Restrictions . . . . . . . . . . 458
5.3.7.5 Host-Device Lambda Optimization Notes . . . . . . . . . . 468
5.3.7.6 *this Capture By-Value . . . . . . . . . . 468
5.3.7.7 Argument Dependent Lookup (ADL) . . . . . . . . . . 470
5.3.8 Polymorphic Function Wrappers . . . . . . . . . . 471
5.3.9 C/C++ Language Restrictions . . . . . . . . . . 474
5.3.9.1 Unsupported Features . . . . . . . . . . 474
5.3.9.2 Namespace Reservations . . . . . . . . . . 474
5.3.9.3 Pointers and Memory Addresses . . . . . . . . . . 475
5.3.9.4 Variables . . . . . . . . . . 475
5.3.9.5 Functions . . . . . . . . . . 481
5.3.9.6 Classes . . . . . . . . . . 483
5.3.9.7 Templates . . . . . . . . . . 487
5.3.10 C++11 Restrictions . . . . . . . . . . 488
5.3.10.1 inline Namespaces . . . . . . . . . . 488
5.3.10.2 inline Unnamed Namespaces . . . . . . . . . . 488
5.3.10.3 constexpr Functions . . . . . . . . . . 488
5.3.10.4 constexpr Variables . . . . . . . . . . 491
5.3.10.5 __global__ Variadic Template . . . . . . . . . . 492
5.3.10.6 Defaulted Functions = default . . . . . . . . . . 492
5.3.10.7 [cuda::]std::initializer_list . . . . . . . . . . 493
5.3.10.8 [cuda::]std::move, [cuda::]std::forward . . . . . . . . . . 494
5.3.11 C++14 Restrictions . . . . . . . . . . 494
5.3.11.1 Functions with Deduced Return Type . . . . . . . . . . 494
5.3.11.2 Variable Templates . . . . . . . . . . 495
5.3.12 C++17 Restrictions . . . . . . . . . . 496
5.3.12.1 inline Variables . . . . . . . . . . 496
5.3.12.2 Structured Binding . . . . . . . . . . 496
5.3.13 C++20 Restrictions . . . . . . . . . . 497
5.3.13.1 Three-way Comparison Operator . . . . . . . . . . 497
5.3.13.2 consteval Functions . . . . . . . . . . 497
5.4 C/C++ Language Extensions . . . . . . . . . . 498
5.4.1 Function and Variable Annotations . . . . . . . . . . 498
5.4.1.1 Execution Space Specifiers . . . . . . . . . . 498
5.4.1.2 Memory Space Specifiers . . . . . . . . . . 498
5.4.1.3 Inlining Specifiers . . . . . . . . . . 502
5.4.1.4 __restrict__ Pointers . . . . . . . . . . 502
5.4.1.5 __grid_constant__ Parameters . . . . . . . . . . 503
5.4.1.6 Annotation Summary . . . . . . . . . . 504
5.4.2 Built-in Types and Variables . . . . . . . . . . 505
5.4.2.1 Host Compiler Type Extensions . . . . . . . . . . 505
5.4.2.2 Built-in Variables . . . . . . . . . . 505
5.4.2.3 Built-in Types . . . . . . . . . . 506
5.4.3 Kernel Configuration . . . . . . . . . . 508
5.4.3.1 Thread Block Cluster . . . . . . . . . . 508
5.4.3.2 Launch Bounds . . . . . . . . . . 509
5.4.3.3 Maximum Number of Registers per Thread . . . . . . . . . . 511
5.4.4 Synchronization Primitives . . . . . . . . . . 512
5.4.4.1 Thread Block Synchronization Functions . . . . . . . . . . 512
5.4.4.2 Warp Synchronization Function . . . . . . . . . . 514
5.4.4.3 Memory Fence Functions . . . . . . . . . . 514
5.4.5 Atomic Functions . . . . . . . . . . 519
5.4.5.1 Legacy Atomic Functions . . . . . . . . . . 520
5.4.5.2 Built-in Atomic Functions . . . . . . . . . . 525
5.4.6 Warp Functions . . . . . . . . . . 531
5.4.6.1 Warp Active Mask . . . . . . . . . . 531
5.4.6.2 Warp Vote Functions . . . . . . . . . . 532
5.4.6.3 Warp Match Functions . . . . . . . . . . 532
5.4.6.4 Warp Reduce Functions . . . . . . . . . . 533
5.4.6.5 Warp Shuffle Functions . . . . . . . . . . 534
5.4.6.6 Warp __sync Intrinsic Constraints . . . . . . . . . . 539
5.4.7 CUDA-Specific Macros . . . . . . . . . . 541
5.4.7.1 __CUDA_ARCH__ . . . . . . . . . . 541
5.4.7.2 __CUDA_ARCH_SPECIFIC__ and __CUDA_ARCH_FAMILY_SPECIFIC__ . . . . . . . . . . 543
5.4.7.3 CUDA Feature Testing Macros . . . . . . . . . . 544
5.4.7.4 __nv_pure__ Attribute . . . . . . . . . . 544
5.4.8 CUDA-Specific Functions . . . . . . . . . . 544
5.4.8.1 Address Space Predicate Functions . . . . . . . . . . 544
5.4.8.2 Address Space Conversion Functions . . . . . . . . . . 545
5.4.8.3 Low-Level Load and Store Functions . . . . . . . . . . 546
5.4.8.4 __trap() . . . . . . . . . . 546
5.4.8.5 __nanosleep() . . . . . . . . . . 547
5.4.8.6 Dynamic Programming eXtension (DPX) Instructions . . . . . . . . . . 547
5.4.9 Compiler Optimization Hints . . . . . . . . . . 549
5.4.9.1 #pragma unroll . . . . . . . . . . 550
5.4.9.2 __builtin_assume_aligned() . . . . . . . . . . 551
5.4.9.3 __builtin_assume() and __assume() . . . . . . . . . . 551
5.4.9.4 __builtin_expect() . . . . . . . . . . 551
5.4.9.5 __builtin_unreachable() . . . . . . . . . . 552
5.4.9.6 Custom ABI Pragmas . . . . . . . . . . 552
5.4.10 Debugging and Diagnostics . . . . . . . . . . 554
5.4.10.1 Assertion . . . . . . . . . . 554
5.4.10.2 Breakpoint Function . . . . . . . . . . 555
5.4.10.3 Diagnostic Pragmas . . . . . . . . . . 555
5.4.11 Warp Matrix Functions . . . . . . . . . . 556
5.4.11.1 Description . . . . . . . . . . 556
5.4.11.2 Alternate Floating Point . . . . . . . . . . 558
5.4.11.3 Double Precision . . . . . . . . . . 558
5.4.11.4 Sub-byte Operations . . . . . . . . . . 559
5.4.11.5 Restrictions . . . . . . . . . . 560
5.4.11.6 Element Types and Matrix Sizes . . . . . . . . . . 560
5.4.11.7 Example . . . . . . . . . . 562
5.5 Floating-Point Computation . . . . . . . . . . 562
5.5.1 Floating-Point Introduction . . . . . . . . . . 562
5.5.1.1 Floating-Point Format . . . . . . . . . . 562
5.5.1.2 Normal and Subnormal Values . . . . . . . . . . 564
5.5.1.3 Special Values . . . . . . . . . . 564
5.5.1.4 Associativity . . . . . . . . . . 565
5.5.1.5 Fused Multiply-Add (FMA) . . . . . . . . . . 566
5.5.1.6 Dot Product Example . . . . . . . . . . 568
5.5.1.7 Rounding . . . . . . . . . . 568
5.5.1.8 Notes on Host/Device Computation Accuracy . . . . . . . . . . 569
5.5.2 Floating-Point Data Types . . . . . . . . . . 570
5.5.3 CUDA and IEEE-754 Compliance . . . . . . . . . . 572
5.5.4 CUDA and C/C++ Compliance . . . . . . . . . . 573
5.5.5 Floating-Point Functionality Exposure . . . . . . . . . . 574
5.5.6 Built-In Arithmetic Operators . . . . . . . . . . 577
5.5.7 CUDA C++ Mathematical Standard Library Functions . . . . . . . . . . 578
5.5.7.1 Basic Operations . . . . . . . . . . 578
5.5.7.2 Exponential Functions . . . . . . . . . . 579
5.5.7.3 Power Functions . . . . . . . . . . 580
5.5.7.4 Trigonometric Functions . . . . . . . . . . 581
5.5.7.5 Hyperbolic Functions . . . . . . . . . . 581
5.5.7.6 Error and Gamma Functions . . . . . . . . . . 582
5.5.7.7 Nearest Integer Floating-Point Operations . . . . . . . . . . 583
5.5.7.8 Floating-Point Manipulation Functions . . . . . . . . . . 583
5.5.7.9 Classification and Comparison . . . . . . . . . . 584
5.5.8 Non-Standard CUDA Mathematical Functions . . . . . . . . . . 585
5.5.9 Intrinsic Functions . . . . . . . . . . 587
5.5.9.1 Basic Intrinsic Functions . . . . . . . . . . 587
5.5.9.2 Single-Precision-Only Intrinsic Functions . . . . . . . . . . 588
5.5.9.3 --use_fast_math Effect . . . . . . . . . . 589
5.5.10 References . . . . . . . . . . 589
5.6 Device-Callable APIs and Intrinsics . . . . . . . . . . 590
5.6.1 Memory Barrier Primitives Interface . . . . . . . . . . 590
5.6.1.1 Data Types . . . . . . . . . . 590
5.6.1.2 Memory Barrier Primitives API . . . . . . . . . . 590
5.6.2 Pipeline Primitives Interface . . . . . . . . . . 591
5.6.2.1 memcpy_async Primitive . . . . . . . . . . 592
5.6.2.2 Commit Primitive . . . . . . . . . . 592
5.6.2.3 Wait Primitive . . . . . . . . . . 592
5.6.2.4 Arrive On Barrier Primitive . . . . . . . . . . 592
5.6.3 Cooperative Groups API . . . . . . . . . . 593
5.6.3.1 cooperative_groups.h . . . . . . . . . . 593
5.6.3.2 cooperative_groups/async.h . . . . . . . . . . 599
5.6.3.3 cooperative_groups/partition.h . . . . . . . . . . 602
5.6.3.4 cooperative_groups/reduce.h . . . . . . . . . . 603
5.6.3.5 cooperative_groups/scan.h . . . . . . . . . . 606
5.6.3.6 cooperative_groups/sync.h . . . . . . . . . . 609
5.6.4 CUDA Device Runtime . . . . . . . . . . 611
5.6.4.1 Including Device Runtime API in CUDA Code . . . . . . . . . . 612
5.6.4.2 Memory in the CUDA Device Runtime . . . . . . . . . . 612
5.6.4.3 SM Id and Warp Id . . . . . . . . . . 614
5.6.4.4 Launch Setup APIs . . . . . . . . . . 614
5.6.4.5 Device Management . . . . . . . . . . 615
5.6.4.6 API Reference . . . . . . . . . . 615
5.6.4.7 API Errors and Launch Failures . . . . . . . . . . 617
5.6.4.8 Device Runtime Streams . . . . . . . . . . 617
5.6.4.9 ECC Errors . . . . . . . . . . 619
6 Notices 621
6.1 Notice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 621
6.2 OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 622
6.3 Trademarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 622
CUDA Programming Guide, Release 13.1
CUDA and the CUDA Programming Guide
CUDA is a parallel computing platform and programming model developed by NVIDIA that enables dramatic increases in computing performance by harnessing the power of the GPU. It allows developers to accelerate compute-intensive applications and is widely used in fields such as deep learning, scientific computing, and high-performance computing (HPC).
This CUDA Programming Guide is the official, comprehensive resource on the CUDA programming model and how to write code that executes on the GPU using the CUDA platform. This guide covers everything from the CUDA programming model and the CUDA platform to the details of language extensions, and explains how to make use of specific hardware and software features. This guide provides a pathway for developers to learn CUDA if they are new, and also provides an essential resource for developers as they build applications using CUDA.
Organization of This Guide
Even for developers who primarily use libraries, frameworks, or DSLs, an understanding of the CUDA programming model and how GPUs execute code is valuable in knowing what is happening behind the layers of abstraction. This guide starts with a chapter on the CUDA programming model outside of any specific programming language, which is applicable to anyone interested in understanding how CUDA works, even non-developers.
The guide is broken down into five primary parts:
▶ Part 1: Introduction and Programming Model Abstract
  ▶ A language-agnostic overview of the CUDA programming model as well as a brief tour of the CUDA platform.
  ▶ This section is meant to be read by anyone wanting to understand GPUs and the concepts of executing code on GPUs, even if they are not developers.
▶ Part 2: Programming GPUs in CUDA
  ▶ The basics of programming GPUs using CUDA C++.
  ▶ This section is meant to be read by anyone wanting to get started in GPU programming.
  ▶ This section is meant to be instructional, not complete, and teaches the most important and common parts of CUDA programming, including some common performance considerations.
▶ Part 3: Advanced CUDA
  ▶ Introduces some more advanced features of CUDA that enable both fine-grained control and more opportunities to maximize performance, including the use of multiple GPUs in a single application.
  ▶ This section concludes with a tour of the features covered in Part 4 with a brief introduction to the purpose and function of each, sorted by when and why a developer may find each feature useful.
▶ Part 4: CUDA Features
  ▶ This section contains complete coverage of specific CUDA features such as CUDA graphs, dynamic parallelism, interoperability with graphics APIs, and unified memory.
  ▶ This section should be consulted when knowing the complete picture of a specific CUDA feature is needed. Where possible, care has been taken to introduce and motivate the features covered in this section in earlier sections.
▶ Part 5: Technical Appendices
  ▶ The technical appendices provide some reference documentation on CUDA's C++ high-level language support, hardware-specific specifications, and other technical specifications.
  ▶ This section is meant as a technical reference for specific descriptions of the syntax, semantics, and technical behavior of elements of CUDA.
Parts 1-3 provide a guided learning experience for developers new to CUDA, though they also provide insight and updated information useful for CUDA developers of any experience level.
Parts 4 and 5 provide a wealth of information about specific features and detailed topics, and are intended to provide a curated, well-organized reference for developers needing to know more details as they write CUDA applications.
Chapter 1. Introduction to CUDA
1.1. Introduction
1.1.1. The Graphics Processing Unit
Born as a special-purpose processor for 3D graphics, the Graphics Processing Unit (GPU) started out as fixed-function hardware to accelerate parallel operations in real-time 3D rendering. Over successive generations, GPUs became more programmable. By 2003, some stages of the graphics pipeline became fully programmable, running custom code in parallel for each component of a 3D scene or an image.
In 2006, NVIDIA introduced the Compute Unified Device Architecture (CUDA) to enable any computational workload to use the throughput capability of GPUs independent of graphics APIs.
Since then, CUDA and GPU computing have been used to accelerate computational workloads of nearly every type, from scientific simulations such as fluid dynamics or energy transport to business applications like databases and analytics. Moreover, the capability and programmability of GPUs has been foundational to the advancement of new algorithms and technologies ranging from image classification to generative artificial intelligence such as diffusion or large language models.
1.1.2. The Benefits of Using GPUs
A GPU provides much higher instruction throughput and memory bandwidth than a CPU within a similar price and power envelope. Many applications leverage these capabilities to run significantly faster on the GPU than on the CPU (see GPU Applications). Other computing devices, like FPGAs, are also very energy efficient, but offer much less programming flexibility than GPUs.
GPUs and CPUs are designed with different goals in mind. While a CPU is designed to excel at executing a serial sequence of operations (called a thread) as fast as possible and can execute a few tens of these threads in parallel, a GPU is designed to excel at executing thousands of threads in parallel, trading off lower single-thread performance to achieve much greater total throughput.
GPUs are specialized for highly parallel computations and devote more transistors to data processing units, while CPUs dedicate more transistors to data caching and flow control. Figure 1 shows an example distribution of chip resources for a CPU versus a GPU.
Figure 1: The GPU Devotes More Transistors to Data Processing
1.1.3. Getting Started Quickly
There are many ways to leverage the compute power provided by GPUs. This guide covers programming for the CUDA GPU platform in high-level languages such as C++. However, there are many ways to utilize GPUs in applications that do not require directly writing GPU code.
An ever-growing collection of algorithms and routines from a variety of domains is available through specialized libraries. When a library has already been implemented, especially those provided by NVIDIA, using it is often more productive and performant than reimplementing algorithms from scratch. Libraries like cuBLAS, cuFFT, cuDNN, and CUTLASS are just a few examples of libraries that help developers avoid reimplementing well-established algorithms. These libraries have the added benefit of being optimized for each GPU architecture, providing an ideal mix of productivity, performance, and portability.
There are also frameworks, particularly those used for artificial intelligence, that provide GPU-accelerated building blocks. Many of these frameworks achieve their acceleration by leveraging the GPU-accelerated libraries mentioned above.
Additionally, domain-specific languages (DSLs) such as NVIDIA's Warp or OpenAI's Triton compile to run directly on the CUDA platform. This provides an even higher-level method of programming GPUs than the high-level languages covered in this guide.
The NVIDIA Accelerated Computing Hub contains resources, examples, and tutorials to teach GPU and CUDA computing.
1.2. Programming Model
This chapter introduces the CUDA programming model at a high level and separate from any language. The terminology and concepts introduced here apply to CUDA in any supported programming language. Later chapters will illustrate these concepts in C++.
1.2.1. Heterogeneous Systems
The CUDA programming model assumes a heterogeneous computing system, which means a system that includes both GPUs and CPUs. The CPU and the memory directly connected to it are called the host and host memory, respectively. A GPU and the memory directly connected to it are referred to as the device and device memory, respectively. In some system-on-chip (SoC) systems, these may be part of a single package. In larger systems, there may be multiple CPUs or GPUs.
CUDA applications execute some part of their code on the GPU, but applications always start execution on the CPU. The host code, which is the code that runs on the CPU, can use CUDA APIs to copy data between the host memory and device memory, start code executing on the GPU, and wait for data copies or GPU code to complete. The CPU and GPU can both be executing code simultaneously, and best performance is usually found by maximizing utilization of both CPUs and GPUs.
The code an application executes on the GPU is referred to as device code, and a function that is invoked for execution on the GPU is, for historical reasons, called a kernel. The act of starting a kernel running is called launching the kernel. A kernel launch can be thought of as starting many threads executing the kernel code in parallel on the GPU. GPU threads operate similarly to threads on CPUs, though there are some differences important to both correctness and performance that will be covered in later sections (see Section 3.2.2.1.1).
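The host-side workflow described above (allocate, copy, launch, wait) can be sketched as follows. This is a minimal illustrative example, not taken from the guide; the kernel name and sizes are arbitrary:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Device code: each thread scales one element of the array.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1024;
    float host[n];
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    // Allocate device memory and copy input from host memory to device memory.
    float *dev;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    // Launch the kernel: 4 blocks of 256 threads each.
    scale<<<4, 256>>>(dev, 2.0f, n);

    // Copy the result back; this copy waits for the kernel to complete.
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    printf("host[0] = %f\n", host[0]);
    return 0;
}
```

The launch is asynchronous with respect to the CPU, so the host could do other work between the launch and the device-to-host copy.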
1.2.2. GPU Hardware Model
Like any programming model, CUDA relies on a conceptual model of the underlying hardware. For the purposes of CUDA programming, the GPU can be considered to be a collection of Streaming Multiprocessors (SMs) which are organized into groups called Graphics Processing Clusters (GPCs). Each SM contains a local register file, a unified data cache, and a number of functional units that perform computations. The unified data cache provides the physical resources for shared memory and L1 cache. The allocation of the unified data cache to L1 and shared memory can be configured at runtime. The sizes of different types of memory and the number of functional units within an SM can vary across GPU architectures.
Note
The actual hardware layout of a GPU or the way it physically carries out the execution of the programming model may vary. These differences do not affect correctness of software written using the CUDA programming model.
1.2.2.1 Thread Blocks and Grids
When an application launches a kernel, it does so with many threads, often millions of threads. These threads are organized into blocks. A block of threads is referred to, perhaps unsurprisingly, as a thread block. Thread blocks are organized into a grid. All the thread blocks in a grid have the same size and dimensions. Figure 3 shows an illustration of a grid of thread blocks.
Thread blocks and grids may be 1, 2, or 3 dimensional. These dimensions can simplify mapping of individual threads to units of work or data items.
When a kernel is launched, it is launched using a specific execution configuration which specifies the grid and thread block dimensions. The execution configuration may also include optional parameters such as cluster size, stream, and SM configuration settings, which will be introduced in later sections.
Using built-in variables, each thread executing the kernel can determine its location within its containing block and the location of its block within the containing grid. A thread can also use these
Figure 2: A GPU has many streaming multiprocessors (SMs), each of which contains many functional units. Graphics processing clusters (GPCs) are collections of SMs. A GPU is a set of GPCs connected to the GPU memory. A CPU typically has several cores and a memory controller which connects to the system memory. A CPU and a GPU are connected by an interconnect such as PCIe or NVLINK.
Figure 3: Grid of Thread Blocks. Each arrow represents a thread (the number of arrows is not representative of the actual number of threads).
built-in variables to determine the dimensions of the thread blocks and the grid on which the kernel was launched. This gives each thread a unique identity among all the threads running the kernel. This identity is frequently used to determine what data or operations a thread is responsible for.
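A common pattern (a sketch for illustration, not taken from the guide) combines the built-in variables threadIdx, blockIdx, and blockDim to compute a unique global index per thread and map it onto a data item:

```cuda
__global__ void addOne(int *data, int n) {
    // Unique identity: position within the block plus the block's
    // offset within the grid.
    int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;

    // A guard is needed when n is not a multiple of the block size,
    // since the grid may contain more threads than data items.
    if (globalIdx < n) {
        data[globalIdx] += 1;
    }
}
```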
All threads of a thread block are executed in a single SM. This allows threads within a thread block to communicate and synchronize with each other efficiently. Threads within a thread block all have access to the on-chip shared memory, which can be used for exchanging information between threads of a thread block.
A grid may consist of millions of thread blocks, while the GPU executing the grid may have only tens or hundreds of SMs. All threads of a thread block are executed by a single SM and, in most cases¹, run to completion on that SM. There is no guarantee of scheduling between thread blocks, so a thread block cannot rely on results from other thread blocks, as they may not be able to be scheduled until that thread block has completed. Figure 4 shows an example of how thread blocks from a grid are assigned to an SM.
The CUDA programming model enables arbitrarily large grids to run on GPUs of any size, whether it has only one SM or thousands of SMs. To achieve this, the CUDA programming model, with some exceptions, requires that there be no data dependencies between threads in different thread blocks. That is, a thread should not depend on results from or synchronize with a thread in a different thread block of the same grid. All the threads within a thread block run on the same SM at the same time. Different thread blocks within the grid are scheduled among the available SMs and may be executed in any order. In short, the CUDA programming model requires that it be possible to execute thread blocks in any order, in parallel or in series.
1.2.2.1.1 Thread Block Clusters
In addition to thread blocks, GPUs with compute capability 9.0 and higher have an optional level of grouping called clusters. Clusters are a group of thread blocks which, like thread blocks and grids, can be laid out in 1, 2, or 3 dimensions. Figure 5 illustrates a grid of thread blocks that is also organized into clusters. Specifying clusters does not change the grid dimensions or the indices of a thread block within a grid.
Specifying clusters groups adjacent thread blocks into clusters and provides some additional opportunities for synchronization and communication at the cluster level. Specifically, all thread blocks in a cluster are executed in a single GPC. Figure 6 shows how thread blocks are scheduled to SMs in a GPC when clusters are specified. Because the thread blocks are scheduled simultaneously and within a single GPC, threads in different blocks but within the same cluster can communicate and synchronize with each other using software interfaces provided by Cooperative Groups. Threads in clusters can access the shared memory of all blocks in the cluster, which is referred to as distributed shared memory. The maximum size of a cluster is hardware dependent and varies between devices.
Figure 6 illustrates how thread blocks within a cluster are scheduled simultaneously on SMs within a GPC. Thread blocks within a cluster are always adjacent to each other within the grid.
1.2.2.2 Warps and SIMT
Within a thread block, threads are organized into groups of 32 threads called warps. A warp executes the kernel code in a Single-Instruction Multiple-Threads (SIMT) paradigm. In SIMT, all threads in the warp are executing the same kernel code, but each thread may follow different branches through the code. That is, though all threads of the program execute the same code, threads do not need to follow the same execution path.
¹ In certain situations when using features such as CUDA Dynamic Parallelism, a thread block may be suspended to memory. This means the state of the SM is stored to a system-managed area of GPU memory and the SM is freed to execute other thread blocks. This is similar to context swapping on CPUs. This is not common.
Figure 4: Each SM has one or more active thread blocks. In this example, each SM has three thread blocks scheduled simultaneously. There are no guarantees about the order in which thread blocks from a grid are assigned to SMs.
Figure 5: When clusters are specified, thread blocks are in the same location in the grid but also have a position within the containing cluster.
When threads are executed by a warp, they are assigned a warp lane. Warp lanes are numbered 0 to 31 and threads from a thread block are assigned to warps in a predictable fashion detailed in Hardware Multithreading.
All threads in the warp execute the same instruction simultaneously. If some threads within a warp follow a control flow branch in execution while others do not, the threads which do not follow the branch will be masked off while the threads which follow the branch are executed. For example, if a conditional is only true for half the threads in a warp, the other half of the warp would be masked off while the active threads execute those instructions. This situation is illustrated in Figure 7. When different threads in a warp follow different code paths, this is sometimes called warp divergence. It follows that utilization of the GPU is maximized when threads within a warp follow the same control flow path.
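A divergent branch of the kind illustrated in Figure 7 can be sketched as follows (an illustrative example, not from the guide): even-numbered lanes take one path while the odd-numbered lanes in the same warp are masked off, and vice versa:

```cuda
__global__ void divergentKernel(int *out) {
    int tid = threadIdx.x;
    if (tid % 2 == 0) {
        // Only even lanes are active here; the odd lanes of the same
        // warp are masked off until this body completes.
        out[tid] = tid * 2;
    } else {
        // Now the odd lanes execute while the even lanes are masked off.
        out[tid] = tid;
    }
    // All lanes of the warp have reconverged at this point.
}
```

Both halves of the warp still occupy the warp's issue slots for both branch bodies, which is why minimizing divergence maximizes utilization.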
In the SIMT model, all threads in a warp progress through the kernel in lock step. Hardware execu-
tionmaydiffer. SeethesectionsonIndependentThreadExecutionformoreinformationonwherethis
distinctionisimportant. Exploitingknowledgeofhowwarpexecutionisactuallymappedtorealhard-
wareisdiscouraged. TheCUDAprogrammingmodelandSIMTsaythatallthreadsinawarpprogress
throughthecodetogether. Hardwaremayoptimizemaskedlanesinwaysthataretransparenttothe
programso long as the programming model is followed. If the program violates this model, this can
resultinundefinedbehaviorthatcanbedifferentindifferentGPUhardware.
WhileitisnotnecessarytoconsiderwarpswhenwritingCUDAcode,understandingthewarpexecution
modelishelpfulinunderstandingconceptssuchasglobalmemorycoalescingandsharedmemorybank
accesspatterns. Someadvancedprogrammingtechniquesusespecializationofwarpswithinathread
block to limit thread divergence and maximize utilization. This and other optimizations make use of
theknowledgethatthreadsaregroupedintowarpswhenexecuting.
One implication of warp execution is that thread blocks are best specified to have a total number of threads which is a multiple of 32. It is legal to use any number of threads, but when the total is not a multiple of 32, the last warp of the thread block will have some lanes that are unused throughout execution. This will likely lead to suboptimal functional unit utilization and memory access for that warp.
SIMT is often compared to Single Instruction Multiple Data (SIMD) parallelism, but there are some important differences. In SIMD, execution follows a single control flow path, while in SIMT, each thread is allowed to follow its own control flow path. Because of this, SIMT does not have a fixed data-width like SIMD. A more detailed discussion of SIMT can be found in SIMT Execution Model.
Figure 6: When clusters are specified, the thread blocks in a cluster are arranged in their cluster shape within the grid. The thread blocks of a cluster are scheduled simultaneously on the SMs of a single GPC.
Figure 7: In this example, only threads with an even thread index execute the body of the if statement; the others are masked off while the body is executed.
1.2.3. GPU Memory
In modern computing systems, efficiently utilizing memory is just as important as maximizing the use of functional units performing computations. Heterogeneous systems have multiple memory spaces, and GPUs contain various types of programmable on-chip memory in addition to caches. The following sections introduce these memory spaces in more detail.
1.2.3.1 DRAM Memory in Heterogeneous Systems
GPUs and CPUs both have directly attached DRAM chips. In systems with more than one GPU, each GPU has its own memory. From the perspective of device code, the DRAM attached to the GPU is called global memory, because it is accessible to all SMs in the GPU. This terminology does not mean it is necessarily accessible everywhere within the system. The DRAM attached to the CPU(s) is called system memory or host memory.
Like CPUs, GPUs use virtual memory addressing. On all currently-supported systems, the CPU and GPU use a single unified virtual memory space. This means that the virtual memory address range for each GPU in the system is unique and distinct from the CPU and every other GPU in the system. For a given virtual memory address, it is possible to determine whether that address is in GPU memory or system memory and, on systems with multiple GPUs, which GPU memory contains that address.
There are CUDA APIs to allocate GPU memory, CPU memory, and to copy between allocations on the CPU and GPU, within a GPU, or between GPUs in multi-GPU systems. The locality of data can be explicitly controlled when desired. Unified Memory, discussed below, allows the placement of memory to be handled automatically by the CUDA runtime or system hardware.
1.2.3.2 On-Chip Memory in GPUs
In addition to the global memory, each GPU has some on-chip memory. Each SM has its own register file and shared memory. These memories are part of the SM and can be accessed extremely quickly from threads executing within the SM, but they are not accessible to threads running in other SMs.
The register file stores thread-local variables which are usually allocated by the compiler. The shared memory is accessible by all threads within a thread block or cluster. Shared memory can be used for exchanging data between threads of a thread block or cluster.
The register file and unified data cache in an SM have finite sizes. The size of an SM's register file, unified data cache, and how the unified data cache can be configured for L1 and shared memory
balance can be found in Memory Information per Compute Capability. The register file, shared memory space, and L1 cache are shared among all threads in a thread block.
To schedule a thread block to an SM, the total number of registers needed for each thread multiplied by the number of threads in the thread block must be less than or equal to the available registers in the SM. If the number of registers required for a thread block exceeds the size of the register file, the kernel is not launchable and the number of threads in the thread block must be decreased to make the thread block launchable.
Shared memory allocations are done at the thread block level. That is, unlike register allocations which are per thread, allocations of shared memory are common to the entire thread block.
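As an illustrative sketch (not taken from the guide), a kernel declares a per-block shared memory allocation with the __shared__ qualifier; every thread in the block sees the same array, which makes it a natural place to exchange data:

```cuda
#define BLOCK_SIZE 256

__global__ void reverseInBlock(int *data) {
    // One allocation per thread block, visible to all of its threads.
    __shared__ int tile[BLOCK_SIZE];

    int t = threadIdx.x;
    tile[t] = data[blockIdx.x * BLOCK_SIZE + t];

    // Wait until every thread in the block has written its element.
    __syncthreads();

    // Exchange data through shared memory: each thread reads an
    // element written by a different thread.
    data[blockIdx.x * BLOCK_SIZE + t] = tile[BLOCK_SIZE - 1 - t];
}
```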
1.2.3.2.1 Caches
In addition to programmable memories, GPUs have both L1 and L2 caches. Each SM has an L1 cache which is part of the unified data cache. A larger L2 cache is shared by all SMs within a GPU. This can be seen in the GPU block diagram in Figure 2. Each SM also has a separate constant cache, which is used to cache values in global memory that have been declared to be constant over the life of a kernel. The compiler may place kernel parameters into constant memory as well. This can improve kernel performance by allowing kernel parameters to be cached in the SM separately from the L1 data cache.
1.2.3.3 Unified Memory
When an application allocates memory explicitly on the GPU or CPU, that memory is only accessible to code running on that device. That is, CPU memory can only be accessed from CPU code, and GPU memory can only be accessed from kernels running on the GPU². CUDA APIs for copying memory between the CPU and GPU are used to explicitly copy data to the correct memory at the right time.
A CUDA feature called unified memory allows applications to make memory allocations which can be accessed from CPU or GPU. The CUDA runtime or underlying hardware enables access or relocates the data to the correct place when needed. Even with unified memory, optimal performance is attained by keeping the migration of memory to a minimum and accessing data from the processor directly attached to the memory where it resides as much as possible.
The hardware features of the system determine how access and exchange of data between memory spaces is achieved. Section Unified Memory introduces the different categories of unified memory systems. Section Unified Memory contains many more details about use and behavior of unified memory in all situations.
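A minimal unified memory sketch (illustrative, not from the guide): cudaMallocManaged returns a single pointer that both host code and device code can dereference, with the runtime migrating the data as needed:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 256;
    int *data;

    // One allocation, accessible from both CPU and GPU.
    cudaMallocManaged(&data, n * sizeof(int));

    for (int i = 0; i < n; ++i) data[i] = i;  // CPU writes directly.

    increment<<<1, n>>>(data, n);             // GPU uses the same pointer.
    cudaDeviceSynchronize();                  // Wait before the CPU reads again.

    printf("data[0] = %d\n", data[0]);
    cudaFree(data);
    return 0;
}
```

No explicit cudaMemcpy calls are needed, though as the text notes, performance still depends on minimizing migrations.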
1.3. The CUDA platform
The NVIDIA CUDA platform consists of many pieces of software and hardware and many important technologies developed to enable computing on heterogeneous systems. This chapter serves to introduce some of the fundamental concepts and components of the CUDA platform that are important for application developers to understand. This chapter, like Programming Model, is not specific to any programming language, but applies to everything that uses the CUDA platform.
² An exception to this is mapped memory, which is CPU memory allocated with properties that enable it to be directly accessed from the GPU. However, mapped access occurs over the PCIe or NVLINK connection. The GPU is unable to hide the higher latency and lower bandwidth behind parallelism, so mapped memory is not a performant replacement for unified memory or placing data in the appropriate memory space.
1.3.1. Compute Capability and Streaming Multiprocessor Versions
Every NVIDIA GPU has a Compute Capability (CC) number, which indicates what features are supported by that GPU and specifies some hardware parameters for that GPU. These specifications are documented in the Section 5.1 appendix. A list of all NVIDIA GPUs and their compute capabilities is maintained on the CUDA GPU Compute Capability page.
Compute capability is denoted as a major and minor version number in the format X.Y where X is the major version number and Y is the minor version number. For example, CC 12.0 has a major version of 12 and a minor version of 0. The compute capability directly corresponds to the version number of the SM. For example, the SMs within a GPU of CC 12.0 have SM version sm_120. This version is used to label binaries.
Section 5.1.1 shows how to query and determine the compute capability of the GPU(s) in a system.
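As a brief sketch (illustrative; Section 5.1.1 covers this in full), the CUDA runtime API exposes the major and minor compute capability of each device through cudaGetDeviceProperties:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // prop.major and prop.minor together form the compute capability,
        // e.g. major 12 and minor 0 for a CC 12.0 (sm_120) GPU.
        printf("Device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```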
1.3.2. CUDA Toolkit and NVIDIA Driver
The NVIDIA Driver can be thought of as the operating system of the GPU. The NVIDIA Driver is a software component which must be installed on the host system's operating system and is necessary for all GPU uses, including display and graphical functionality. The NVIDIA Driver is foundational to the CUDA platform. In addition to CUDA, the NVIDIA Driver provides all other methods of using the GPU, for example Vulkan and Direct3D. The NVIDIA Driver has version numbers such as r580.
The CUDA Toolkit is a set of libraries, headers, and tools for writing, building, and analyzing software which utilizes GPU computing. The CUDA Toolkit is a separate software product from the NVIDIA Driver.
The CUDA runtime is a special case of one of the libraries provided by the CUDA Toolkit. The CUDA runtime provides both an API and some language extensions to handle common tasks such as allocating memory, copying data between GPUs and other GPUs or CPUs, and launching kernels. The API components of the CUDA runtime are referred to as the CUDA runtime API.
The CUDA Compatibility document provides full details of compatibility between different GPUs, NVIDIA Drivers, and CUDA Toolkit versions.
1.3.2.1 CUDA Runtime API and CUDA Driver API
The CUDA runtime API is implemented on top of a lower-level API called the CUDA driver API, which is an API exposed by the NVIDIA Driver. This guide focuses on the APIs exposed by the CUDA runtime API. All the same functionality can be achieved using only the driver API if desired. Some features are only available using the driver API. Applications may use either API or both interoperably. Section The CUDA Driver API covers interoperation between the runtime and driver APIs.
The full API reference for the CUDA runtime API functions can be found in the CUDA Runtime API Documentation.
The full API reference for the CUDA driver API can be found in the CUDA Driver API Documentation.
1.3.3. Parallel Thread Execution (PTX)
A fundamental but sometimes invisible layer of the CUDA platform is the Parallel Thread Execution (PTX) virtual instruction set architecture (ISA). PTX is a high-level assembly language for NVIDIA GPUs. PTX provides an abstraction layer over the physical ISA of real GPU hardware. Like other platforms,
applications can be written directly in this assembly language, though doing so can add unnecessary complexity and difficulty to software development.
Domain-specific languages and compilers for high-level languages can generate PTX code as an intermediate representation (IR) and then use NVIDIA's offline or just-in-time (JIT) compilation tools to produce executable binary GPU code. This enables the CUDA platform to be programmable from languages other than just those supported by NVIDIA-provided tools such as NVCC: The NVIDIA CUDA Compiler.
Since GPU capabilities change and grow over time, the PTX virtual ISA specification is versioned. PTX versions, like SM versions, correspond to a compute capability. For example, PTX which supports all the features of compute capability 8.0 is called compute_80.
Full documentation on PTX can be found in the PTX ISA.
1.3.4. Cubins and Fatbins
CUDA applications and libraries are usually written in a higher-level language like C++. That higher-level language is compiled to PTX, and then the PTX is compiled into real binary for a physical GPU, called a CUDA binary, or cubin for short. A cubin has a specific binary format for a specific SM version, such as sm_120.
Executables and library binaries that use GPU computing contain both CPU and GPU code. The GPU code is stored within a container called a fatbin. Fatbins can contain cubins and PTX for multiple different targets. For example, an application could be built with binaries for multiple different GPU architectures, that is, different SM versions. When an application is run, its GPU code is loaded onto a specific GPU and the best binary for that GPU from the fatbin is used.
Fatbins can also contain one or more PTX versions of GPU code, the use for which is described in PTX Compatibility. Figure 8 shows an example of an application or library binary which contains multiple cubin versions of GPU code as well as one version of PTX code.
1.3.4.1 Binary Compatibility
NVIDIA GPUs guarantee binary compatibility in certain circumstances. Specifically, within a major version of compute capability, GPUs with minor compute capability greater than or equal to the targeted version of a cubin can load and execute that cubin. For example, if an application contains a cubin with code compiled for compute capability 8.6, that cubin can be loaded and executed on GPUs with compute capability 8.6 or 8.9. It cannot, however, be loaded on GPUs with compute capability 8.0, because the GPU's CC minor version, 0, is lower than the code's minor version, 6.
NVIDIA GPUs are not binary compatible between major compute capability versions. That is, cubin code compiled for compute capability 8.6 will not load on GPUs of compute capability 9.0.
When discussing binary code, the binary code is often referred to as having a version such as sm_86 in the above example. This is the same as saying the binary was built for compute capability 8.6. This shorthand is often used because it is how a developer specifies this binary build target to the NVIDIA CUDA compiler, nvcc.
Note
Binary compatibility is promised only for binaries created by NVIDIA tools such as nvcc. Manually editing or generating binary code for NVIDIA GPUs is not supported. Compatibility promises are invalidated if binaries are modified in any way.
Figure 8: The binary for an executable or library contains both CPU binary code and a fatbin container for GPU code. A fatbin can contain both cubin GPU binary code and PTX virtual ISA code. PTX code can be JIT compiled for future targets.
1.3.4.2 PTX Compatibility
GPU code can be stored in executables in binary or PTX form, which is covered in Cubins and Fatbins. When an application stores the PTX version of GPU code, that PTX can be JIT compiled at application runtime for any compute capability equal to or higher than the compute capability of the PTX code. For example, if an application contains PTX for compute_80, that PTX code can be JIT compiled to later SM versions, such as sm_120, at application runtime. This enables forward compatibility with future GPUs without the need to rebuild applications or libraries.
1.3.4.3 Just-in-Time Compilation
PTX code loaded by an application at runtime is compiled to binary code by the device driver. This is called just-in-time (JIT) compilation. Just-in-time compilation increases application load time, but allows the application to benefit from any new compiler improvements coming with each new device driver. It also enables applications to run on devices that did not exist at the time the application was compiled.
When the device driver just-in-time compiles PTX code for an application, it automatically caches a copy of the generated binary code in order to avoid repeating the compilation in subsequent invocations of the application. The cache, called the compute cache, is automatically invalidated when the device driver is upgraded, so that applications can benefit from the improvements in the new just-in-time compiler built into the device driver.
How and when PTX is JIT compiled at runtime has been relaxed since the earliest versions of CUDA, allowing more flexibility for when and if to JIT compile some or all kernels. The section Lazy Loading describes the available options and how to control JIT behavior. There are also a few environment variables which control just-in-time compilation behavior, as described in CUDA Environment Variables.
As an alternative to using nvcc to compile CUDA C++ device code, NVRTC can be used to compile CUDA C++ device code to PTX at runtime. NVRTC is a runtime compilation library for CUDA C++; more information can be found in the NVRTC User Guide.
Chapter 2. Programming GPUs in CUDA
2.1. Intro to CUDA C++
This chapter introduces some of the basic concepts of the CUDA programming model by illustrating how they are exposed in C++.
This programming guide focuses on the CUDA runtime API. The CUDA runtime API is the most commonly used way of using CUDA in C++ and is built on top of the lower-level CUDA driver API.
CUDA Runtime API and CUDA Driver API discusses the difference between the APIs, and CUDA driver API discusses writing code that mixes the APIs.
This guide assumes the CUDA Toolkit and NVIDIA Driver are installed and that a supported NVIDIA GPU is present. See The CUDA Quickstart Guide for instructions on installing the necessary CUDA components.
2.1.1. Compilation with NVCC
GPU code written in C++ is compiled using the NVIDIA CUDA Compiler, nvcc. nvcc is a compiler driver that simplifies the process of compiling C++ or PTX code: it provides simple and familiar command line options and executes them by invoking the collection of tools that implement the different compilation stages.
This guide will show nvcc command lines which can be used on any Linux system with the CUDA Toolkit installed, at a Windows command line or PowerShell, or on Windows Subsystem for Linux with the CUDA Toolkit. The nvcc chapter of this guide covers common use cases of nvcc, and complete documentation is provided by the nvcc user manual.
2.1.2. Kernels
As mentioned in the introduction to the CUDA Programming Model, functions which execute on the GPU and which can be invoked from the host are called kernels. Kernels are written to be run by many parallel threads simultaneously.
2.1.2.1 Specifying Kernels
The code for a kernel is specified using the __global__ declaration specifier. This indicates to the compiler that this function will be compiled for the GPU in a way that allows it to be invoked from a kernel launch. A kernel launch is an operation which starts a kernel running, usually from the CPU. Kernels are functions with a void return type.
// Kernel definition
__global__ void vecAdd(float* A, float* B, float* C)
{
}
2.1.2.2 Launching Kernels
The number of threads that will execute the kernel in parallel is specified as part of the kernel launch. This is called the execution configuration. Different invocations of the same kernel may use different execution configurations, such as a different number of threads or thread blocks.
There are two ways of launching kernels from CPU code: triple chevron notation and cudaLaunchKernelEx. Triple chevron notation, the most common way of launching kernels, is introduced here. An example of launching a kernel using cudaLaunchKernelEx is shown and discussed in detail in Section 3.1.1.
2.1.2.2.1 Triple Chevron Notation
Triple chevron notation is a CUDA C++ Language Extension which is used to launch kernels. It is called triple chevron because it uses three chevron characters to encapsulate the execution configuration for the kernel launch, i.e. <<< >>>. Execution configuration parameters are specified as a comma separated list inside the chevrons, similar to parameters to a function call. The syntax for a kernel launch of the vecAdd kernel is shown below.

__global__ void vecAdd(float* A, float* B, float* C)
{
}

int main()
{
    ...
    // Kernel invocation
    vecAdd<<<1, 256>>>(A, B, C);
    ...
}
The first two parameters to the triple chevron notation are the grid dimensions and the thread block dimensions, respectively. When using 1-dimensional thread blocks or grids, integers can be used to specify dimensions.
The above code launches a single thread block containing 256 threads. Each thread will execute the exact same kernel code. In Thread and Grid Index Intrinsics, we'll show how each thread can use its index within the thread block and grid to change the data it operates on.
There is a limit to the number of threads per block, since all threads of a block reside on the same streaming multiprocessor (SM) and must share the resources of the SM. On current GPUs, a thread block may contain up to 1024 threads. If resources allow, more than one thread block can be scheduled on an SM simultaneously.
Kernel launches are asynchronous with respect to the host thread. That is, the kernel will be set up for execution on the GPU, but the host code will not wait for the kernel to complete (or even start) executing on the GPU before proceeding. Some form of synchronization between the GPU and CPU must be used to determine that the kernel has completed. The most basic version, completely synchronizing the entire GPU, is shown in Synchronizing CPU and GPU. More sophisticated methods of synchronization are covered in Asynchronous Execution.
When using 2- or 3-dimensional grids or thread blocks, the CUDA type dim3 is used as the grid and thread block dimension parameters. The code fragment below shows a kernel launch of a MatAdd kernel using a 16 by 16 grid of thread blocks, where each thread block is 8 by 8.

int main()
{
    ...
    dim3 grid(16,16);
    dim3 block(8,8);
    MatAdd<<<grid, block>>>(A, B, C);
    ...
}
2.1.2.3 Thread and Grid Index Intrinsics
Within kernel code, CUDA provides intrinsics to access parameters of the execution configuration and the index of a thread or block.
▶ threadIdx gives the index of a thread within its thread block. Each thread in a thread block will have a different index.
▶ blockDim gives the dimensions of the thread block, which was specified in the execution configuration of the kernel launch.
▶ blockIdx gives the index of a thread block within the grid. Each thread block will have a different index.
▶ gridDim gives the dimensions of the grid, which was specified in the execution configuration when the kernel was launched.
Each of these intrinsics is a 3-component vector with a .x, .y, and .z member. Dimensions not specified by a launch configuration will default to 1. threadIdx and blockIdx are zero indexed. That is, threadIdx.x will take on values from 0 up to and including blockDim.x - 1. .y and .z operate the same in their respective dimensions.
Similarly, blockIdx.x will have values from 0 up to and including gridDim.x - 1, and the same for the .y and .z dimensions, respectively.
These allow an individual thread to identify what work it should carry out. Returning to the vecAdd kernel, the kernel takes three parameters, each of which is a vector of floats. The kernel performs an element-wise addition of A and B and stores the result in C. The kernel is parallelized such that each thread will perform one addition. Which element it computes is determined by its thread and grid index.

__global__ void vecAdd(float* A, float* B, float* C)
{
    // calculate which element this thread is responsible for computing
    int workIndex = threadIdx.x + blockDim.x * blockIdx.x;
    // Perform computation
    C[workIndex] = A[workIndex] + B[workIndex];
}
int main()
{
    ...
    // A, B, and C are vectors of 1024 elements
    vecAdd<<<4, 256>>>(A, B, C);
    ...
}
In this example, 4 thread blocks of 256 threads are used to add a vector of 1024 elements. In the first thread block, blockIdx.x will be zero, and so each thread's workIndex will simply be its threadIdx.x. In the second thread block, blockIdx.x will be 1, so blockDim.x * blockIdx.x will be the same as blockDim.x, which is 256 in this case. The workIndex for each thread in the second thread block will be its threadIdx.x + 256. In the third thread block, workIndex will be threadIdx.x + 512.
This computation of workIndex is very common for 1-dimensional parallelizations. Expanding to two or three dimensions often follows the same pattern in each of those dimensions.
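As an illustration of this indexing pattern, the host-side sketch below (plain C++, no GPU required; the loop variables and the coverageCount helper are hypothetical stand-ins for the blockIdx and threadIdx intrinsics) enumerates the global indices a 2-dimensional launch would produce and counts how often each matrix element is visited:

```cpp
#include <cassert>
#include <vector>

// Host-side sketch: emulate the global-index computation a 2D kernel launch
// would perform, with plain loop variables standing in for the blockIdx and
// threadIdx intrinsics. Returns how many times each element of a
// (gridX*blockX) by (gridY*blockY) matrix would be visited.
std::vector<int> coverageCount(int gridX, int gridY, int blockX, int blockY)
{
    int width = gridX * blockX;
    std::vector<int> visits(width * gridY * blockY, 0);
    for (int by = 0; by < gridY; by++)               // blockIdx.y
        for (int bx = 0; bx < gridX; bx++)           // blockIdx.x
            for (int ty = 0; ty < blockY; ty++)      // threadIdx.y
                for (int tx = 0; tx < blockX; tx++)  // threadIdx.x
                {
                    int col = bx * blockX + tx;  // threadIdx.x + blockDim.x * blockIdx.x
                    int row = by * blockY + ty;  // threadIdx.y + blockDim.y * blockIdx.y
                    visits[row * width + col]++;
                }
    return visits;
}
```

With a 16 by 16 grid of 8 by 8 blocks, as in the MatAdd launch shown earlier, every element of the resulting 128 by 128 matrix is covered by exactly one (block, thread) pair.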
2.1.2.3.1 Bounds Checking
The example given above assumes that the length of the vector is a multiple of the thread block size, 256 threads in this case. To make the kernel handle any vector length, we can add checks that the memory access is not exceeding the bounds of the arrays, as shown below, and then launch one thread block which will have some inactive threads.
__global__ void vecAdd(float* A, float* B, float* C, int vectorLength)
{
    // calculate which element this thread is responsible for computing
    int workIndex = threadIdx.x + blockDim.x * blockIdx.x;
    if(workIndex < vectorLength)
    {
        // Perform computation
        C[workIndex] = A[workIndex] + B[workIndex];
    }
}
With the above kernel code, more threads than needed can be launched without causing out-of-bounds accesses to the arrays. When workIndex exceeds vectorLength, threads exit and do not do any work. Launching extra threads in a block that do no work does not incur a large overhead cost; however, launching thread blocks in which no threads do work should be avoided. This kernel can now handle vector lengths which are not a multiple of the block size.
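The effect of the guard can be verified without a GPU. The sketch below (plain C++; guardedVecAdd is a hypothetical helper, and nested loops stand in for the parallel threads) mimics a launch whose thread count overshoots a vector length that is not a multiple of the block size:

```cpp
#include <cassert>
#include <vector>

// Host-side sketch of the guarded kernel: every launched "thread" computes
// its workIndex, but only threads with workIndex < vectorLength touch the
// arrays; the rest do nothing, exactly as in the bounds-checked kernel.
std::vector<float> guardedVecAdd(const std::vector<float>& A,
                                 const std::vector<float>& B,
                                 int threadsPerBlock)
{
    int vectorLength = (int)A.size();
    int blocks = (vectorLength + threadsPerBlock - 1) / threadsPerBlock;
    std::vector<float> C(vectorLength, 0.0f);
    for (int blockIdx = 0; blockIdx < blocks; blockIdx++)
        for (int threadIdx = 0; threadIdx < threadsPerBlock; threadIdx++)
        {
            int workIndex = threadIdx + threadsPerBlock * blockIdx;
            if (workIndex < vectorLength)  // the bounds check
                C[workIndex] = A[workIndex] + B[workIndex];
        }
    return C;
}
```

With a block size of 4 and a vector of length 5, two blocks (eight "threads") are simulated, and the three threads whose workIndex would be 5, 6, or 7 are filtered out by the guard.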
The number of thread blocks which are needed can be calculated as the ceiling of the number of
threads needed, the vector length in this case, divided by the number of threads per block. That is,
theintegerdivisionofthenumberofthreadsneededbythenumberofthreadsperblock,roundedup.
Acommonwayofexpressingthisasasingleintegerdivisionisgivenbelow. Byaddingthreads
- 1
beforetheintegerdivision,thisbehaveslikeaceilingfunction,addinganotherthreadblockonlyifthe
vectorlengthisnotdivisiblebythenumberofthreadsperblock.
// vectorLength is an integer storing number of elements in the vector
int threads = 256;
int blocks = (vectorLength + threads - 1) / threads;
vecAdd<<<blocks, threads>>>(devA, devB, devC, vectorLength);
The CUDA Core Compute Library (CCCL) provides a convenient utility, cuda::ceil_div, for doing this ceiling divide to calculate the number of blocks needed for a kernel launch. This utility is available by including the header <cuda/cmath>.
// vectorLength is an integer storing number of elements in the vector
int threads = 256;
int blocks = cuda::ceil_div(vectorLength, threads);
vecAdd<<<blocks, threads>>>(devA, devB, devC, vectorLength);
The choice of 256 threads per block here is arbitrary, but this is quite often a good value to start with.
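The ceiling-division arithmetic used to size the launch can be spot-checked in ordinary host code. In the sketch below (plain C++; ceilDiv is a hypothetical stand-in for cuda::ceil_div), the single-integer-division form is compared against a direct ceiling computation:

```cpp
#include <cassert>

// Integer ceiling division as used to compute the number of thread blocks:
// for positive n and t, (n + t - 1) / t equals the ceiling of n / t.
int ceilDiv(int n, int t)
{
    return (n + t - 1) / t;
}
```

For a vector of 1024 elements and 256 threads per block this yields 4 blocks; for 1025 elements it yields 5, with the extra block covering the single leftover element.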
2.1.3. Memory in GPU Computing
In order to use the vecAdd kernel shown above, the arrays A, B, and C must be in memory accessible to the GPU. There are several different ways to do this, two of which will be illustrated here. Other methods will be covered in later sections on unified memory. The memory spaces available to code running on the GPU were introduced in GPU Memory and are covered in more detail in GPU Device Memory Spaces.
2.1.3.1 Unified Memory
Unified memory is a feature of the CUDA runtime which lets the NVIDIA Driver manage movement of data between host and device(s). Memory is allocated using the cudaMallocManaged API or by declaring a variable with the __managed__ specifier. The NVIDIA Driver will make sure that the memory is accessible to the GPU or CPU whenever either tries to access it.
The code below shows a complete function to launch the vecAdd kernel which uses unified memory for the input and output vectors that will be used on the GPU. cudaMallocManaged allocates buffers which can be accessed from either the CPU or the GPU. These buffers are released using cudaFree.
void unifiedMemExample(int vectorLength)
{
    // Pointers to memory vectors
    float* A = nullptr;
    float* B = nullptr;
    float* C = nullptr;
    float* comparisonResult = (float*)malloc(vectorLength*sizeof(float));

    // Use unified memory to allocate buffers
    cudaMallocManaged(&A, vectorLength*sizeof(float));
    cudaMallocManaged(&B, vectorLength*sizeof(float));
    cudaMallocManaged(&C, vectorLength*sizeof(float));

    // Initialize vectors on the host
    initArray(A, vectorLength);
    initArray(B, vectorLength);

    // Launch the kernel. Unified memory will make sure A, B, and C are
    // accessible to the GPU
    int threads = 256;
    int blocks = cuda::ceil_div(vectorLength, threads);
    vecAdd<<<blocks, threads>>>(A, B, C, vectorLength);

    // Wait for the kernel to complete execution
    cudaDeviceSynchronize();

    // Perform computation serially on CPU for comparison
    serialVecAdd(A, B, comparisonResult, vectorLength);

    // Confirm that CPU and GPU got the same answer
    if(vectorApproximatelyEqual(C, comparisonResult, vectorLength))
    {
        printf("Unified Memory: CPU and GPU answers match\n");
    }
    else
    {
        printf("Unified Memory: Error - CPU and GPU answers do not match\n");
    }

    // Clean Up
    cudaFree(A);
    cudaFree(B);
    cudaFree(C);
    free(comparisonResult);
}
Unified memory is supported on all operating systems and GPUs supported by CUDA, though the underlying mechanism and performance may differ based on system architecture. Unified Memory provides more details. On some Linux systems (e.g. those with address translation services or heterogeneous memory management), all system memory is automatically unified memory, and there is no need to use cudaMallocManaged or the __managed__ specifier.
2.1.3.2 Explicit Memory Management
Explicitly managing memory allocation and data migration between memory spaces can help improve application performance, though it does make for more verbose code. The code below explicitly allocates memory on the GPU using cudaMalloc. Memory on the GPU is freed using the same cudaFree API as was used for unified memory in the previous example.
void explicitMemExample(int vectorLength)
{
    // Pointers for host memory
    float* A = nullptr;
    float* B = nullptr;
    float* C = nullptr;
    float* comparisonResult = (float*)malloc(vectorLength*sizeof(float));

    // Pointers for device memory
    float* devA = nullptr;
    float* devB = nullptr;
    float* devC = nullptr;

    // Allocate host memory using the cudaMallocHost API. This is best practice
    // when buffers will be used for copies between CPU and GPU memory
    cudaMallocHost(&A, vectorLength*sizeof(float));
    cudaMallocHost(&B, vectorLength*sizeof(float));
    cudaMallocHost(&C, vectorLength*sizeof(float));

    // Initialize vectors on the host
    initArray(A, vectorLength);
    initArray(B, vectorLength);

    // Allocate memory on the GPU
    cudaMalloc(&devA, vectorLength*sizeof(float));
    cudaMalloc(&devB, vectorLength*sizeof(float));
    cudaMalloc(&devC, vectorLength*sizeof(float));

    // Copy data to the GPU
    cudaMemcpy(devA, A, vectorLength*sizeof(float), cudaMemcpyDefault);
    cudaMemcpy(devB, B, vectorLength*sizeof(float), cudaMemcpyDefault);
    cudaMemset(devC, 0, vectorLength*sizeof(float));

    // Launch the kernel
    int threads = 256;
    int blocks = cuda::ceil_div(vectorLength, threads);
    vecAdd<<<blocks, threads>>>(devA, devB, devC, vectorLength);

    // wait for kernel execution to complete
    cudaDeviceSynchronize();

    // Copy results back to host
    cudaMemcpy(C, devC, vectorLength*sizeof(float), cudaMemcpyDefault);

    // Perform computation serially on CPU for comparison
    serialVecAdd(A, B, comparisonResult, vectorLength);

    // Confirm that CPU and GPU got the same answer
    if(vectorApproximatelyEqual(C, comparisonResult, vectorLength))
    {
        printf("Explicit Memory: CPU and GPU answers match\n");
    }
    else
    {
        printf("Explicit Memory: Error - CPU and GPU answers do not match\n");
    }

    // clean up
    cudaFree(devA);
    cudaFree(devB);
    cudaFree(devC);
    cudaFreeHost(A);
    cudaFreeHost(B);
    cudaFreeHost(C);
    free(comparisonResult);
}
The CUDA API cudaMemcpy is used to copy data from a buffer residing on the CPU to a buffer residing on the GPU. Along with the destination pointer, source pointer, and size in bytes, the final parameter of cudaMemcpy is a cudaMemcpyKind. This can have values such as cudaMemcpyHostToDevice for copies from the CPU to a GPU, cudaMemcpyDeviceToHost for copies from the GPU to the CPU, or cudaMemcpyDeviceToDevice for copies within a GPU or between GPUs.
In this example, cudaMemcpyDefault is passed as the last argument to cudaMemcpy. This causes CUDA to use the value of the source and destination pointers to determine the type of copy to perform.
The cudaMemcpy API is synchronous. That is, it does not return until the copy has completed. Asynchronous copies are introduced in Launching Memory Transfers in CUDA Streams.
The code uses cudaMallocHost to allocate memory on the CPU. This allocates page-locked memory on the host, which can improve copy performance and is necessary for asynchronous memory transfers. In general, it is good practice to use page-locked memory for CPU buffers that will be used in data transfers to and from GPUs. Performance can degrade on some systems if too much host memory is page-locked. Best practice is to page-lock only buffers which will be used for sending data to or receiving data from the GPU.
2.1.3.3 Memory Management and Application Performance
As can be seen in the above example, explicit memory management is more verbose, requiring the programmer to specify copies between the host and device. This is the advantage and disadvantage of explicit memory management: it affords more control over when data is copied between host and devices, where memory is resident, and exactly what memory is allocated where. Explicit memory management can provide performance opportunities by controlling memory transfers and overlapping them with other computations.
When using unified memory, there are CUDA APIs (which will be covered in Memory Advise and Prefetch) which provide hints to the NVIDIA driver managing the memory; these can enable some of the performance benefits of explicit memory management while still using unified memory.
2.1.4. Synchronizing CPU and GPU
As mentioned in Launching Kernels, kernel launches are asynchronous with respect to the CPU thread which called them. This means the control flow of the CPU thread will continue executing before the kernel has completed, and possibly even before it has launched. In order to guarantee that a kernel has completed execution before proceeding in host code, some synchronization mechanism is necessary.
The simplest way to synchronize the GPU and a host thread is with the use of cudaDeviceSynchronize, which blocks the host thread until all previously issued work on the GPU has completed. In the examples of this chapter this is sufficient because only single operations are being executed on the GPU. In larger applications, there may be multiple streams executing work on the GPU, and cudaDeviceSynchronize will wait for work in all streams to complete. In these applications, using Stream Synchronization APIs to synchronize only with a specific stream, or CUDA Events, is recommended. These will be covered in detail in the Asynchronous Execution chapter.
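The launch-then-synchronize pattern has a close analogue in standard C++, which may help build intuition: work started asynchronously must be explicitly waited on before its results are consumed. In the sketch below (plain C++ with std::async; no CUDA calls are involved, and asyncAddExample is a hypothetical name), future::wait plays the role that cudaDeviceSynchronize plays for GPU work:

```cpp
#include <future>
#include <vector>

// Analogy for kernel launch + cudaDeviceSynchronize: the "kernel" runs
// asynchronously on another thread, and wait() blocks the calling thread
// until it has completed, so the results in C are valid afterwards.
void asyncAddExample(const std::vector<float>& A,
                     const std::vector<float>& B,
                     std::vector<float>& C)
{
    auto task = std::async(std::launch::async, [&] {
        for (std::size_t i = 0; i < C.size(); i++)
            C[i] = A[i] + B[i];
    });
    // The calling thread could do unrelated work here while the task runs.
    task.wait();  // the synchronization point, analogous to cudaDeviceSynchronize
}
```

Omitting the wait() and reading C immediately would be the same mistake as reading a kernel's output without synchronizing: the results may not yet exist.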
2.1.5. Putting it All Together
The following listings show the entire code for the simple vector addition kernel introduced in this chapter, along with all host code and utility functions for verifying that the answer obtained is correct. These examples default to using a vector length of 1024, but accept a different vector length as a command line argument to the executable.
Unified Memory
#include <cuda_runtime_api.h>
#include <memory.h>
#include <cstdlib>
#include <ctime>
#include <stdio.h>
#include <cuda/cmath>

__global__ void vecAdd(float* A, float* B, float* C, int vectorLength)
{
    int workIndex = threadIdx.x + blockIdx.x*blockDim.x;
    if(workIndex < vectorLength)
    {
        C[workIndex] = A[workIndex] + B[workIndex];
    }
}

void initArray(float* A, int length)
{
    std::srand(std::time({}));
    for(int i=0; i<length; i++)
    {
        A[i] = rand() / (float)RAND_MAX;
    }
}

void serialVecAdd(float* A, float* B, float* C, int length)
{
    for(int i=0; i<length; i++)
    {
        C[i] = A[i] + B[i];
    }
}

bool vectorApproximatelyEqual(float* A, float* B, int length, float epsilon=0.00001)
{
    for(int i=0; i<length; i++)
    {
        if(fabs(A[i] - B[i]) > epsilon)
        {
            printf("Index %d mismatch: %f != %f", i, A[i], B[i]);
            return false;
        }
    }
    return true;
}

void unifiedMemExample(int vectorLength)
{
    // Pointers to memory vectors
    float* A = nullptr;
    float* B = nullptr;
    float* C = nullptr;
    float* comparisonResult = (float*)malloc(vectorLength*sizeof(float));

    // Use unified memory to allocate buffers
    cudaMallocManaged(&A, vectorLength*sizeof(float));
    cudaMallocManaged(&B, vectorLength*sizeof(float));
    cudaMallocManaged(&C, vectorLength*sizeof(float));

    // Initialize vectors on the host
    initArray(A, vectorLength);
    initArray(B, vectorLength);

    // Launch the kernel. Unified memory will make sure A, B, and C are
    // accessible to the GPU
    int threads = 256;
    int blocks = cuda::ceil_div(vectorLength, threads);
    vecAdd<<<blocks, threads>>>(A, B, C, vectorLength);

    // Wait for the kernel to complete execution
    cudaDeviceSynchronize();

    // Perform computation serially on CPU for comparison
    serialVecAdd(A, B, comparisonResult, vectorLength);

    // Confirm that CPU and GPU got the same answer
    if(vectorApproximatelyEqual(C, comparisonResult, vectorLength))
    {
        printf("Unified Memory: CPU and GPU answers match\n");
    }
    else
    {
        printf("Unified Memory: Error - CPU and GPU answers do not match\n");
    }

    // Clean Up
    cudaFree(A);
    cudaFree(B);
    cudaFree(C);
    free(comparisonResult);
}

int main(int argc, char** argv)
{
    int vectorLength = 1024;
    if(argc >= 2)
    {
        vectorLength = std::atoi(argv[1]);
    }
    unifiedMemExample(vectorLength);
    return 0;
}
Explicit Memory Management
| #include | <cuda_runtime_api.h> | | | | | | | |
| -------- | -------------------- | --- | --- | --- | --- | --- | --- | --- |
| #include | <memory.h> | | | | | | | |
| #include | <cstdlib> | | | | | | | |
| #include | <ctime> | | | | | | | |
| #include | <stdio.h> | | | | | | | |
| #include | <cuda∕cmath> | | | | | | | |
__global__ void vecAdd(float* A, float* B, float* C, int vectorLength)
{
| int | workIndex | = threadIdx.x | | | + blockIdx.x*blockDim.x; | | | |
| ------------ | --------- | --------------- | --- | --- | ------------------------ | --- | --- | --- |
| if(workIndex | | < vectorLength) | | | | | | |
{
| | C[workIndex] | = | A[workIndex] | | + B[workIndex]; | | | |
| --- | ------------ | --- | ------------ | --- | --------------- | --- | --- | --- |
}
}
| void | initArray(float* | | A, int | length) | | | | |
| ---- | ---------------- | --- | ------ | ------- | --- | --- | --- | --- |
{
std::srand(std::time({}));
| for(int | i=0; | i<length; | | i++) | | | | |
| ------- | ---- | --------- | --- | ---- | --- | --- | --- | --- |
{
| | A[i] = | rand() | ∕ (float)RAND_MAX; | | | | | |
| --- | ------ | ------ | ------------------ | --- | --- | --- | --- | --- |
}
}
| void | serialVecAdd(float* | | A, | float* | B, float* | C, | int length) | |
| ---- | ------------------- | --- | --- | ------ | --------- | --- | ----------- | --- |
{
| for(int | i=0; | i<length; | | i++) | | | | |
| ------- | ---- | --------- | --- | ---- | --- | --- | --- | --- |
{
| | C[i] = | A[i] + | B[i]; | | | | | |
| --- | ------ | ------ | ----- | --- | --- | --- | --- | --- |
}
}
bool vectorApproximatelyEqual(float* A, float* B, int length, float epsilon=0.
,→00001)
{
| for(int | i=0; | i<length; | | i++) | | | | |
| ------- | ---- | --------- | --- | ---- | --- | --- | --- | --- |
{
| | if(fabs(A[i] | -B[i]) | | > epsilon) | | | | |
| --- | ------------ | ------ | --- | ---------- | --- | --- | --- | --- |
{
| | printf("Index | | %d | mismatch: | %f | != %f", | i, A[i], | B[i]); |
| --- | ------------- | ------ | --- | --------- | --- | ------- | -------- | ------ |
| | return | false; | | | | | | |
}
}
| return | true; | | | | | | | |
| ------ | ----- | --- | --- | --- | --- | --- | --- | --- |
(continuesonnextpage)
2.1. IntrotoCUDAC++ 27
CUDAProgrammingGuide,Release13.1
(continuedfrompreviouspage)
}
// explicit-memory-begin
void explicitMemExample(int vectorLength)
{
    // Pointers for host memory
    float* A = nullptr;
    float* B = nullptr;
    float* C = nullptr;
    float* comparisonResult = (float*)malloc(vectorLength*sizeof(float));

    // Pointers for device memory
    float* devA = nullptr;
    float* devB = nullptr;
    float* devC = nullptr;

    // Allocate host memory using the cudaMallocHost API. This is best practice
    // when buffers will be used for copies between CPU and GPU memory
    cudaMallocHost(&A, vectorLength*sizeof(float));
    cudaMallocHost(&B, vectorLength*sizeof(float));
    cudaMallocHost(&C, vectorLength*sizeof(float));

    // Initialize vectors on the host
    initArray(A, vectorLength);
    initArray(B, vectorLength);

    // start-allocate-and-copy
    // Allocate memory on the GPU
    cudaMalloc(&devA, vectorLength*sizeof(float));
    cudaMalloc(&devB, vectorLength*sizeof(float));
    cudaMalloc(&devC, vectorLength*sizeof(float));

    // Copy data to the GPU
    cudaMemcpy(devA, A, vectorLength*sizeof(float), cudaMemcpyDefault);
    cudaMemcpy(devB, B, vectorLength*sizeof(float), cudaMemcpyDefault);
    cudaMemset(devC, 0, vectorLength*sizeof(float));
    // end-allocate-and-copy

    // Launch the kernel
    int threads = 256;
    int blocks = cuda::ceil_div(vectorLength, threads);
    vecAdd<<<blocks, threads>>>(devA, devB, devC, vectorLength);

    // Wait for kernel execution to complete
    cudaDeviceSynchronize();

    // Copy results back to host
    cudaMemcpy(C, devC, vectorLength*sizeof(float), cudaMemcpyDefault);

    // Perform computation serially on CPU for comparison
    serialVecAdd(A, B, comparisonResult, vectorLength);

    // Confirm that CPU and GPU got the same answer
    if(vectorApproximatelyEqual(C, comparisonResult, vectorLength))
    {
        printf("Explicit Memory: CPU and GPU answers match\n");
    }
    else
    {
        printf("Explicit Memory: Error - CPU and GPU answers do not match\n");
    }

    // Clean up
    cudaFree(devA);
    cudaFree(devB);
    cudaFree(devC);
    cudaFreeHost(A);
    cudaFreeHost(B);
    cudaFreeHost(C);
    free(comparisonResult);
}
// explicit-memory-end
int main(int argc, char** argv)
{
    int vectorLength = 1024;
    if(argc >= 2)
    {
        vectorLength = std::atoi(argv[1]);
    }
    explicitMemExample(vectorLength);
    return 0;
}
These can be built and run using nvcc as follows:

$ nvcc vecAdd_unifiedMemory.cu -o vecAdd_unifiedMemory
$ ./vecAdd_unifiedMemory
Unified Memory: CPU and GPU answers match
$ ./vecAdd_unifiedMemory 4096
Unified Memory: CPU and GPU answers match
$ nvcc vecAdd_explicitMemory.cu -o vecAdd_explicitMemory
$ ./vecAdd_explicitMemory
Explicit Memory: CPU and GPU answers match
$ ./vecAdd_explicitMemory 4096
Explicit Memory: CPU and GPU answers match
In these examples, all threads are doing independent work and do not need to coordinate or synchronize with each other. Frequently, threads will need to cooperate and communicate with other threads to carry out their work. Threads within a block can share data through shared memory and synchronize to coordinate memory accesses.

The most basic mechanism for synchronization at the block level is the __syncthreads() intrinsic, which acts as a barrier at which all threads in the block must wait before any threads are allowed to proceed. Shared Memory gives an example of using shared memory.
For efficient cooperation, shared memory is expected to be a low-latency memory near each processor core (much like an L1 cache) and __syncthreads() is expected to be lightweight. __syncthreads() only synchronizes the threads within a single thread block. Synchronization between blocks is not supported by the CUDA programming model. Cooperative Groups provides mechanisms to set synchronization domains other than a single thread block.

Best performance is usually achieved when synchronization is kept within a thread block. Thread blocks can still work on common results using atomic memory functions, which will be covered in coming sections.

Section 3.2.4 covers CUDA synchronization primitives that provide very fine-grained control for maximizing performance and resource utilization.
2.1.6. Runtime Initialization
The CUDA runtime creates a CUDA context for each device in the system. This context is the primary context for this device and is initialized at the first runtime function which requires an active context on this device. The context is shared among all the host threads of the application. As part of context creation, the device code is just-in-time compiled if necessary and loaded into device memory. This all happens transparently. The primary context created by the CUDA runtime can be accessed from the driver API for interoperability as described in Interoperability between Runtime and Driver APIs.

As of CUDA 12.0, the cudaInitDevice and cudaSetDevice calls initialize the runtime and the primary context associated with the specified device. The runtime will implicitly use device 0 and self-initialize as needed to process runtime API requests if they occur before these calls. This is important when timing runtime function calls and when interpreting the error code from the first call into the runtime. Prior to CUDA 12.0, cudaSetDevice would not initialize the runtime.

cudaDeviceReset destroys the primary context of the current device. If CUDA runtime APIs are called after the primary context has been destroyed, a new primary context for that device will be created.

Note

The CUDA interfaces use global state that is initialized during host program initiation and destroyed during host program termination. Using any of these interfaces (implicitly or explicitly) during program initiation or termination after main will result in undefined behavior.

As of CUDA 12.0, cudaSetDevice explicitly initializes the runtime, if it has not already been initialized, after changing the current device for the host thread. Previous versions of CUDA delayed runtime initialization on the new device until the first runtime call was made after cudaSetDevice. Because of this, it is very important to check the return value of cudaSetDevice for initialization errors.

The runtime functions from the error handling and version management sections of the reference manual do not initialize the runtime.
2.1.7. Error Checking in CUDA
Every CUDA API returns a value of an enumerated type, cudaError_t. In example code these errors are often not checked. In production applications, it is best practice to always check and manage the return value of every CUDA API call. When there are no errors, the value returned is cudaSuccess. Many applications choose to implement a utility macro such as the one shown below:
#define CUDA_CHECK(expr_to_check) do {                       \
    cudaError_t result = expr_to_check;                      \
    if(result != cudaSuccess)                                \
    {                                                        \
        fprintf(stderr,                                      \
                "CUDA Runtime Error: %s:%i:%d = %s\n",       \
                __FILE__,                                    \
                __LINE__,                                    \
                result,                                      \
                cudaGetErrorString(result));                 \
    }                                                        \
} while(0)
This macro uses the cudaGetErrorString API, which returns a human-readable string describing the meaning of a specific cudaError_t value. Using the above macro, an application would wrap CUDA runtime API calls within a CUDA_CHECK(expression) macro, as shown below:
CUDA_CHECK(cudaMalloc(&devA, vectorLength*sizeof(float)));
CUDA_CHECK(cudaMalloc(&devB, vectorLength*sizeof(float)));
CUDA_CHECK(cudaMalloc(&devC, vectorLength*sizeof(float)));
If any of these calls detects an error, it will be printed to stderr using this macro. This macro is common for smaller projects, but can be adapted to a logging system or other error handling mechanism in larger applications.
Note

It is important to note that the error state returned from any CUDA API call can also indicate an error from a previously issued asynchronous operation. Section Asynchronous Error Handling covers this in more detail.
2.1.7.1 Error State

The CUDA runtime maintains a cudaError_t state for each host thread. The value defaults to cudaSuccess and is overwritten whenever an error occurs. cudaGetLastError returns the current error state and then resets it to cudaSuccess. Alternatively, cudaPeekAtLastError returns the error state without resetting it.

Kernel launches using triple chevron notation do not return a cudaError_t. It is good practice to check the error state immediately after kernel launches to detect immediate errors in the kernel launch or asynchronous errors prior to the kernel launch. A value of cudaSuccess when checking the error state immediately after a kernel launch does not mean the kernel has executed successfully or even started execution. It only verifies that the kernel launch parameters and execution configuration passed to the runtime did not trigger any errors and that the error state does not hold a previous or asynchronous error from before the kernel started.
2.1.7.2 Asynchronous Errors

CUDA kernel launches and many runtime APIs are asynchronous. Asynchronous CUDA runtime APIs will be discussed in detail in Asynchronous Execution. The CUDA error state is set and overwritten whenever an error occurs. This means that errors which occur during the execution of asynchronous operations will only be reported when the error state is examined next. As noted, this may be a call to cudaGetLastError, cudaPeekAtLastError, or it could be any CUDA API which returns cudaError_t.

When errors are returned by CUDA runtime API functions, the error state is not cleared. This means that the error code from an asynchronous error, such as an invalid memory access by a kernel, will be returned by every CUDA runtime API until the error state has been cleared by calling cudaGetLastError.
vecAdd<<<blocks, threads>>>(devA, devB, devC, vectorLength);
// check error state after kernel launch
CUDA_CHECK(cudaGetLastError());
// wait for kernel execution to complete
// The CUDA_CHECK will report errors that occurred during execution of the kernel
CUDA_CHECK(cudaDeviceSynchronize());
Note

The cudaError_t value cudaErrorNotReady, which may be returned by cudaStreamQuery and cudaEventQuery, is not considered an error and is not reported by cudaPeekAtLastError or cudaGetLastError.
2.1.7.3 CUDA_LOG_FILE

Another good way to identify CUDA errors is with the CUDA_LOG_FILE environment variable. When this environment variable is set, the CUDA driver will write any error messages it encounters to a file whose path is specified in the environment variable. For example, take the following incorrect CUDA code, which attempts to launch a thread block that is larger than the maximum supported by any architecture.
__global__ void k()
{ }

int main()
{
    k<<<8192, 4096>>>(); // Invalid block size
    CUDA_CHECK(cudaGetLastError());
    return 0;
}
Building and running this, the check after the kernel launch detects and reports the error using the macro illustrated in Section 2.1.7.
$ nvcc errorLogIllustration.cu -o errlog
$ ./errlog
CUDA Runtime Error: /home/cuda/intro-cpp/errorLogIllustration.cu:24:1 = invalid argument
However, when the application is run with CUDA_LOG_FILE set to a text file, that file contains a bit more information about the error.
$ env CUDA_LOG_FILE=cudaLog.txt ./errlog
CUDA Runtime Error: /home/cuda/intro-cpp/errorLogIllustration.cu:24:1 = invalid argument
$ cat cudaLog.txt
[12:46:23.854][137216133754880][CUDA][E] One or more of block dimensions of (4096,1,1) exceeds corresponding maximum value of (1024,1024,64)
[12:46:23.854][137216133754880][CUDA][E] Returning 1 (CUDA_ERROR_INVALID_VALUE) from cuLaunchKernel
Setting CUDA_LOG_FILE to stdout or stderr will print to standard out and standard error, respectively. Using the CUDA_LOG_FILE environment variable, it is possible to capture and identify CUDA errors, even if the application does not implement proper error checking on CUDA return values. This approach can be extremely powerful for debugging, but the environment variable alone does not allow an application to handle and recover from CUDA errors at runtime. The error log management feature of CUDA also allows a callback function to be registered with the driver which will be called whenever an error is detected. This can be used to capture and handle errors at runtime, and also to integrate CUDA error logging seamlessly into an application's existing logging system.

Section 4.8 shows more examples of the error log management feature of CUDA. Error log management and CUDA_LOG_FILE are available with NVIDIA Driver version r570 and later.
2.1.8. Device and Host Functions
The __global__ specifier is used to indicate the entry point for a kernel. That is, a function which will be invoked for parallel execution on the GPU. Most often, kernels are launched from the host; however, it is possible to launch a kernel from within another kernel using dynamic parallelism.

The specifier __device__ indicates that a function should be compiled for the GPU and be callable from other __device__ or __global__ functions. A function, including class member functions, functors, and lambdas, can be specified as both __device__ and __host__, as in the example below.
2.1.9. Variable Specifiers
CUDA specifiers can be used on static variable declarations to control placement.

▶ __device__ specifies that a variable is stored in Global Memory
▶ __constant__ specifies that a variable is stored in Constant Memory
▶ __managed__ specifies that a variable is stored as Unified Memory
▶ __shared__ specifies that a variable is stored in Shared Memory

When a variable is declared with no specifier inside a __device__ or __global__ function, it is allocated to registers when possible, and local memory when necessary. Any variable declared with no specifier outside a __device__ or __global__ function will be allocated in system memory.
2.1.9.1 Detecting Device Compilation

When a function is specified with __host__ __device__, the compiler is instructed to generate both GPU and CPU code for this function. In such functions, it may be desirable to use the preprocessor to specify code only for the GPU or the CPU copy of the function. Checking whether __CUDA_ARCH__ is defined is the most common way of doing this, as illustrated in the example below.
2.1.10. Thread Block Clusters
From compute capability 9.0 onward, the CUDA programming model includes an optional level of hierarchy called thread block clusters that are made up of thread blocks. Similar to how threads in a thread block are guaranteed to be co-scheduled on a streaming multiprocessor, thread blocks in a cluster are also guaranteed to be co-scheduled on a GPU Processing Cluster (GPC) in the GPU.

Similar to thread blocks, clusters are also organized into a one-dimensional, two-dimensional, or three-dimensional grid of thread block clusters as illustrated by Figure 5.

The number of thread blocks in a cluster can be user-defined, and a maximum of 8 thread blocks in a cluster is supported as a portable cluster size in CUDA. Note that on GPU hardware or MIG configurations which are too small to support 8 multiprocessors, the maximum cluster size will be reduced accordingly. Identification of these smaller configurations, as well as of larger configurations supporting a thread block cluster size beyond 8, is architecture-specific and can be queried using the cudaOccupancyMaxPotentialClusterSize API.

All the thread blocks in the cluster are guaranteed to be co-scheduled to execute simultaneously on a single GPU Processing Cluster (GPC), which allows thread blocks in the cluster to perform hardware-supported synchronization using the cooperative groups API cluster.sync(). The cluster group also provides member functions to query the cluster group size in terms of number of threads or number of blocks using the num_threads() and num_blocks() APIs respectively. The rank of a thread or block in the cluster group can be queried using the thread_rank() and block_rank() APIs respectively.

Thread blocks that belong to a cluster have access to distributed shared memory, which is the combined shared memory of all thread blocks in the cluster. Thread blocks in a cluster have the ability to read, write, and perform atomics to any address in the distributed shared memory. Distributed Shared Memory gives an example of performing histograms in distributed shared memory.

Note

In a kernel launched using cluster support, the gridDim variable still denotes the size in terms of number of thread blocks, for compatibility purposes. The rank of a block in a cluster can be found using the Cooperative Groups API.
2.1.10.1 Launching with Clusters in Triple Chevron Notation

A thread block cluster can be enabled in a kernel either using the compile-time kernel attribute __cluster_dims__(X,Y,Z) or using the CUDA kernel launch API cudaLaunchKernelEx. The example below shows how to launch a cluster using a compile-time kernel attribute. The cluster size set using the kernel attribute is fixed at compile time, and the kernel can then be launched using the classical <<< , >>> notation. If a kernel uses a compile-time cluster size, the cluster size cannot be modified when launching the kernel.
// Kernel definition
// Compile time cluster size 2 in X-dimension and 1 in Y and Z dimension
__global__ void __cluster_dims__(2, 1, 1) cluster_kernel(float *input, float* output)
{
}

int main()
{
    float *input, *output;
    // Kernel invocation with compile time cluster size
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);

    // The grid dimension is not affected by cluster launch, and is still enumerated
    // using number of blocks.
    // The grid dimension must be a multiple of cluster size.
    cluster_kernel<<<numBlocks, threadsPerBlock>>>(input, output);
}
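When the cluster size must instead be chosen at launch time, the kernel can be compiled without the attribute and launched through cudaLaunchKernelEx with a cluster-dimension launch attribute. The sketch below follows that pattern; it reuses threadsPerBlock, numBlocks, and N from the example above and requests a 2x1x1 cluster (the specific size is an arbitrary choice for illustration).

// Kernel definition: no compile-time cluster attribute
__global__ void cluster_kernel(float *input, float *output)
{
}

int main()
{
    float *input, *output;
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);

    // Describe the launch, then attach the cluster dimension as a launch attribute
    cudaLaunchConfig_t config = {0};
    config.gridDim = numBlocks;        // grid is still enumerated in thread blocks
    config.blockDim = threadsPerBlock;

    cudaLaunchAttribute attribute[1];
    attribute[0].id = cudaLaunchAttributeClusterDimension;
    attribute[0].val.clusterDim.x = 2; // cluster size in the X dimension
    attribute[0].val.clusterDim.y = 1;
    attribute[0].val.clusterDim.z = 1;
    config.attrs = attribute;
    config.numAttrs = 1;

    cudaLaunchKernelEx(&config, cluster_kernel, input, output);
}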
2.2. Writing CUDA SIMT Kernels
CUDA C++ kernels can largely be written in the same way that traditional CPU code would be written for a given problem. However, there are some unique features of the GPU that can be used to improve performance. Additionally, some understanding of how threads on the GPU are scheduled, how they access memory, and how their execution proceeds can help developers write kernels that maximize utilization of the available computing resources.
2.2.1. Basics of SIMT
From the developer's perspective, the CUDA thread is the fundamental unit of parallelism. Warps and SIMT describes the basic SIMT model of GPU execution and SIMT Execution Model provides additional details of the SIMT model. The SIMT model allows each thread to maintain its own state and control flow. From a functional perspective, each thread can execute a separate code path. However, substantial performance improvements can be realized by taking care that kernel code minimizes the situations where threads in the same warp take divergent code paths.
2.2.2. Thread Hierarchy
Threads are organized into thread blocks, which are then organized into a grid. Grids may be 1, 2,
or 3 dimensional and the size of the grid can be queried inside a kernel with the gridDim built-in
variable. Thread blocks may also be 1, 2, or 3 dimensional. The size of the thread block can be queried inside a kernel with the blockDim built-in variable. The index of the thread block can be queried with the blockIdx built-in variable. Within a thread block, the index of the thread is obtained using the threadIdx built-in variable. These built-in variables are used to compute a unique global thread index for each thread, thereby enabling each thread to load/store specific data from global memory and execute a unique code path as needed.
▶ gridDim.[x|y|z]: Size of the grid in the x, y and z dimension respectively. These values are set at kernel launch.
▶ blockDim.[x|y|z]: Size of the block in the x, y and z dimension respectively. These values are set at kernel launch.
▶ blockIdx.[x|y|z]: Index of the block in the x, y and z dimension respectively. These values change depending on which block is executing.
▶ threadIdx.[x|y|z]: Index of the thread in the x, y and z dimension respectively. These values change depending on which thread is executing.
The use of multi-dimensional thread blocks and grids is for convenience only and does not affect
performance. The threads of a block are linearized predictably: the first index x moves the fastest,
followed by y and then z. This means that in the linearization of the thread indices, consecutive values of threadIdx.x indicate consecutive threads, threadIdx.y has a stride of blockDim.x, and threadIdx.z has a stride of blockDim.x * blockDim.y. This affects how threads are assigned to warps, as detailed in Hardware Multithreading.
Figure 9 shows a simple example of a 2D grid, with 1D thread blocks.

Figure 9: Grid of Thread Blocks

2.2.3. GPU Device Memory Spaces
CUDA devices have several memory spaces that can be accessed by CUDA threads within kernels. Table 1 shows a summary of the common memory types, their thread scopes, and their lifetimes. The following sections explain each of these memory types in more detail.

Table 1: Memory Types, Scopes and Lifetimes

| Memory Type | Scope  | Lifetime    | Location |
| ----------- | ------ | ----------- | -------- |
| Global      | Grid   | Application | Device   |
| Constant    | Grid   | Application | Device   |
| Shared      | Block  | Kernel      | SM       |
| Local       | Thread | Kernel      | Device   |
| Register    | Thread | Kernel      | SM       |
2.2.3.1 Global Memory

Global memory (also called device memory) is the primary memory space for storing data that is accessible by all threads in a kernel. It is similar to RAM in a CPU system. Kernels running on the GPU have direct access to global memory in the same way code running on the CPU has access to system memory.

Global memory is persistent. That is, an allocation made in global memory and the data stored in it persist until the allocation is freed or until the application is terminated. cudaDeviceReset also frees all allocations.

Global memory is allocated with CUDA API calls such as cudaMalloc and cudaMallocManaged. Data can be copied into global memory from CPU memory using CUDA runtime API calls such as cudaMemcpy. Global memory allocations made with CUDA APIs are freed using cudaFree.

Prior to a kernel launch, global memory is allocated and initialized by CUDA API calls. During kernel execution, data from global memory can be read by the CUDA threads, and the result from operations carried out by CUDA threads can be written back to global memory. Once a kernel has completed execution, the results it wrote to global memory can be copied back to the host or used by other kernels on the GPU.

Because global memory is accessible by all threads in a grid, care must be taken to avoid data races between threads. Since CUDA kernels launched from the host have the return type void, the only way for numerical results computed by a kernel to be returned to the host is by writing those results to global memory.

A simple example illustrating the use of global memory is the vecAdd kernel below, where the three arrays A, B, and C are in global memory and are being accessed by this vector add kernel.
__global__ void vecAdd(float* A, float* B, float* C, int vectorLength)
{
    int workIndex = threadIdx.x + blockIdx.x*blockDim.x;
    if(workIndex < vectorLength)
    {
        C[workIndex] = A[workIndex] + B[workIndex];
    }
}
2.2.3.2 Shared Memory

Shared memory is a memory space that is accessible by all threads in a thread block. It is physically located on each SM and uses the same physical resource as the L1 cache, the unified data cache. The data in shared memory persists throughout the kernel execution. Shared memory can be considered a user-managed scratchpad for use during kernel execution. While small in size compared to global memory, because shared memory is located on each SM, the bandwidth is higher and the latency is lower than accessing global memory.

Since shared memory is accessible by all threads in a thread block, care must be taken to avoid data races between threads in the same thread block. Synchronization between threads in the same thread block can be achieved using the __syncthreads() function. This function blocks all threads in the thread block until all threads have reached the call to __syncthreads().
// assuming blockDim.x is 128
__global__ void example_syncthreads(int* input_data, int* output_data) {
    __shared__ int shared_data[128];

    // Every thread writes to a distinct element of 'shared_data':
    shared_data[threadIdx.x] = input_data[threadIdx.x];

    // All threads synchronize, guaranteeing all writes to 'shared_data' are ordered
    // before any thread is unblocked from '__syncthreads()':
    __syncthreads();

    // A single thread safely reads 'shared_data':
    if (threadIdx.x == 0) {
        int sum = 0;
        for (int i = 0; i < blockDim.x; ++i) {
            sum += shared_data[i];
        }
        output_data[blockIdx.x] = sum;
    }
}
The size of shared memory varies depending on the GPU architecture being used. Because shared memory and L1 cache share the same physical space, using shared memory reduces the size of the usable L1 cache for a kernel. Additionally, if no shared memory is used by the kernel, the entire physical space will be utilized by L1 cache. The CUDA runtime API provides functions to query the shared memory size on a per-SM basis and a per-thread-block basis, using the cudaGetDeviceProperties function and investigating the cudaDeviceProp.sharedMemPerMultiprocessor and cudaDeviceProp.sharedMemPerBlock device properties.
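The query described above can be written as follows (a minimal sketch; it assumes device 0 is a valid CUDA device and omits error checking for brevity):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
printf("Shared memory per SM:    %zu bytes\n", prop.sharedMemPerMultiprocessor);
printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);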
The CUDA runtime API provides a function, cudaFuncSetCacheConfig, to tell the runtime whether to allocate more space to shared memory or more space to L1 cache. This function specifies a preference to the runtime, but is not guaranteed to be honored. The runtime is free to make decisions based on the available resources and the needs of the kernel.

Shared memory can be allocated both statically and dynamically.
2.2.3.2.1 Static Allocation of Shared Memory

To allocate shared memory statically, the programmer must declare a variable inside the kernel using the __shared__ specifier. The variable will be allocated in shared memory and will persist for the duration of the kernel execution. The size of the shared memory declared in this way must be specified at compile time. For example, the following code snippet, located in the body of the kernel, declares a shared memory array of type float with 1024 elements.

__shared__ float sharedArray[1024];

After this declaration, all the threads in the thread block will have access to this shared memory array. Care must be taken to avoid data races between threads in the same thread block, typically with the use of __syncthreads().
2.2.3.2.2 Dynamic Allocation of Shared Memory

To allocate shared memory dynamically, the programmer can specify the desired amount of shared memory per thread block in bytes as the third (and optional) argument to the kernel launch in the triple chevron notation, like this: functionName<<<grid, block, sharedMemoryBytes>>>().

Then, inside the kernel, the programmer can use the extern __shared__ specifier to declare a variable that will be allocated dynamically at kernel launch.

extern __shared__ float sharedArray[];

One caveat is that if one wants multiple dynamically allocated shared memory arrays, the single extern __shared__ array must be partitioned manually using pointer arithmetic. For example, if one wants the equivalent of the following,

short array0[128];
float array1[64];
int array2[256];

in dynamically allocated shared memory, one could declare and initialize the arrays in the following way:

extern __shared__ float array[];
short* array0 = (short*)array;
float* array1 = (float*)&array0[128];
int* array2 = (int*)&array1[64];

Note that pointers need to be aligned to the type they point to, so the following code, for example, does not work since array1 is not aligned to 4 bytes.

extern __shared__ float array[];
short* array0 = (short*)array;
float* array1 = (float*)&array0[127];
2.2.3.3 Registers

Registers are located on the SM and have thread-local scope. Register usage is managed by the compiler and registers are used for thread-local storage during the execution of a kernel. The number of registers per SM and the number of registers per thread block can be queried using the regsPerMultiprocessor and regsPerBlock device properties of the GPU.

NVCC allows the developer to specify a maximum number of registers to be used by a kernel via the -maxrregcount option. Using this option to reduce the number of registers a kernel can use may result in more thread blocks being scheduled on the SM concurrently, but may also result in more register spilling.
2.2.3.4 Local Memory

Local memory is thread-local storage similar to registers and managed by NVCC, but the physical location of local memory is in the global memory space. The 'local' label refers to its logical scope, not its physical location. Local memory is used for thread-local storage during the execution of a kernel.

Automatic variables that the compiler is likely to place in local memory are:

▶ Arrays for which it cannot determine that they are indexed with constant quantities,
▶ Large structures or arrays that would consume too much register space,
▶ Any variable if the kernel uses more registers than available, that is, register spilling.

Because the local memory space resides in device memory, local memory accesses have the same latency and bandwidth as global memory accesses and are subject to the same requirements for memory coalescing as described in Coalesced Global Memory Access. Local memory is, however, organized such that consecutive 32-bit words are accessed by consecutive thread IDs. Accesses are therefore fully coalesced as long as all threads in a warp access the same relative address, such as the same index in an array variable or the same member in a structure variable.
2.2.3.5 Constant Memory

Constant memory has grid scope and is accessible for the lifetime of the application. Constant memory resides on the device and is read-only to the kernel. As such, it must be declared and initialized on the host with the __constant__ specifier, outside any function.

The __constant__ memory space specifier declares a variable that:

▶ Resides in constant memory space,
▶ Has the lifetime of the CUDA context in which it is created,
▶ Has a distinct object per device,
▶ Is accessible from all the threads within the grid and from the host through the runtime library (cudaGetSymbolAddress() / cudaGetSymbolSize() / cudaMemcpyToSymbol() / cudaMemcpyFromSymbol()).

The total amount of constant memory can be queried with the totalConstMem device property element.

Constant memory is useful for small amounts of data that each thread will use in a read-only fashion. Constant memory is small relative to other memories, typically 64 KB per device.
Anexamplesnippetofdeclaringandusingconstantmemoryfollows.
// In your .cu file
__constant__ float coeffs[4];

__global__ void compute(float *out) {
    int idx = threadIdx.x;
    out[idx] = coeffs[0] * idx + coeffs[1];
}

// In your host code
float h_coeffs[4] = {1.0f, 2.0f, 3.0f, 4.0f};
cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));
compute<<<1, 10>>>(device_out);
2.2.3.6 Caches

GPU devices have a multi-level cache structure which includes L2 and L1 caches.

The L2 cache is located on the device and is shared among all the SMs. The size of the L2 cache can be queried with the l2CacheSize device property element from the function cudaGetDeviceProperties.

As described above in Shared Memory, L1 cache is physically located on each SM and is the same physical space used by shared memory. If no shared memory is utilized by a kernel, the entire physical space will be utilized by the L1 cache.

The L2 and L1 caches can be controlled via functions that allow the developer to specify various caching behaviors. The details of these functions are found in Configuring L1/Shared Memory Balance, L2 Cache Control, and Low-Level Load and Store Functions.

If these hints are not used, the compiler and runtime will do their best to utilize the caches efficiently.
2.2.3.7 Texture and Surface Memory

Note
Some older CUDA code may use texture memory because, in older NVIDIA GPUs, doing so provided performance benefits in some scenarios. On all currently supported GPUs, these scenarios may be handled using direct load and store instructions, and use of texture and surface memory instructions no longer provides any performance benefit.

A GPU may have specialized instructions for loading data from an image to be used as textures in 3D rendering. CUDA exposes these instructions and the machinery to use them in the texture object API and the surface object API.

Texture and Surface memory are not discussed further in this guide as there is no advantage to using them in CUDA on any currently supported NVIDIA GPU. CUDA developers should feel free to ignore these APIs. For developers working on existing codebases which still use them, explanations of these APIs can still be found in the legacy CUDA C++ Programming Guide.
2.2.3.8 Distributed Shared Memory

Thread Block Clusters, introduced in compute capability 9.0 and facilitated by Cooperative Groups, provide the ability for threads in a thread block cluster to access the shared memory of all the participating thread blocks in that cluster. This partitioned shared memory is called Distributed Shared Memory, and the corresponding address space is called the Distributed Shared Memory address space. Threads that belong to a thread block cluster can read, write, or perform atomics in the distributed address space, regardless of whether the address belongs to the local thread block or a remote thread block. Whether or not a kernel uses distributed shared memory, the shared memory size specification, static or dynamic, is still per thread block. The size of distributed shared memory is just the number of thread blocks per cluster multiplied by the size of shared memory per thread block.
Accessing data in distributed shared memory requires all the thread blocks to exist. A user can guarantee that all thread blocks have started executing using cluster.sync() from class cluster_group. The user also needs to ensure that all distributed shared memory operations happen before the exit of a thread block, e.g., if a remote thread block is trying to read a given thread block's shared memory, the program needs to ensure that the shared memory read by the remote thread block is completed before it can exit.
Let's look at a simple histogram computation and how to optimize it on the GPU using a thread block cluster. A standard way of computing histograms is to perform the computation in the shared memory of each thread block and then perform global memory atomics. A limitation of this approach is the shared memory capacity. Once the histogram bins no longer fit in the shared memory, a user needs to directly compute histograms, and hence the atomics, in the global memory. With distributed shared memory, CUDA provides an intermediate step, where depending on the number of histogram bins, the histogram can be computed in shared memory, distributed shared memory, or global memory directly.

The CUDA kernel example below shows how to compute histograms in shared memory or distributed shared memory, depending on the number of histogram bins.
#include <cooperative_groups.h>

// Distributed Shared memory histogram kernel
__global__ void clusterHist_kernel(int *bins, const int nbins, const int bins_per_block,
                                   const int *__restrict__ input, size_t array_size)
{
    extern __shared__ int smem[];
    namespace cg = cooperative_groups;
    int tid = cg::this_grid().thread_rank();

    // Cluster initialization, size and calculating local bin offsets.
    cg::cluster_group cluster = cg::this_cluster();
    unsigned int clusterBlockRank = cluster.block_rank();
    int cluster_size = cluster.dim_blocks().x;

    for (int i = threadIdx.x; i < bins_per_block; i += blockDim.x)
    {
        smem[i] = 0; // Initialize shared memory histogram to zeros
    }

    // cluster synchronization ensures that shared memory is initialized to zero in
    // all thread blocks in the cluster. It also ensures that all thread blocks
    // have started executing and they exist concurrently.
    cluster.sync();

    for (int i = tid; i < array_size; i += blockDim.x * gridDim.x)
    {
        int ldata = input[i];

        // Find the right histogram bin.
        int binid = ldata;
        if (ldata < 0)
            binid = 0;
        else if (ldata >= nbins)
            binid = nbins - 1;

        // Find destination block rank and offset for computing
        // distributed shared memory histogram
        int dst_block_rank = (int)(binid / bins_per_block);
        int dst_offset = binid % bins_per_block;

        // Pointer to target block shared memory
        int *dst_smem = cluster.map_shared_rank(smem, dst_block_rank);

        // Perform atomic update of the histogram bin
        atomicAdd(dst_smem + dst_offset, 1);
    }

    // cluster synchronization is required to ensure all distributed shared
    // memory operations are completed and no thread block exits while
    // other thread blocks are still accessing distributed shared memory
    cluster.sync();

    // Perform global memory histogram, using the local distributed memory histogram
    int *lbins = bins + cluster.block_rank() * bins_per_block;
    for (int i = threadIdx.x; i < bins_per_block; i += blockDim.x)
    {
        atomicAdd(&lbins[i], smem[i]);
    }
}
The above kernel can be launched at runtime with a cluster size depending on the amount of distributed shared memory required. If the histogram is small enough to fit in the shared memory of just one block, the user can launch the kernel with cluster size 1. The code snippet below shows how to launch a cluster kernel dynamically based on shared memory requirements.
// Launch via extensible launch
{
    cudaLaunchConfig_t config = {0};
    config.gridDim = array_size / threads_per_block;
    config.blockDim = threads_per_block;

    // cluster_size depends on the histogram size.
    // ( cluster_size == 1 ) implies no distributed shared memory, just thread
    // block local shared memory
    int cluster_size = 2; // size 2 is an example here
    int nbins_per_block = nbins / cluster_size;

    // dynamic shared memory size is per block.
    // Distributed shared memory size = cluster_size * nbins_per_block * sizeof(int)
    config.dynamicSmemBytes = nbins_per_block * sizeof(int);

    CUDA_CHECK(::cudaFuncSetAttribute((void *)clusterHist_kernel,
               cudaFuncAttributeMaxDynamicSharedMemorySize, config.dynamicSmemBytes));

    cudaLaunchAttribute attribute[1];
    attribute[0].id = cudaLaunchAttributeClusterDimension;
    attribute[0].val.clusterDim.x = cluster_size;
    attribute[0].val.clusterDim.y = 1;
    attribute[0].val.clusterDim.z = 1;
    config.numAttrs = 1;
    config.attrs = attribute;

    cudaLaunchKernelEx(&config, clusterHist_kernel, bins, nbins, nbins_per_block,
                       input, array_size);
}
2.2.4. Memory Performance

Ensuring proper memory usage is key to achieving high performance in CUDA kernels. This section discusses some general principles and examples for achieving high memory throughput in CUDA kernels.
2.2.4.1 Coalesced Global Memory Access

Global memory is accessed via 32-byte memory transactions. When a CUDA thread requests a word of data from global memory, the relevant warp coalesces the memory requests from all the threads in that warp into the number of memory transactions necessary to satisfy the request, depending on the size of the word accessed by each thread and the distribution of the memory addresses across the threads. For example, if a thread requests a 4-byte word, the actual memory transaction the warp will generate to global memory will be 32 bytes in total. To use the memory system most efficiently, the warp should use all the memory that is fetched in a single memory transaction. That is, if a thread is requesting a 4-byte word from global memory, and the transaction size is 32 bytes, and the other threads in that warp can use the other 4-byte words of data from that 32-byte request, this will result in the most efficient use of the memory system.

As a simple example, if consecutive threads in the warp request consecutive 4-byte words in memory, then the warp will request 128 bytes of memory total, and these 128 bytes will be fetched in four 32-byte memory transactions. This results in 100% utilization of the memory system. That is, 100% of the memory traffic is utilized by the warp. Figure 10 illustrates this example of perfectly coalesced memory access.

Figure 10: Coalesced memory access
Conversely, the pathologically worst-case scenario is when consecutive threads access data elements that are 32 bytes or more apart from each other in memory. In this case, the warp will be forced to issue a 32-byte memory transaction for each thread, and the total number of bytes of memory traffic will be 32 bytes times 32 threads/warp = 1024 bytes. However, the amount of memory used will be only 128 bytes (4 bytes for each thread in the warp), so the memory utilization will only be 128/1024 = 12.5%. This is a very inefficient use of the memory system. Figure 11 illustrates this example of uncoalesced memory access.

Figure 11: Uncoalesced memory access
The most straightforward way to achieve coalesced memory access is for consecutive threads to access consecutive elements in memory. For example, for a kernel launched with 1d thread blocks, the following VecAdd kernel will achieve coalesced memory access. Notice how thread workIndex accesses the three arrays, and consecutive threads (indicated by consecutive values of workIndex) access consecutive elements in the arrays.
__global__ void vecAdd(float* A, float* B, float* C, int vectorLength)
{
    int workIndex = threadIdx.x + blockIdx.x*blockDim.x;
    if(workIndex < vectorLength)
    {
        C[workIndex] = A[workIndex] + B[workIndex];
    }
}
There is no requirement that consecutive threads access consecutive elements of memory to achieve coalesced memory access; it is merely the typical way coalescing is achieved. Coalesced memory access occurs provided all the threads in the warp access elements from the same 32-byte segments of memory in some linear or permuted way. Stated another way, the goal of coalesced memory access is to maximize the ratio of bytes used to bytes transferred.

Note
Ensuring proper coalescing of global memory accesses is one of the most important performance considerations for writing performant CUDA kernels. It is imperative that applications use the memory system as efficiently as possible.
2.2.4.1.1 Matrix Transpose Example Using Global Memory

As a simple example, consider an out-of-place matrix transpose kernel that transposes a 32-bit float square matrix of size N x N, from matrix a to matrix c. This example uses a 2d grid, and assumes a launch of 2d thread blocks of size 32 x 32 threads, that is, blockDim.x = 32 and blockDim.y = 32, so each 2d thread block will operate on a 32 x 32 tile of the matrix. Each thread operates on a unique element of the matrix, so no explicit synchronization of threads is necessary. Figure 12 illustrates this matrix transpose operation. The kernel source code follows the figure.

Figure 12: Matrix Transpose using Global memory

The labels on the top and left of each matrix are the 2d thread block indices and can also be considered the tile indices, where each small square indicates a tile of the matrix that will be operated on by a 2d thread block. In this example, the tile size is 32 x 32 elements, so each of the small squares represents a 32 x 32 tile of the matrix. The green shaded square shows the location of an example tile before and after the transpose operation.
/* macro to index a 1D memory array with 2D indices in row-major order */
/* ld is the leading dimension, i.e. the number of columns in the matrix */
#define INDX( row, col, ld ) ( ( (row) * (ld) ) + (col) )

/* CUDA kernel for naive matrix transpose */
__global__ void naive_cuda_transpose(int m, float *a, float *c )
{
    int myCol = blockDim.x * blockIdx.x + threadIdx.x;
    int myRow = blockDim.y * blockIdx.y + threadIdx.y;
    if( myRow < m && myCol < m )
    {
        c[INDX( myCol, myRow, m )] = a[INDX( myRow, myCol, m )];
    } /* end if */
    return;
} /* end naive_cuda_transpose */
To determine whether this kernel is achieving coalesced memory access, one needs to determine whether consecutive threads are accessing consecutive elements of memory. In a 2d thread block, the x index moves the fastest, so consecutive values of threadIdx.x should be accessing consecutive elements of memory. threadIdx.x appears in myCol, and one can observe that when myCol is the second argument to the INDX macro, consecutive threads are reading consecutive values of a, so the read of a is perfectly coalesced.

However, the writing of c is not coalesced, because consecutive values of threadIdx.x (again, examine myCol) are writing elements to c that are ld (leading dimension) elements apart from each other. This is observed because now myCol is the first argument to the INDX macro, and as the first argument to INDX increments by 1, the memory location changes by ld. When ld is larger than 32 (which occurs whenever the matrix sizes are larger than 32), this is equivalent to the pathological case shown in Figure 11.

To alleviate these uncoalesced writes, shared memory can be employed, as described in the next section.
2.2.4.2 Shared Memory Access Patterns

Shared memory has 32 banks that are organized such that successive 32-bit words map to successive banks. Each bank has a bandwidth of 32 bits per clock cycle.

When multiple threads in the same warp attempt to access different elements in the same bank, a bank conflict occurs. In this case, the access to the data in that bank will be serialized until the data in that bank has been obtained by all the threads that have requested it. This serialization of access results in a performance penalty.

The two exceptions to this scenario happen when multiple threads in the same warp are accessing (either reading or writing) the same shared memory location. For read accesses, the word is broadcast to the requesting threads. For write accesses, each shared memory address is written by only one of the threads (which thread performs the write is undefined).
Figure 13 shows some examples of strided access. The red box inside the bank indicates a unique location in shared memory.

Figure 13: Strided Shared Memory Accesses in 32-bit bank size mode.
Left
Linear addressing with a stride of one 32-bit word (no bank conflict).
Middle
Linear addressing with a stride of two 32-bit words (two-way bank conflict).
Right
Linear addressing with a stride of three 32-bit words (no bank conflict).

Figure 14 shows some examples of memory read accesses that involve the broadcast mechanism. The red box inside the bank indicates a unique location in shared memory. If multiple arrows point to the same location, the data is broadcast to all threads that requested it.

Note
Avoiding bank conflicts is an important performance consideration for writing performant CUDA kernels that use shared memory.
2.2.4.2.1 Matrix Transpose Example Using Shared Memory

In the previous example, Matrix Transpose Example Using Global Memory, a naive implementation of matrix transpose was illustrated that was functionally correct, but not optimized for efficient use of global memory because the write of the c matrix was not coalesced properly. In this example, shared memory will be treated as a user-managed cache to stage loads and stores from global memory, resulting in coalesced global memory access for both reads and writes.
Example

/* definitions of thread block size in X and Y directions */
#define THREADS_PER_BLOCK_X 32
#define THREADS_PER_BLOCK_Y 32

/* macro to index a 1D memory array with 2D indices in row-major order */
/* ld is the leading dimension, i.e. the number of columns in the matrix */
#define INDX( row, col, ld ) ( ( (row) * (ld) ) + (col) )

/* CUDA kernel for shared memory matrix transpose */
__global__ void smem_cuda_transpose(int m, float *a, float *c )
{
    /* declare a statically allocated shared memory array */
    __shared__ float smemArray[THREADS_PER_BLOCK_X][THREADS_PER_BLOCK_Y];

    /* determine my row tile and column tile index */
    const int tileCol = blockDim.x * blockIdx.x;
    const int tileRow = blockDim.y * blockIdx.y;

    /* read from global memory into shared memory array */
    smemArray[threadIdx.x][threadIdx.y] = a[INDX( tileRow + threadIdx.y, tileCol + threadIdx.x, m )];

    /* synchronize the threads in the thread block */
    __syncthreads();

    /* write the result from shared memory to global memory */
    c[INDX( tileCol + threadIdx.y, tileRow + threadIdx.x, m )] = smemArray[threadIdx.y][threadIdx.x];

    return;
} /* end smem_cuda_transpose */

Figure 14: Irregular Shared Memory Accesses.
Left
Conflict-free access via random permutation.
Middle
Conflict-free access since threads 3, 4, 6, 7, and 9 access the same word within bank 5.
Right
Conflict-free broadcast access (threads access the same word within a bank).
Example with array checks

/* definitions of thread block size in X and Y directions */
#define THREADS_PER_BLOCK_X 32
#define THREADS_PER_BLOCK_Y 32

/* macro to index a 1D memory array with 2D indices in column-major order */
/* ld is the leading dimension, i.e. the number of rows in the matrix */
#define INDX( row, col, ld ) ( ( (col) * (ld) ) + (row) )

/* CUDA kernel for shared memory matrix transpose */
__global__ void smem_cuda_transpose(int m,
                                    float *a,
                                    float *c )
{
    /* declare a statically allocated shared memory array */
    __shared__ float smemArray[THREADS_PER_BLOCK_X][THREADS_PER_BLOCK_Y];

    /* determine my row and column indices for the error checking code */
    const int myRow = blockDim.x * blockIdx.x + threadIdx.x;
    const int myCol = blockDim.y * blockIdx.y + threadIdx.y;

    /* determine my row tile and column tile index */
    const int tileX = blockDim.x * blockIdx.x;
    const int tileY = blockDim.y * blockIdx.y;

    if( myRow < m && myCol < m )
    {
        /* read from global memory into shared memory array */
        smemArray[threadIdx.x][threadIdx.y] = a[INDX( tileX + threadIdx.x, tileY + threadIdx.y, m )];
    } /* end if */

    /* synchronize the threads in the thread block */
    __syncthreads();

    if( myRow < m && myCol < m )
    {
        /* write the result from shared memory to global memory */
        c[INDX( tileY + threadIdx.x, tileX + threadIdx.y, m )] = smemArray[threadIdx.y][threadIdx.x];
    } /* end if */
    return;
} /* end smem_cuda_transpose */
The fundamental performance optimization illustrated in this example is to ensure that when accessing global memory, the memory accesses are coalesced properly. Prior to the execution of the copy, each thread computes its tileRow and tileCol indices. These are the indices for the specific tile that will be operated on, and these tile indices are based on which thread block is executing. Each thread in the same thread block has the same tileRow and tileCol values, so they can be thought of as the starting position of the tile that this specific thread block will operate on.

The kernel then proceeds with each thread block copying a 32 x 32 tile of the matrix from global memory to shared memory with the following statement. Since the size of a warp is 32 threads, this copy operation will be executed by 32 warps, with no guaranteed order between the warps.

smemArray[threadIdx.x][threadIdx.y] = a[INDX( tileRow + threadIdx.y, tileCol + threadIdx.x, m )];

Note that because threadIdx.x appears in the second argument to INDX, consecutive threads are accessing consecutive elements in memory, and the read of a is perfectly coalesced.
The next step in the kernel is the call to the __syncthreads() function. This ensures that all threads in the thread block have completed their execution of the previous code before proceeding, and therefore that the copy of a into shared memory is completed before the next step. This is critically important because the next step will involve threads reading from shared memory. Without the __syncthreads() call, the copy of a into shared memory would not be guaranteed to be completed by all the warps in the thread block before some warps advance further in the code.

At this point in the kernel, for each thread block, smemArray holds a 32 x 32 tile of the matrix, arranged in the same order as the original matrix. To ensure that the elements within the tile are transposed properly, threadIdx.x and threadIdx.y are swapped when threads read smemArray. To ensure that the overall tile is placed in the correct place in c, the tileRow and tileCol indices are also swapped when threads write to c. To ensure proper coalescing, threadIdx.x is used in the second argument to INDX, as shown by the statement below.

c[INDX( tileCol + threadIdx.y, tileRow + threadIdx.x, m )] = smemArray[threadIdx.y][threadIdx.x];
This kernel illustrates two common uses of shared memory.

▶ Shared memory is used to stage data from global memory to ensure that reads from and writes to global memory are both coalesced properly.
▶ Shared memory is used to allow threads in the same thread block to share data among themselves.
2.2.4.2.2 Shared Memory Bank Conflicts

In Section 2.2.4.2, the bank structure of shared memory was described. In the previous matrix transpose example, proper coalesced memory access to/from global memory was achieved, but no consideration was given to whether shared memory bank conflicts were present. Consider the following 2d shared memory declaration,

__shared__ float smemArray[32][32];
Since a warp is 32 threads, each thread in the same warp will have a fixed value for threadIdx.y and will have 0 <= threadIdx.x < 32.

The left panel of Figure 15 illustrates the situation when the threads in a warp access the data in a column of smemArray. Warp 0 is accessing memory locations smemArray[0][0] through smemArray[31][0]. In C++ multi-dimensional array ordering, the last index moves the fastest, so consecutive threads in warp 0 are accessing memory locations that are 32 elements apart. As illustrated in the figure, the colors denote the banks, and this access down the entire column by warp 0 results in a 32-way bank conflict.

The right panel of Figure 15 illustrates the situation when the threads in a warp access the data across a row of smemArray. Warp 0 is accessing memory locations smemArray[0][0] through smemArray[0][31]. In this case, consecutive threads in warp 0 are accessing memory locations that are adjacent. As illustrated in the figure, the colors denote the banks, and this access across the entire row by warp 0 results in no bank conflicts. The ideal scenario is for each thread in a warp to access a shared memory location with a different color.

Figure 15: Bank structure in a 32 x 32 shared memory array.
The numbers in the boxes indicate the warp index. The colors indicate which bank is associated with that shared memory location.
Returning to the example from Section 2.2.4.2.1, one can examine the usage of shared memory to determine whether bank conflicts are present. The first usage of shared memory is when data from global memory is stored to shared memory:

smemArray[threadIdx.x][threadIdx.y] = a[INDX( tileRow + threadIdx.y, tileCol + threadIdx.x, m )];

Because C++ arrays are stored in row-major order, consecutive threads in the same warp, as indicated by consecutive values of threadIdx.x, will access smemArray with a stride of 32 elements, because threadIdx.x is the first index into the array. This results in a 32-way bank conflict and is illustrated by the left panel of Figure 15.

The second usage of shared memory is when data from shared memory is written back to global memory:
c[INDX( tileCol + threadIdx.y, tileRow + threadIdx.x, m )] = smemArray[threadIdx.y][threadIdx.x];

In this case, because threadIdx.x is the second index into the smemArray array, consecutive threads in the same warp will access smemArray with a stride of 1 element. This results in no bank conflicts and is illustrated by the right panel of Figure 15.

The matrix transpose kernel as illustrated in Section 2.2.4.2.1 has one access of shared memory that has no bank conflicts and one access that has a 32-way bank conflict. A common fix to avoid bank conflicts is to pad the shared memory by adding one to the column dimension of the array as follows:

__shared__ float smemArray[THREADS_PER_BLOCK_X][THREADS_PER_BLOCK_Y+1];

This minor adjustment to the declaration of smemArray will eliminate the bank conflicts. To illustrate this, consider Figure 16, where the shared memory array has been declared with a size of 32 x 33. One observes that whether the threads in the same warp access the shared memory array down an entire column or across an entire row, the bank conflicts have been eliminated, i.e., the threads in the same warp access locations with different colors.

Figure 16: Bank structure in a 32 x 33 shared memory array.
The numbers in the boxes indicate the warp index. The colors indicate which bank is associated with that shared memory location.
2.2.5. Atomics
Performant CUDA kernels rely on expressing as much algorithmic parallelism as possible. The asynchronous nature of GPU kernel execution requires that threads operate as independently as possible. It's not always possible to have complete independence of threads, and as we saw in Shared Memory, there exists a mechanism for threads in the same thread block to exchange data and synchronize.
At the level of an entire grid there is no such mechanism to synchronize all threads in a grid. There is, however, a mechanism to provide synchronized access to global memory locations via the use of atomic functions. Atomic functions allow a thread to obtain a lock on a global memory location and perform a read-modify-write operation on that location. No other thread can access the same location while the lock is held. CUDA provides atomics with the same behavior as the C++ standard library atomics as cuda::std::atomic and cuda::std::atomic_ref. CUDA also provides extended C++ atomics cuda::atomic and cuda::atomic_ref which allow the user to specify the thread scope of the atomic operation. The details of atomic functions are covered in Atomic Functions.
An example usage of cuda::atomic_ref to perform a device-wide atomic addition is as follows, where array is an array of floats, and result is a float pointer to the location in global memory where the sum of the array will be stored.

__global__ void sumReduction(int n, float *array, float *result) {
    ...
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    cuda::atomic_ref<float, cuda::thread_scope_device> result_ref(*result);
    result_ref.fetch_add(array[tid]);
    ...
}
Atomic functions should be used sparingly as they enforce thread synchronization that can impact performance.
2.2.6. Cooperative Groups
Cooperative groups is a software tool available in CUDA C++ that allows applications to define groups of threads which can synchronize with each other, even if that group of threads spans multiple thread blocks, multiple grids on a single GPU, or even multiple GPUs. The CUDA programming model in general allows threads within a thread block or thread block cluster to synchronize efficiently, but does not provide a mechanism for specifying thread groups smaller than a thread block or cluster. Similarly, the CUDA programming model does not provide mechanisms or guarantees that enable synchronization across thread blocks.
Cooperative groups provide both of these capabilities through software. Cooperative groups allow the application to create thread groups that cross the boundary of thread blocks and clusters, though doing so comes with some semantic limitations and performance implications which are described in detail in the feature section covering cooperative groups.
2.2.7. Kernel Launch and Occupancy
When a CUDA kernel is launched, CUDA threads are grouped into thread blocks and a grid based on the execution configuration specified at kernel launch. Once the kernel is launched, the scheduler assigns thread blocks to SMs. The details of which thread blocks are scheduled to execute on which SMs cannot be controlled or queried by the application and no ordering guarantees are made by the scheduler, so programs cannot rely on a specific scheduling order or scheme for correct execution.
The number of blocks that can be scheduled on an SM depends on the hardware resources a given thread block requires, and the hardware resources available on the SM. When a kernel is first launched, the scheduler begins assigning thread blocks to SMs. As long as SMs have sufficient hardware resources unoccupied by other thread blocks, the scheduler will continue assigning thread blocks to SMs. If at some point no SM has the capacity to accept another thread block, the scheduler will wait until the SMs complete previously assigned thread blocks. Once this happens, SMs are free to accept more work, and the scheduler assigns thread blocks to them. This process continues until all thread blocks have been scheduled and executed.
The cudaGetDeviceProperties function allows an application to query the limits of each SM via device properties. Note that there are limits per SM and per thread block.
▶ maxBlocksPerMultiProcessor: The maximum number of resident blocks per SM.
▶ sharedMemPerMultiprocessor: The amount of shared memory available per SM in bytes.
▶ regsPerMultiprocessor: The number of 32-bit registers available per SM.
▶ maxThreadsPerMultiProcessor: The maximum number of resident threads per SM.
▶ sharedMemPerBlock: The maximum amount of shared memory that can be allocated by a thread block in bytes.
▶ regsPerBlock: The maximum number of 32-bit registers that can be allocated by a thread block.
▶ maxThreadsPerBlock: The maximum number of threads per thread block.
The occupancy of a CUDA kernel is the ratio of the number of active warps to the maximum number of active warps supported by the SM. In general, it's a good practice to have occupancy as high as possible, which hides latency and increases performance.
To calculate occupancy, one needs to know the resource limits of the SM, which were just described, and one needs to know what resources are required by the CUDA kernel in question. To determine resource usage on a per-kernel basis, during program compilation one can use the --resource-usage option to nvcc, which will show the number of registers and shared memory required by the kernel.
To illustrate, consider a device such as compute capability 10.0 with the device properties enumerated in Table 2.
Table 2: SM Resource Example

| Resource                    | Value  |
| --------------------------- | ------ |
| maxBlocksPerMultiProcessor  | 32     |
| sharedMemPerMultiprocessor  | 233472 |
| regsPerMultiprocessor       | 65536  |
| maxThreadsPerMultiProcessor | 2048   |
| sharedMemPerBlock           | 49152  |
| regsPerBlock                | 65536  |
| maxThreadsPerBlock          | 1024   |
If a kernel was launched as testKernel<<<512, 768>>>(), i.e., 768 threads per block, each SM would only be able to execute 2 thread blocks at a time. The scheduler cannot assign more than 2 thread blocks per SM because the maxThreadsPerMultiProcessor is 2048. So the occupancy would be (768*2)/2048, or 75%.
If a kernel was launched as testKernel<<<512, 32>>>(), i.e., 32 threads per block, each SM would not run into a limit on maxThreadsPerMultiProcessor, but since the maxBlocksPerMultiProcessor is 32, the scheduler would only be able to assign 32 thread blocks to each SM. Since the number of threads in the block is 32, the total number of threads resident on the SM would be 32 blocks * 32 threads per block, or 1024 total threads. Since a compute capability 10.0 SM has a maximum value of 2048 resident threads per SM, the occupancy in this case is 1024/2048, or 50%.
The same analysis can be done with shared memory. If a kernel uses 100KB of shared memory, for example, the scheduler would only be able to assign 2 thread blocks to each SM, because the third thread block on that SM would require another 100KB of shared memory for a total of 300KB, which is more than the 233472 bytes available per SM.
Threads per block and shared memory usage per block are explicitly controlled by the programmer and can be adjusted to achieve the desired occupancy. The programmer has limited control over register usage as the compiler and runtime will attempt to optimize register usage. However, the programmer can specify a maximum number of registers per thread via the --maxrregcount option to nvcc. If the kernel needs more registers than this specified amount, the kernel is likely to spill to local memory, which will change the performance characteristics of the kernel. In some cases, even though spilling occurs, limiting registers allows more thread blocks to be scheduled, which in turn increases occupancy and may result in a net increase in performance.
2.3. Asynchronous Execution
2.3.1. What is Asynchronous Concurrent Execution?
CUDA allows concurrent, or overlapping, execution of multiple tasks, specifically:
▶ computation on the host
▶ computation on the device
▶ memory transfers from the host to the device
▶ memory transfers from the device to the host
▶ memory transfers within the memory of a given device
▶ memory transfers among devices
The concurrency is expressed via an asynchronous interface, where a dispatching function call or kernel launch returns immediately. Asynchronous calls usually return before the dispatched operation has completed and may return before the asynchronous operation has started. The application is then free to perform other tasks at the same time as the originally dispatched operation. When the final results of the initially dispatched operation are needed, the application must perform some form of synchronization to ensure that the operation in question has completed. A typical example of a concurrent execution pattern is the overlapping of host and device memory transfers with computation, thus reducing or eliminating their overhead.
Figure 17: Asynchronous Concurrent Execution with CUDA streams
In general, asynchronous interfaces typically provide three main ways to synchronize with the dispatched operation:
▶ a blocking approach, where the application calls a function that blocks, or waits, until the operation has completed
▶ a non-blocking approach, or polling approach, where the application calls a function that returns immediately and supplies information about the status of the operation
▶ a callback approach, where a pre-registered function is executed when the operation has completed.
While the programming interfaces are asynchronous, the actual ability to carry out various operations concurrently will depend on the version of CUDA and the compute capability of the hardware being used; these details are left to a later section of this guide (see Compute Capabilities).
In Synchronizing CPU and GPU, the CUDA runtime function cudaDeviceSynchronize() was introduced, which is a blocking call that waits for all previously issued work to complete. The reason the cudaDeviceSynchronize() call was needed is that the kernel launch is asynchronous and returns immediately. CUDA provides an API for both blocking and non-blocking approaches to synchronization and even supports the use of host-side callback functions.
The core API components for asynchronous execution in CUDA are CUDA Streams and CUDA Events. In the rest of this section we will explain how these elements can be used to express asynchronous execution in CUDA.
A related topic is that of CUDA Graphs, which allow a graph of asynchronous operations to be defined up front and then executed repeatedly with minimal overhead. We cover CUDA Graphs at a very introductory level in section 2.4.9.2 Introduction to CUDA Graphs with Stream Capture, and a more comprehensive discussion is provided in section 4.1 CUDA Graphs.
2.3.2. CUDA Streams
At the most basic level, a CUDA stream is an abstraction which allows the programmer to express a sequence of operations. A stream operates like a work queue into which programs can add operations, such as memory copies or kernel launches, to be executed in order. Operations at the front of the queue for a given stream are executed and then dequeued, allowing the next queued operation to come to the front and be considered for execution. The order of execution of operations in a stream is sequential: the operations are executed in the order they are enqueued into the stream.
An application may use multiple streams simultaneously. In such cases, the runtime will select a task to execute from the streams that have work available, depending on the state of the GPU resources. Streams may be assigned a priority which acts as a hint to the runtime to influence the scheduling, but does not guarantee a specific order of execution.
The API function calls and kernel launches operating in a stream are asynchronous with respect to the host thread. Applications can synchronize with a stream by waiting for it to be empty of tasks, or they can also synchronize at the device level.
CUDA has a default stream, and operations and kernel launches without a specific stream are queued into this default stream. Code examples which do not specify a stream are using this default stream implicitly. The default stream has some specific semantics which are discussed in the subsection Blocking and non-blocking streams and the default stream.
2.3.2.1 Creating and Destroying CUDA Streams
CUDA streams can be created using the cudaStreamCreate() function. The function call initializes the stream handle which can be used to identify the stream in subsequent function calls.

cudaStream_t stream;        // Stream handle
cudaStreamCreate(&stream);  // Create a new stream

// stream based operations ...

cudaStreamDestroy(stream);  // Destroy the stream

If the device is still doing work in stream stream when the application calls cudaStreamDestroy(), the stream will complete all the work in the stream before being destroyed.
2.3.2.2 Launching Kernels in CUDA Streams
The usual triple-chevron syntax for launching a kernel can also be used to launch a kernel into a specific stream. The stream is specified as an extra parameter to the kernel launch. In the following example the kernel named kernel is launched into the stream with handle stream, which is of type cudaStream_t and is assumed to have been created previously:

kernel<<<grid, block, shared_mem_size, stream>>>(...);

The kernel launch is asynchronous and the function call returns immediately. Assuming that the kernel launch is successful, the kernel will execute in the stream stream and the application is free to perform other tasks on the CPU or in other streams on the GPU while the kernel is executing.
2.3.2.3 Launching Memory Transfers in CUDA Streams
To launch a memory transfer into a stream, we can use the function cudaMemcpyAsync(). This function is similar to the cudaMemcpy() function, but it takes an additional parameter specifying the stream to use for the memory transfer. The function call in the code block below copies size bytes from the host memory pointed to by src to the device memory pointed to by dst in the stream stream.

// Copy `size` bytes from `src` to `dst` in stream `stream`
cudaMemcpyAsync(dst, src, size, cudaMemcpyHostToDevice, stream);

Like other asynchronous function calls, this function call returns immediately, whereas the cudaMemcpy() function blocks until the memory transfer is complete. In order to access the results of the transfer safely, the application must determine that the operation has completed using some form of synchronization.
Other CUDA memory transfer functions such as cudaMemcpy2D() also have asynchronous variants.
Note
In order for memory copies involving CPU memory to be carried out asynchronously, the host buffers must be pinned (page-locked). cudaMemcpyAsync() will function correctly if host memory which is not pinned is used, but it will revert to synchronous behavior which will not overlap with other work. This can inhibit the performance benefits of using asynchronous memory transfers. It is recommended that programs use cudaMallocHost() to allocate buffers which will be used to send or receive data from GPUs.
2.3.2.4 Stream Synchronization
The simplest way to synchronize with a stream is to wait for the stream to be empty of tasks. This can be done in two ways, using the cudaStreamSynchronize() function or the cudaStreamQuery() function.
The cudaStreamSynchronize() function will block until all the work in the stream has completed.

// Wait for the stream to be empty of tasks
cudaStreamSynchronize(stream);
// At this point the stream is done
// and we can access the results of stream operations safely
If we prefer not to block, but just need a quick check to see if the stream is empty, we can use the cudaStreamQuery() function.

// Have a peek at the stream
// returns cudaSuccess if the stream is empty
// returns cudaErrorNotReady if the stream is not empty
cudaError_t status = cudaStreamQuery(stream);
switch (status) {
    case cudaSuccess:
        // The stream is empty
        std::cout << "The stream is empty" << std::endl;
        break;
    case cudaErrorNotReady:
        // The stream is not empty
        std::cout << "The stream is not empty" << std::endl;
        break;
    default:
        // An error occurred - we should handle this
        break;
}
2.3.3. CUDA Events
CUDA events are a mechanism for inserting markers into a CUDA stream. They are essentially like tracer particles that can be used to track the progress of tasks in a stream. Imagine launching two kernels into a stream. Without such tracking events, we would only be able to determine whether the stream is empty or not. If we had an operation that was dependent on the output of the first kernel, we would not be able to start that operation safely until we knew the stream was empty, by which time both kernels would have completed.
Using CUDA events we can do better. By enqueuing an event into a stream directly after the first kernel, but before the second kernel, we can wait for this event to come to the front of the stream. Then, we can safely start our dependent operation knowing that the first kernel has completed, but before the second kernel has started. Using CUDA events in this way can build up a graph of dependencies between operations and streams. This graph analogy translates directly into the later discussion of CUDA graphs.
CUDA events also record timing information which can be used to time kernel launches and memory transfers.
2.3.3.1 Creating and Destroying CUDA Events
CUDA Events can be created and destroyed using the cudaEventCreate() and cudaEventDestroy() functions.

cudaEvent_t event;
// Create the event
cudaEventCreate(&event);

// do some work involving the event ...

// Once the work is done and the event is no longer needed
// we can destroy the event
cudaEventDestroy(event);

The application is responsible for destroying events when they are no longer needed.
2.3.3.2 Inserting Events into CUDA Streams
CUDA Events can be inserted into a stream using the cudaEventRecord() function.

cudaEvent_t event;
cudaStream_t stream;
// Create the event
cudaEventCreate(&event);
// Insert the event into the stream
cudaEventRecord(event, stream);
2.3.3.3 Timing Operations in CUDA Streams
CUDA events can be used to time the execution of various stream operations including kernels. When an event reaches the front of a stream it records a timestamp. By surrounding a kernel in a stream with two events we can get an accurate timing of the duration of the kernel execution, as is shown in the code snippet below:

cudaStream_t stream;
cudaStreamCreate(&stream);

cudaEvent_t start;
cudaEvent_t stop;
// create the events
cudaEventCreate(&start);
cudaEventCreate(&stop);

// record the start event
cudaEventRecord(start, stream);
// launch the kernel
kernel<<<grid, block, 0, stream>>>(...);
// record the stop event
cudaEventRecord(stop, stream);

// wait for the stream to complete
// both events will have been triggered
cudaStreamSynchronize(stream);

// get the timing
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);
std::cout << "Kernel execution time: " << elapsedTime << " ms" << std::endl;

// clean up
cudaEventDestroy(start);
cudaEventDestroy(stop);
cudaStreamDestroy(stream);
2.3.3.4 Checking the Status of CUDA Events
As with checking the status of streams, we can check the status of events in either a blocking or a non-blocking way.
The cudaEventSynchronize() function will block until the event has completed. In the code snippet below we launch a kernel into a stream, followed by an event and then by a second kernel. We can use the cudaEventSynchronize() function to wait for the event after the first kernel to complete, and in principle launch a dependent task immediately, potentially before kernel2 finishes.
cudaEvent_t event;
cudaStream_t stream;
// create the stream
cudaStreamCreate(&stream);
// create the event
cudaEventCreate(&event);

// launch a kernel into the stream
kernel<<<grid, block, 0, stream>>>(...);
// Record the event
cudaEventRecord(event, stream);
// launch a kernel into the stream
kernel2<<<grid, block, 0, stream>>>(...);

// Wait for the event to complete
// Kernel 1 will be guaranteed to have completed
// and we can launch the dependent task.
cudaEventSynchronize(event);
dependentCPUtask();

// Wait for the stream to be empty
// Kernel 2 is guaranteed to have completed
cudaStreamSynchronize(stream);

// destroy the event
cudaEventDestroy(event);
// destroy the stream
cudaStreamDestroy(stream);
CUDA Events can be checked for completion in a non-blocking way using the cudaEventQuery() function. In the example below we launch 2 kernels into a stream. The first kernel, kernel1, generates some data which we would like to copy to the host; however, we also have some CPU-side work to do. In the code below, we enqueue kernel1 followed by an event (event) and then kernel2 into stream stream1. We then go into a CPU work loop, but occasionally take a peek to see if the event has completed, indicating that kernel1 is done. If so, we launch a device to host copy into stream stream2. This approach allows the overlap of the CPU work with the GPU kernel execution and the device to host copy.
cudaEvent_t event;
cudaStream_t stream1;
cudaStream_t stream2;
size_t size = LARGE_NUMBER;
float *d_data;
// Create some data
cudaMalloc(&d_data, size);
float *h_data = (float *)malloc(size);
// create the streams
cudaStreamCreate(&stream1);  // Processing stream
cudaStreamCreate(&stream2);  // Copying stream

bool copyStarted = false;
// create the event
cudaEventCreate(&event);

// launch kernel1 into the stream
kernel1<<<grid, block, 0, stream1>>>(d_data, size);
// enqueue an event following kernel1
cudaEventRecord(event, stream1);
// launch kernel2 into the stream
kernel2<<<grid, block, 0, stream1>>>();

// while the kernels are running do some work on the CPU
// but check if kernel1 has completed because then we will start
// a device to host copy in stream2
while ( not allCPUWorkDone() || not copyStarted ) {
    doNextChunkOfCPUWork();
    // peek to see if kernel 1 has completed
    // if so enqueue a non-blocking copy into stream2
    if ( not copyStarted ) {
        if( cudaEventQuery(event) == cudaSuccess ) {
            cudaMemcpyAsync(h_data, d_data, size, cudaMemcpyDeviceToHost, stream2);
            copyStarted = true;
        }
    }
}

// wait for both streams to be done
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);

// destroy the event
cudaEventDestroy(event);
// destroy the streams and free the data
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
cudaFree(d_data);
free(h_data);
2.3.4. Callback Functions from Streams
CUDA provides a mechanism for launching functions on the host from within a stream. There are currently two functions available for this purpose: cudaLaunchHostFunc() and cudaStreamAddCallback(). However, cudaStreamAddCallback() is slated for deprecation, so applications should use cudaLaunchHostFunc().
Using cudaLaunchHostFunc()
The signature of the cudaLaunchHostFunc() function is as follows:

cudaError_t cudaLaunchHostFunc(cudaStream_t stream, void (*func)(void *), void *data);

where
▶ stream: The stream to launch the callback function into.
▶ func: The callback function to launch.
▶ data: A pointer to the data to pass to the callback function.
The host function itself is a simple C function with the signature:

void hostFunction(void *data);

with the data parameter pointing to a user-defined data structure which the function can interpret.
There are some caveats to keep in mind when using callback functions like this. In particular, the host function may not call any CUDA APIs.
For the purposes of being used with unified memory, the following execution guarantees are provided:
▶ The stream is considered idle for the duration of the function's execution. Thus, for example, the function may always use memory attached to the stream it was enqueued in.
▶ The start of execution of the function has the same effect as synchronizing an event recorded in the same stream immediately prior to the function. It thus synchronizes streams which have been "joined" prior to the function.
▶ Adding device work to any stream does not have the effect of making the stream active until all preceding host functions and stream callbacks have executed. Thus, for example, a function might use global attached memory even if work has been added to another stream, if the work has been ordered behind the function call with an event.
▶ Completion of the function does not cause a stream to become active except as described above. The stream will remain idle if no device work follows the function, and will remain idle across consecutive host functions or stream callbacks without device work in between. Thus, for example, stream synchronization can be done by signaling from a host function at the end of the stream.
2.3.4.1 Using cudaStreamAddCallback()

Note
The cudaStreamAddCallback() function is slated for deprecation and removal and is discussed here for completeness and because it may still appear in existing code. Applications should use or switch to using cudaLaunchHostFunc().

The signature of the cudaStreamAddCallback() function is as follows:

cudaError_t cudaStreamAddCallback(cudaStream_t stream, cudaStreamCallback_t callback, void* userData, unsigned int flags);

where
▶ stream: The stream to launch the callback function into.
▶ callback: The callback function to launch.
▶ userData: A pointer to the data to pass to the callback function.
▶ flags: Currently, this parameter must be 0 for future compatibility.
The signature of the callback function is a little different from the case when we used the cudaLaunchHostFunc() function. In this case the callback function is a C function with the signature:

void callbackFunction(cudaStream_t stream, cudaError_t status, void *userData);

where the function is now passed
▶ stream: The stream handle from which the callback function was launched.
▶ status: The status of the stream operation that triggered the callback.
▶ userData: A pointer to the data that was passed to the callback function.
In particular the status parameter will contain the current error status of the stream, which may have been set by previous operations. Similarly to the cudaLaunchHostFunc() case, the stream will not be active and advance to tasks until the host function has completed, and no CUDA functions may be called from within the callback function.
2.3.4.2 AsynchronousErrorHandling
Inacudastream,errorsmayoriginatefromanyoperationinthestream,includingforkernellaunches
and memory transfers. These errors may not be propagated back to the user at run-time until the
streamissynchronized,forexample,bywaitingforaneventorcallingcudaStreamSynchronize().
Therearetwowaystofindoutabouterrorswhichmayhaveoccurredinastream.
▶ Using the function - this function returns and clears the last error en-
cudaGetLastError()
counteredinanystreaminthecurrentcontext. AnimmediatesecondcalltocudaGetLastError()
wouldreturncudaSuccessifnoothererroroccurredbetweenthetwocalls.
UsingthefunctioncudaPeekAtLastError()-thisfunctionreturnsthelasterrorinthecurrent
context,butdoesnotclearit.
BothofthesefunctionsreturntheerrorasavalueoftypecudaError_t. Printablenamesnamesof
theerrorscanbegeneratedusingthefunctionscudaGetErrorName()andcudaGetErrorString().
An example of using these functions is shown below:
Listing 1: Example of using cudaGetLastError() and cudaPeekAtLastError()

// Some work occurs in streams...
cudaStreamSynchronize(stream);

// Look at the last error but do not clear it
cudaError_t err = cudaPeekAtLastError();
if (err != cudaSuccess) {
    printf("Error with name: %s\n", cudaGetErrorName(err));
    printf("Error description: %s\n", cudaGetErrorString(err));
}

// Look at the last error and clear it
cudaError_t err2 = cudaGetLastError();
if (err2 != cudaSuccess) {
    printf("Error with name: %s\n", cudaGetErrorName(err2));
    printf("Error description: %s\n", cudaGetErrorString(err2));
}

if (err2 != err) {
    printf("As expected, cudaPeekAtLastError() did not clear the error\n");
}

// Check again
cudaError_t err3 = cudaGetLastError();
if (err3 == cudaSuccess) {
    printf("As expected, cudaGetLastError() cleared the error\n");
}
Tip
When an error appears at a synchronization, especially in a stream with many operations, it is often
difficult to pinpoint exactly where in the stream the error may have occurred. To debug such a
situation a useful trick may be to set the environment variable CUDA_LAUNCH_BLOCKING=1 and then
run the application. The effect of this environment variable is to synchronize after every single
kernel launch. This can aid in tracking down which kernel or transfer caused the error. Synchronization
can be expensive; applications may run substantially slower when this environment variable is set.
2.3.5. CUDA Stream Ordering
Now that we have discussed the basic mechanisms of streams, events and callback functions, it is
important to consider the ordering semantics of asynchronous operations in a stream. These semantics
allow application programmers to reason about the ordering of operations in a stream in a safe
way. There are some special cases where these semantics may be relaxed for purposes of performance
optimization, such as in the case of a Programmatic Dependent Kernel Launch scenario, which allows
the overlap of two kernels through the use of special attributes and kernel launch mechanisms, or
in the case of batching memory transfers using the cudaMemcpyBatchAsync() function, when the
runtime can perform non-overlapping batch copies concurrently. We will discuss these optimizations
later on.
Most importantly, CUDA streams are what are known as in-order streams. This means that the order
of execution of the operations in a stream is the same as the order in which those operations were
enqueued. An operation in a stream cannot leap-frog other operations. Memory operations (such as
copies) are tracked by the runtime and will always complete before the next operation begins, allowing
dependent kernels safe access to the data being transferred.
2.3.6. Blocking and non-blocking streams and the default stream
In CUDA there are two types of streams: blocking and non-blocking. The name can be a little
misleading, as the blocking and non-blocking semantics refer only to how the streams synchronize with
the default stream. By default, streams created with cudaStreamCreate() are blocking streams. In
order to create a non-blocking stream, the cudaStreamCreateWithFlags() function must be used
with the cudaStreamNonBlocking flag:
cudaStream_t stream;
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
Non-blocking streams can be destroyed in the usual way with cudaStreamDestroy().
2.3.6.1 Legacy Default Stream
The key difference between blocking and non-blocking streams is how they synchronize with the
default stream. CUDA provides a legacy default stream (also known as the NULL stream or the stream
with stream ID 0) which is used when no stream is specified in kernel launches or in blocking
cudaMemcpy() calls. This default stream, which is shared amongst all host threads, is a blocking stream.
When an operation is launched into this default stream, it will synchronize with all other blocking
streams; in other words, it will wait for all other blocking streams to complete before it can execute.
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
kernel1<<<grid, block, 0, stream1>>>(...);
kernel2<<<grid, block>>>(...);
kernel3<<<grid, block, 0, stream2>>>(...);
cudaDeviceSynchronize();
The default stream behavior means that in the above code snippet, kernel2 will wait for kernel1
to complete, and kernel3 will wait for kernel2 to complete, even if in principle all three kernels could
execute concurrently. By creating non-blocking streams we can avoid this synchronization behavior.
In the code snippet below we create two non-blocking streams. The default stream will no longer
synchronize with these streams, and in principle all three kernels could execute concurrently. As such we
cannot assume any ordering of execution of the kernels and should perform explicit synchronization
(such as with the rather heavy-handed cudaDeviceSynchronize() call) in order to ensure that the
kernels have completed.
cudaStream_t stream1, stream2;
cudaStreamCreateWithFlags(&stream1, cudaStreamNonBlocking);
cudaStreamCreateWithFlags(&stream2, cudaStreamNonBlocking);
kernel1<<<grid, block, 0, stream1>>>(...);
kernel2<<<grid, block>>>(...);
kernel3<<<grid, block, 0, stream2>>>(...);
cudaDeviceSynchronize();
2.3.6.2 Per-thread Default Stream
Starting in CUDA 7, CUDA allows each host thread to have its own independent default
stream, rather than the shared legacy default stream. In order to enable this behavior one
must either use the nvcc compiler option --default-stream per-thread or define the
CUDA_API_PER_THREAD_DEFAULT_STREAM preprocessor macro. When this behavior is enabled, each
host thread will have its own independent default stream which will not synchronize with other streams
in the way the legacy default stream does. In such a situation the legacy default stream example
above will exhibit the same synchronization behavior as the non-blocking stream example.
2.3.7. Explicit Synchronization
There are various ways to explicitly synchronize streams with each other.
cudaDeviceSynchronize() waits until all preceding commands in all streams of all host threads
have completed.
cudaStreamSynchronize() takes a stream as a parameter and waits until all preceding commands
in the given stream have completed. It can be used to synchronize the host with a specific stream,
allowing other streams to continue executing on the device.
cudaStreamWaitEvent() takes a stream and an event as parameters (see CUDA Events for a
description of events) and makes all the commands added to the given stream after the call to
cudaStreamWaitEvent() delay their execution until the given event has completed.
cudaStreamQuery() provides applications with a way to know if all preceding commands in a stream
have completed.
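As an illustration of these APIs working together, the following sketch makes stream2 wait on an event recorded in stream1 before running a dependent kernel (producerKernel, consumerKernel, grid, block, and data are placeholders, in the same spirit as the earlier snippets):

```cuda
cudaStream_t stream1, stream2;
cudaEvent_t event;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaEventCreate(&event);

producerKernel<<<grid, block, 0, stream1>>>(data);
cudaEventRecord(event, stream1);   // marks the completion point of producerKernel

// Commands added to stream2 after this call wait for the event,
// without blocking the host thread.
cudaStreamWaitEvent(stream2, event, 0);
consumerKernel<<<grid, block, 0, stream2>>>(data);

// Non-blocking poll: cudaSuccess means all work in stream2 has completed,
// cudaErrorNotReady means work is still pending.
if (cudaStreamQuery(stream2) == cudaErrorNotReady) {
    // ... do other host work while the GPU is busy ...
}
cudaStreamSynchronize(stream2);    // block the host until stream2 drains
```

Note that only the host thread blocks in cudaStreamSynchronize(); the event-based dependency between the two streams is resolved entirely on the device.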
2.3.8. Implicit Synchronization
Two operations from different streams cannot run concurrently if any CUDA operation on the NULL
stream is submitted in between them, unless the streams are non-blocking streams (created with the
cudaStreamNonBlocking flag).
Applications should follow these guidelines to improve their potential for concurrent kernel execution:
▶ All independent operations should be issued before dependent operations,
▶ Synchronization of any kind should be delayed as long as possible.
2.3.9. Miscellaneous and Advanced topics
2.3.9.1 Stream Prioritization
As mentioned previously, developers can assign priorities to CUDA streams. Prioritized streams need
to be created using the cudaStreamCreateWithPriority() function. The function takes the stream
handle, stream creation flags, and the priority level. The general scheme is that lower numbers
correspond to higher priorities. The priority range for a given device and context can be queried using
the cudaDeviceGetStreamPriorityRange() function. The default priority of a stream is 0.
int minPriority, maxPriority;
// Query the priority range for the device
cudaDeviceGetStreamPriorityRange(&minPriority, &maxPriority);

// Create two streams with different priorities.
// cudaStreamDefault indicates the stream should be created with default flags;
// in other words they will be blocking streams with respect to the legacy default stream.
// One could also use the flag `cudaStreamNonBlocking` here to create non-blocking streams.
cudaStream_t stream1, stream2;
cudaStreamCreateWithPriority(&stream1, cudaStreamDefault, minPriority); // Lowest priority
cudaStreamCreateWithPriority(&stream2, cudaStreamDefault, maxPriority); // Highest priority
We should note that the priority of a stream is only a hint to the runtime; it applies primarily
to kernel launches and may not be respected for memory transfers. Stream priorities will not preempt
already executing work, or guarantee any specific execution order.
2.3.9.2 Introduction to CUDA Graphs with Stream Capture
CUDA streams allow programs to specify a sequence of operations, kernels or memory copies, in order.
Using multiple streams and cross-stream dependencies with cudaStreamWaitEvent, an application
can specify a full directed acyclic graph (DAG) of operations. Some applications may have a sequence
or DAG of operations that needs to be run many times throughout execution.
For this situation, CUDA provides a feature known as CUDA graphs. This section introduces CUDA
graphs and one mechanism of creating them called stream capture. A more detailed discussion of
CUDA graphs is presented in CUDA Graphs. Capturing or creating a graph can help reduce the latency
and CPU overhead of repeatedly invoking the same chain of API calls from the host thread. Instead,
the APIs to specify the graph operations can be called once, and then the resulting graph executed
many times.
CUDA Graphs work in the following way:
i) The graph is captured by the application. This step is done once, the first time the graph is
executed. The graph can also be manually composed using the CUDA graph API.
ii) The graph is instantiated. This step is done one time, after the graph is captured. This step
can set up all the various runtime structures needed to execute the graph, in order to make launching
its components as fast as possible.
iii) In the remaining steps, the pre-instantiated graph is executed as many times as required. Since
all the runtime structures needed to execute the graph operations are already in place, the CPU
overheads of the graph execution are minimized.
Listing 2: The stages of capturing, instantiating and executing a simple linear graph using CUDA Graphs (from CUDA Developer Technical Blog, A. Gray, 2019)

#define N 500000 // tuned such that kernel takes a few microseconds

// A very lightweight kernel
__global__ void shortKernel(float * out_d, float * in_d){
    int idx=blockIdx.x*blockDim.x+threadIdx.x;
    if(idx<N) out_d[idx]=1.23*in_d[idx];
}

bool graphCreated=false;
cudaGraph_t graph;
cudaGraphExec_t instance;
// The graph will be executed NSTEP times
for(int istep=0; istep<NSTEP; istep++){
    if(!graphCreated){
        // Capture the graph
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        // Launch NKERNEL kernels
        for(int ikrnl=0; ikrnl<NKERNEL; ikrnl++){
            shortKernel<<<blocks, threads, 0, stream>>>(out_d, in_d);
        }
        // End the capture
        cudaStreamEndCapture(stream, &graph);
        // Instantiate the graph
        cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);
        graphCreated=true;
    }
    // Launch the graph
    cudaGraphLaunch(instance, stream);
    // Synchronize the stream
    cudaStreamSynchronize(stream);
}
Much more detail on CUDA graphs is provided in CUDA Graphs.
2.3.10. Summary of Asynchronous Execution
The key points of this section are:
▶ Asynchronous APIs allow us to express concurrent execution of tasks, providing a way to express overlapping of various operations. The actual concurrency achieved is dependent on available hardware resources and compute capabilities.
▶ The key abstractions in CUDA for asynchronous execution are streams, events and callback functions.
▶ Synchronization is possible at the event, stream and device level.
▶ The default stream is a blocking stream which synchronizes with all other blocking streams, but does not synchronize with non-blocking streams.
▶ The default stream behavior can be avoided using per-thread default streams via the --default-stream per-thread compiler option or the CUDA_API_PER_THREAD_DEFAULT_STREAM preprocessor macro.
▶ Streams can be created with different priorities, which are hints to the runtime and may not be respected for memory transfers.
▶ CUDA provides API functions to reduce or overlap the overheads of kernel launches and memory transfers, such as CUDA Graphs, Batched Memory Transfers and Programmatic Dependent Kernel Launch.
2.4. Unified and System Memory
Heterogeneous systems have multiple physical memories where data can be stored. The host CPU has
attached DRAM, and every GPU in a system has its own attached DRAM. Performance is best when
data is resident in the memory of the processor accessing it. CUDA provides APIs to explicitly manage
memory placement, but this can be verbose and complicate software design. CUDA provides features
and capabilities aimed at easing allocation, placement, and migration of data between different
physical memories.
The purpose of this chapter is to introduce and explain these features and what they mean to
application developers for both functionality and performance. Unified memory has several different
manifestations which depend upon the OS, driver version, and GPU used. This chapter will show how
to determine which unified memory paradigm applies and how the features of unified memory behave
in each. The later chapter on unified memory explains unified memory in more detail.
The following concepts will be defined and explained in this chapter:
▶ Unified Virtual Address Space - CPU memory and each GPU's memory have a distinct range within a single virtual address space
▶ Unified Memory - A CUDA feature that enables managed memory which can be automatically migrated between CPU and GPUs
▶ Limited Unified Memory - A unified memory paradigm with some limitations
▶ Full Unified Memory - Full support for unified memory features
▶ Full Unified Memory with Hardware Coherency - Full support for unified memory using hardware capabilities
▶ Unified memory hints - APIs to guide unified memory behavior for specific allocations
▶ Page-locked Host Memory - Non-pageable system memory, which is necessary for some CUDA operations
▶ Mapped memory - A mechanism (different from unified memory) for accessing host memory directly from a kernel
Additionally, the following terms used when discussing unified and system memory are introduced here:
▶ Heterogeneous Memory Management (HMM) - A feature of the Linux kernel that enables software coherency for full unified memory
▶ Address Translation Services (ATS) - A hardware feature, available when GPUs are connected to the CPU by the NVLink Chip-to-Chip (C2C) interconnect, which provides hardware coherency for full unified memory
2.4.1. Unified Virtual Address Space
A single virtual address space is used for all host memory and all global memory on all GPUs in the
system within a single OS process. All memory allocations on the host and on all devices lie in this
virtual address space. This is true whether allocations are made with CUDA APIs (e.g. cudaMalloc,
cudaMallocHost) or with system allocation APIs (e.g. new, malloc, mmap). The CPU and each GPU
has a unique range within the unified virtual address space.
This means:
▶ The location of any memory (that is, CPU or which GPU's memory it lies in) can be determined from the value of a pointer using cudaPointerGetAttributes()
▶ The cudaMemcpyKind parameter of cudaMemcpy*() can be set to cudaMemcpyDefault to automatically determine the copy type from the pointers
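As a brief sketch of these two points, the helper below (whereIs is a hypothetical name, not a CUDA API) classifies a pointer via cudaPointerGetAttributes, and the comment shows the direction-free copy:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Report which physical memory a pointer refers to, using the
// unified virtual address space.
void whereIs(const void *ptr)
{
    cudaPointerAttributes attr;
    cudaPointerGetAttributes(&attr, ptr);
    switch (attr.type) {
        case cudaMemoryTypeDevice:  printf("memory of device %d\n", attr.device); break;
        case cudaMemoryTypeHost:    printf("page-locked host memory\n");          break;
        case cudaMemoryTypeManaged: printf("managed (unified) memory\n");         break;
        default:                    printf("unregistered host memory\n");         break;
    }
}

// Because every pointer is unambiguous within the unified virtual address
// space, the copy direction can be inferred automatically:
//   cudaMemcpy(dst, src, numBytes, cudaMemcpyDefault);
```

Pointers from malloc that were never registered with CUDA report cudaMemoryTypeUnregistered, which is why the switch falls through to the default case for them.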
2.4.2. Unified Memory
Unified memory is a CUDA memory feature which allows memory allocations called managed memory
to be accessed from code running on either the CPU or the GPU. Unified memory was shown in the
introduction to CUDA in C++. Unified memory is available on all systems supported by CUDA.
On some systems, managed memory must be explicitly allocated. Managed memory can be explicitly
allocated in CUDA in a few different ways:
▶ The CUDA API cudaMallocManaged
▶ The CUDA API cudaMallocFromPoolAsync with a pool created with allocType set to cudaMemAllocationTypeManaged
▶ Global variables with the __managed__ specifier (see Memory Space Specifiers)
On systems with HMM or ATS, all system memory is implicitly managed memory, regardless of how it
is allocated. No special allocation is needed.
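A minimal sketch combining two of the explicit allocation methods above, cudaMallocManaged and a __managed__ global (the kernel and sizes are illustrative):

```cuda
#include <cuda_runtime.h>

__managed__ int lastValue;   // managed global, visible to host and device code

__global__ void increment(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main()
{
    const int n = 1 << 20;
    int *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(int)); // explicit managed allocation
    for (int i = 0; i < n; i++) data[i] = i;   // first touched on the CPU

    increment<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();                   // synchronize before CPU access

    lastValue = data[n - 1];                   // read back on the CPU
    cudaFree(data);
    return 0;
}
```

The same pointer is dereferenced by both host and device code; the driver migrates or maps the pages as required by the paradigm in use.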
2.4.2.1 Unified Memory Paradigms
The features and behavior of unified memory vary between operating systems, kernel versions on
Linux, GPU hardware, and the GPU-CPU interconnect. The form of unified memory available can be
determined by using cudaDeviceGetAttribute to query a few attributes:
▶ cudaDevAttrConcurrentManagedAccess - 1 for full unified memory support, 0 for limited support
▶ cudaDevAttrPageableMemoryAccess - 1 means all system memory is fully-supported unified memory, 0 means only memory explicitly allocated as managed memory is fully-supported unified memory
▶ cudaDevAttrPageableMemoryAccessUsesHostPageTables - Indicates the mechanism of CPU/GPU coherence: 1 is hardware, 0 is software.
Figure 18 illustrates how to determine the unified memory paradigm visually and is followed by a code
sample implementing the same logic.
There are four paradigms of unified memory operation:
▶ Full support for explicit managed memory allocations
▶ Full support for all allocations with software coherence
▶ Full support for all allocations with hardware coherence
▶ Limited unified memory support
When full support is available, it can either require explicit allocations, or all system memory may
implicitly be unified memory. When all memory is implicitly unified, the coherence mechanism can
either be software or hardware. Windows and some Tegra devices have limited support for unified memory.
Figure 18: All current GPUs use a unified virtual address space and have unified memory available.
When cudaDevAttrConcurrentManagedAccess is 1, full unified memory support is available; otherwise
only limited support is available. When full support is available, if cudaDevAttrPageableMemoryAccess
is also 1, then all system memory is unified memory. Otherwise, only memory allocated
with CUDA APIs (such as cudaMallocManaged) is unified memory. When all system memory is unified,
cudaDevAttrPageableMemoryAccessUsesHostPageTables indicates whether coherence is
provided by hardware (when the value is 1) or software (when the value is 0).
Table 3 shows the same information as Figure 18 as a table, with links to the relevant sections of this
chapter and more complete documentation in a later section of this guide.
Table 3: Overview of Unified Memory Paradigms

| Unified Memory Paradigm | Device Attributes | Full Documentation |
| ----------------------- | ----------------- | ------------------ |
| Limited unified memory support | cudaDevAttrConcurrentManagedAccess is 0 | Unified Memory on Windows, WSL, and Tegra; CUDA for Tegra Memory Management (unified memory on Tegra) |
| Full support for explicit managed memory allocations | cudaDevAttrPageableMemoryAccess is 0 and cudaDevAttrConcurrentManagedAccess is 1 | Unified Memory on Devices with only CUDA Managed Memory Support |
| Full support for all allocations with software coherence | cudaDevAttrPageableMemoryAccessUsesHostPageTables is 0 and cudaDevAttrPageableMemoryAccess is 1 and cudaDevAttrConcurrentManagedAccess is 1 | Unified Memory on Devices with Full CUDA Unified Memory Support |
| Full support for all allocations with hardware coherence | cudaDevAttrPageableMemoryAccessUsesHostPageTables is 1 and cudaDevAttrPageableMemoryAccess is 1 and cudaDevAttrConcurrentManagedAccess is 1 | Unified Memory on Devices with Full CUDA Unified Memory Support |
2.4.2.1.1 Unified Memory Paradigm: Code Example
The following code example demonstrates querying the device attributes and determining the unified
memory paradigm, following the logic of Figure 18, for each GPU in a system.
void queryDevices()
{
    int numDevices = 0;
    cudaGetDeviceCount(&numDevices);
    for(int i=0; i<numDevices; i++)
    {
        cudaSetDevice(i);
        cudaInitDevice(0, 0, 0);

        int deviceId = i;
        int concurrentManagedAccess = -1;
        cudaDeviceGetAttribute(&concurrentManagedAccess, cudaDevAttrConcurrentManagedAccess, deviceId);
        int pageableMemoryAccess = -1;
        cudaDeviceGetAttribute(&pageableMemoryAccess, cudaDevAttrPageableMemoryAccess, deviceId);
        int pageableMemoryAccessUsesHostPageTables = -1;
        cudaDeviceGetAttribute(&pageableMemoryAccessUsesHostPageTables, cudaDevAttrPageableMemoryAccessUsesHostPageTables, deviceId);

        printf("Device %d has ", deviceId);
        if(concurrentManagedAccess){
            if(pageableMemoryAccess){
                printf("full unified memory support");
                if(pageableMemoryAccessUsesHostPageTables)
                    { printf(" with hardware coherency\n"); }
                else
                    { printf(" with software coherency\n"); }
            }
            else
            { printf("full unified memory support for CUDA-made managed allocations\n"); }
        }
        else
        { printf("limited unified memory support: Windows, WSL, or Tegra\n"); }
    }
}
2.4.2.2 Full Unified Memory Feature Support
Most Linux systems have full unified memory support. If device attribute cudaDevAttrPageableMemoryAccess
is 1, then all system memory, whether allocated by CUDA APIs or system APIs, operates
as unified memory with full feature support. This includes file-backed memory allocations created
with mmap.
If cudaDevAttrPageableMemoryAccess is 0, then only memory allocated as managed memory by
CUDA behaves as unified memory. Memory allocated with system APIs is not managed and is not
necessarily accessible from GPU kernels.
In general, for unified allocations with full support:
▶ Managed memory is usually allocated in the memory space of the processor where it is first touched
▶ Managed memory is usually migrated when it is used by a processor other than the processor where it currently resides
▶ Managed memory is migrated or accessed at the granularity of memory pages (software coherence) or cache lines (hardware coherence)
▶ Oversubscription is allowed: an application may allocate more managed memory than is physically available on the GPU
Allocation and migration behavior can deviate from the above. This can be influenced by the
programmer using hints and prefetches. Full coverage of full unified memory support can be found in
Unified Memory on Devices with Full CUDA Unified Memory Support.
2.4.2.2.1 Full Unified Memory with Hardware Coherency
On hardware such as Grace Hopper and Grace Blackwell, where an NVIDIA CPU is used and the
interconnect between the CPU and GPU is NVLink Chip-to-Chip (C2C), address translation services (ATS)
are available. cudaDevAttrPageableMemoryAccessUsesHostPageTables is 1 when ATS is available.
With ATS, in addition to full unified memory support for all host allocations:
▶ GPU allocations (e.g. cudaMalloc) can be accessed from the CPU (cudaDevAttrDirectManagedMemAccessFromHost will be 1)
▶ The link between CPU and GPU supports native atomics (cudaDevAttrHostNativeAtomicSupported will be 1)
▶ Hardware support for coherence can improve performance compared to software coherence
ATS provides all capabilities of HMM. When ATS is available, HMM is automatically disabled. Further
discussion of hardware vs. software coherency is found in CPU and GPU Page Tables: Hardware
Coherency vs. Software Coherency.
2.4.2.2.2 HMM-FullUnifiedMemorywithSoftwareCoherency
Heterogeneous Memory Management (HMM) is a feature available on Linux operating systems (with
appropriate kernel versions) which enables software-coherent full unified memory support. Heterogeneous
memory management brings some of the capabilities and convenience provided by ATS to
PCIe-connected GPUs.
On Linux with kernel version 6.1.24, 6.2.11, or 6.3 or later, heterogeneous memory management
(HMM) may be available. The following command can be used to find out if the addressing mode is HMM:

$ nvidia-smi -q | grep Addressing
    Addressing Mode : HMM

When HMM is available, full unified memory is supported and all system allocations are implicitly
unified memory. If a system also has ATS, HMM is disabled and ATS is used, since ATS provides all the
capabilities of HMM and more.
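On a system whose addressing mode is HMM (or on an ATS system), even an ordinary malloc allocation can be passed straight to a kernel. A sketch, assuming the device attribute queries above reported full unified memory support for all allocations:

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    // Plain system allocation: implicitly unified memory under HMM or ATS.
    float *v = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) v[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(v, n);  // kernel accesses malloc'd memory
    cudaDeviceSynchronize();                // synchronize before the CPU reads

    free(v);
    return 0;
}
```

On systems without HMM or ATS this code is not valid; there the pointer would have to come from cudaMallocManaged (or the memory would need to be mapped, as described later in this chapter).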
2.4.2.3 Limited Unified Memory Support
On Windows, including Windows Subsystem for Linux (WSL), and on some Tegra systems, a limited
subset of unified memory functionality is available. On these systems, managed memory is available,
but migration between CPU and GPUs behaves differently.
▶ Managed memory is first allocated in the CPU's physical memory
▶ Managed memory is migrated at a larger granularity than virtual memory pages
▶ Managed memory is migrated to the GPU when the GPU begins executing
▶ The CPU must not access managed memory while the GPU is active
▶ Managed memory is migrated back to the CPU when the GPU is synchronized
▶ Oversubscription of GPU memory is not allowed
▶ Only memory explicitly allocated by CUDA as managed memory is unified
Full coverage of this paradigm can be found in Unified Memory on Windows, WSL, and Tegra.
2.4.2.4 Memory Advise and Prefetch
The programmer can provide hints to the NVIDIA driver managing unified memory to help it maximize
application performance. The CUDA API cudaMemAdvise allows the programmer to specify properties
of allocations that affect where they are placed and whether or not the memory is migrated when
accessed from another device.
cudaMemPrefetchAsync allows the programmer to suggest that an asynchronous migration of a
specific allocation to a different location be started. A common use is starting the transfer of data a
kernel will use before the kernel is launched. This enables the copy of data to occur while other GPU
kernels are executing.
The section on Performance Hints covers the different hints that can be passed to cudaMemAdvise
and shows examples of using cudaMemPrefetchAsync.
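A sketch of these two APIs used together on a managed allocation (kernelUsingData, grid, block, and N are placeholders; the integer-device form of cudaMemAdvise is shown):

```cuda
cudaStream_t stream;
cudaStreamCreate(&stream);

float *data = nullptr;
size_t bytes = N * sizeof(float);
cudaMallocManaged(&data, bytes);

int device = 0;
cudaGetDevice(&device);

// Hint: this data will mostly be read, so the driver may keep
// read-only copies on each processor that accesses it.
cudaMemAdvise(data, bytes, cudaMemAdviseSetReadMostly, device);

// Start migrating the data to the GPU before the kernel that uses it,
// so the transfer can overlap earlier work in the stream.
cudaMemPrefetchAsync(data, bytes, device, stream);
kernelUsingData<<<grid, block, 0, stream>>>(data);

// Prefetch back to the CPU before host-side access.
cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId, stream);
cudaStreamSynchronize(stream);
```

Both calls are suggestions rather than commands; the driver remains free to place and migrate the pages as it sees fit.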
2.4.3. Page-Locked Host Memory
In introductory code examples, cudaMallocHost was used to allocate memory on the CPU. This
allocates page-locked memory (also known as pinned memory) on the host. Host allocations made
through traditional allocation mechanisms like malloc, new, or mmap are not page-locked, which means
they may be swapped to disk or physically relocated by the operating system.
Page-locked host memory is required for asynchronous copies between the CPU and GPU. Page-locked
host memory also improves performance of synchronous copies. Page-locked memory can be mapped
to the GPU for direct access from GPU kernels.
The CUDA runtime provides APIs to allocate page-locked host memory or to page-lock existing allocations:
▶ cudaMallocHost allocates page-locked host memory
▶ cudaHostAlloc defaults to the same behavior as cudaMallocHost, but also takes flags to specify other memory parameters
▶ cudaFreeHost frees memory allocated with cudaMallocHost or cudaHostAlloc
▶ cudaHostRegister page-locks a range of existing memory allocated outside the CUDA API, such as with malloc or mmap
cudaHostRegister enables host memory allocated by 3rd-party libraries or other code outside of a
developer's control to be page-locked so that it can be used in asynchronous copies or mapped.
Note
Page-locked host memory can be used for asynchronous copies and mapped memory by all GPUs
in the system.
Page-locked host memory is not cached on non-I/O-coherent Tegra devices. Also, cudaHostRegister()
is not supported on non-I/O-coherent Tegra devices.
2.4.3.1 Mapped Memory
On systems with HMM or ATS, all host memory is directly accessible from the GPU using the host
pointers. When ATS or HMM are not available, host allocations can be made accessible to the GPU by
mapping the memory into the GPU's memory space. Mapped memory is always page-locked.
The code examples which follow will illustrate the following array copy kernel operating directly on
mapped host memory.
__global__ void copyKernel(float* a, float* b)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    a[idx] = b[idx];
}
While mapped memory may be useful in some cases where certain data which is not copied to the
GPU needs to be accessed from a kernel, accessing mapped memory in a kernel requires transactions
across the CPU-GPU interconnect, PCIe or NVLink C2C. These operations have higher latency and
lower bandwidth compared to accessing device memory. Mapped memory should not be considered a
performant alternative to unified memory or explicit memory management for the majority of a
kernel's memory needs.
2.4.3.1.1 cudaMallocHost and cudaHostAlloc

Host memory allocated with cudaMallocHost or cudaHostAlloc is automatically mapped. The
pointers returned by these APIs can be directly used in kernel code to access the memory on the
host. The host memory is accessed over the CPU-GPU interconnect.
cudaMallocHost

void usingMallocHost() {
    float* a = nullptr;
    float* b = nullptr;
    CUDA_CHECK(cudaMallocHost(&a, vLen*sizeof(float)));
    CUDA_CHECK(cudaMallocHost(&b, vLen*sizeof(float)));

    initVector(b, vLen);
    memset(a, 0, vLen*sizeof(float));

    int threads = 256;
    int blocks = vLen/threads;
    copyKernel<<<blocks, threads>>>(a, b);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    printf("Using cudaMallocHost: ");
    checkAnswer(a, b);
}
cudaHostAlloc

void usingCudaHostAlloc() {
    float* a = nullptr;
    float* b = nullptr;
    CUDA_CHECK(cudaHostAlloc(&a, vLen*sizeof(float), cudaHostAllocMapped));
    CUDA_CHECK(cudaHostAlloc(&b, vLen*sizeof(float), cudaHostAllocMapped));
    initVector(b, vLen);
    memset(a, 0, vLen*sizeof(float));
    int threads = 256;
    int blocks = vLen/threads;
    copyKernel<<<blocks, threads>>>(a, b);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());
    printf("Using cudaHostAlloc: ");
    checkAnswer(a, b);
}
2.4.3.1.2 cudaHostRegister
When ATS and HMM are not available, allocations made by system allocators can still be mapped for access directly from GPU kernels using cudaHostRegister. Unlike memory created with CUDA APIs, however, the memory cannot be accessed from the kernel using the host pointer. A pointer in the device's memory region must be obtained using cudaHostGetDevicePointer(), and that pointer must be used for accesses in kernel code.
void usingRegister() {
    float* a = nullptr;
    float* b = nullptr;
    float* devA = nullptr;
    float* devB = nullptr;
    a = (float*)malloc(vLen*sizeof(float));
    b = (float*)malloc(vLen*sizeof(float));
    CUDA_CHECK(cudaHostRegister(a, vLen*sizeof(float), 0));
    CUDA_CHECK(cudaHostRegister(b, vLen*sizeof(float), 0));
    CUDA_CHECK(cudaHostGetDevicePointer((void**)&devA, (void*)a, 0));
    CUDA_CHECK(cudaHostGetDevicePointer((void**)&devB, (void*)b, 0));
    initVector(b, vLen);
    memset(a, 0, vLen*sizeof(float));
    int threads = 256;
    int blocks = vLen/threads;
    copyKernel<<<blocks, threads>>>(devA, devB);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());
    printf("Using cudaHostRegister: ");
    checkAnswer(a, b);
}
2.4.3.1.3 Comparing Unified Memory and Mapped Memory
Mapped memory makes CPU memory accessible from the GPU, but does not guarantee that all types of access, for example atomics, are supported on all systems. Unified memory guarantees that all access types are supported.
Mapped memory remains in CPU memory, which means all GPU accesses must go through the connection between the CPU and GPU: PCIe or NVLink. Latency of accesses made across these links is significantly higher than access to GPU memory, and total available bandwidth is lower. As such, using mapped memory for all kernel memory accesses is unlikely to fully utilize GPU computing resources.
Unified memory is most often migrated to the physical memory of the processor accessing it. After the first migration, repeated access to the same memory page or cache line by a kernel can utilize the full GPU memory bandwidth.
Note
Mapped memory has also been referred to as zero-copy memory in previous documents.
Prior to all CUDA applications using a unified virtual address space, additional APIs were needed to enable memory mapping (cudaSetDeviceFlags with cudaDeviceMapHost). These APIs are no longer needed.
Atomic functions (see Atomic Functions) operating on mapped host memory are not atomic from the point of view of the host or other GPUs.
The CUDA runtime requires that 1-byte, 2-byte, 4-byte, 8-byte, and 16-byte naturally aligned loads and stores to host memory initiated from the device are preserved as single accesses from the point of view of the host and other devices. On some platforms, atomics to memory may be broken by the hardware into separate load and store operations. These component load and store operations have the same requirements on preservation of naturally aligned accesses. The CUDA runtime does not support a PCI Express bus topology where a PCI Express bridge splits 8-byte naturally aligned operations, and NVIDIA is not aware of any topology that splits 16-byte naturally aligned operations.
2.4.4. Summary
▶ On Linux platforms with heterogeneous memory management (HMM) or address translation services (ATS), all system-allocated memory is managed memory.
▶ On Linux platforms without HMM or ATS, on Tegra processors, and on all Windows platforms, managed memory must be allocated using CUDA:
  ▶ cudaMallocManaged, or
  ▶ cudaMallocFromPoolAsync with a pool created with allocType=cudaMemAllocationTypeManaged, or
  ▶ Global variables with the __managed__ specifier
▶ On Windows and Tegra processors, unified memory has limitations.
▶ On NVLink C2C connected systems with ATS, device memory allocated with cudaMalloc can be directly accessed from the CPU or other GPUs.
2.5. NVCC: The NVIDIA CUDA Compiler
The NVIDIA CUDA Compiler nvcc is a toolchain from NVIDIA for compiling CUDA C/C++ as well as PTX code. The toolchain is part of the CUDA Toolkit and consists of several tools, including the compiler, linker, and the PTX and Cubin assemblers. The top-level nvcc tool coordinates the compilation process, invoking the appropriate tool for each stage of compilation.
nvcc drives offline compilation of CUDA code, in contrast to online or Just-in-Time (JIT) compilation driven by the CUDA runtime compiler nvrtc.
This chapter covers the most common uses and details of nvcc needed for building applications. Full coverage of nvcc is found in the nvcc documentation.
2.5.1. CUDA Source Files and Headers
Source files compiled with nvcc may contain a combination of host code, which executes on the CPU, and device code, which executes on the GPU. nvcc accepts the common C/C++ source file extensions .c, .cpp, .cc, .cxx for host-only code and .cu for files that contain device code or a mix of host and device code. Headers containing device code typically adopt the .cuh extension to distinguish them from host-only code headers: .h, .hpp, .hh, .hxx, etc.
| File Extension | Description | Content |
| --- | --- | --- |
| .c | C source file | Host-only code |
| .cpp, .cc, .cxx | C++ source file | Host-only code |
| .h, .hpp, .hh, .hxx | C/C++ header file | Device code, host code, mix of host/device code |
| .cu | CUDA source file | Device code, host code, mix of host/device code |
| .cuh | CUDA header file | Device code, host code, mix of host/device code |
2.5.2. NVCC Compilation Workflow
In the initial phase, nvcc separates the device code from the host code and dispatches their compilation to the GPU and the host compilers, respectively.
To compile the host code, the CUDA compiler nvcc requires a compatible host compiler to be available. The CUDA Toolkit defines the host compiler support policy for Linux and Windows platforms.
Files containing only host code can be built using either nvcc or the host compiler directly. The resulting object files can be combined with object files from nvcc which contain GPU code at link time.
The GPU compiler compiles the C/C++ device code to PTX assembly code. The GPU compiler is run for each virtual machine instruction set architecture (e.g. compute_90) specified in the compilation command line.
Individual PTX code is then passed to the ptxas tool, which generates Cubin for the target hardware ISAs. The hardware ISA is identified by its SM version.
It is possible to embed multiple PTX and Cubin targets into a single binary Fatbin container within an application or library so that a single binary can support multiple virtual and target hardware ISAs.
The invocation and coordination of the tools described above are done automatically by nvcc. The -v option can be used to display the full compilation workflow and tool invocation. The -keep option can be used to save the intermediate files generated during the compilation in the current directory or in the directory specified by --keep-dir instead.
The following example illustrates the compilation workflow for a CUDA source file example.cu:

// ----- example.cu -----
#include <stdio.h>

__global__ void kernel() {
    printf("Hello from kernel\n");
}

void kernel_launcher() {
    kernel<<<1, 1>>>();
    cudaDeviceSynchronize();
}

int main() {
    kernel_launcher();
    return 0;
}
[Figure: nvcc basic compilation workflow]
[Figure: nvcc compilation workflow with multiple PTX and Cubin architectures]
A more detailed description of the nvcc compilation workflow can be found in the compiler documentation.
2.5.3. NVCC Basic Usage
The basic command to compile a CUDA source file with nvcc is:

nvcc <source_file>.cu -o <output_file>

nvcc accepts common compiler flags used for specifying include directories -I <path> and library paths -L <path>, linking against other libraries -l<library>, and defining macros -D<macro>=<value>.

nvcc example.cu -I path_to_include/ -L path_to_library/ -lcublas -o <output_file>
2.5.3.1 NVCC PTX and Cubin Generation
By default, nvcc generates PTX and Cubin for the earliest GPU architecture (lowest compute_XY and sm_XY version) supported by the CUDA Toolkit to maximize compatibility.
▶ The -arch option can be used to generate PTX and Cubin for a specific GPU architecture.
▶ The -gencode option can be used to generate PTX and Cubin for multiple GPU architectures.
The complete list of supported real and virtual GPU architectures can be obtained by passing the --list-gpu-code and --list-gpu-arch flags respectively, or by referring to the Virtual Architecture List and the GPU Architecture List sections within the nvcc documentation.
nvcc --list-gpu-code                # list all supported real GPU architectures
nvcc --list-gpu-arch                # list all supported virtual GPU architectures

nvcc example.cu -arch=compute_<XY>  # e.g. -arch=compute_80 for NVIDIA Ampere GPUs and later
                                    # PTX-only, GPU forward compatible
nvcc example.cu -arch=sm_<XY>       # e.g. -arch=sm_80 for NVIDIA Ampere GPUs and later
                                    # PTX and Cubin, GPU forward compatible
nvcc example.cu -arch=native        # automatically detects and generates Cubin for the current GPU
                                    # no PTX, no GPU forward compatibility
nvcc example.cu -arch=all           # generate Cubin for all supported GPU architectures
                                    # also includes the latest PTX for GPU forward compatibility
nvcc example.cu -arch=all-major     # generate Cubin for all major supported GPU architectures, e.g. sm_80, sm_90
                                    # also includes the latest PTX for GPU forward compatibility
More advanced usage allows PTX and Cubin targets to be specified individually:

# generate PTX for virtual architecture compute_80 and compile it to Cubin for real architecture sm_86, keep compute_80 PTX
nvcc example.cu -arch=compute_80 -gpu-code=sm_86,compute_80  # (PTX and Cubin)

# generate PTX for virtual architecture compute_80 and compile it to Cubin for real architectures sm_86, sm_89
nvcc example.cu -arch=compute_80 -gpu-code=sm_86,sm_89       # (no PTX)
nvcc example.cu -gencode=arch=compute_80,code=sm_86,sm_89    # same as above

# (1) generate PTX for virtual architecture compute_80 and compile it to Cubin for real architectures sm_86, sm_89
# (2) generate PTX for virtual architecture compute_90 and compile it to Cubin for real architecture sm_90
nvcc example.cu -gencode=arch=compute_80,code=sm_86,sm_89 -gencode=arch=compute_90,code=sm_90
The full reference of nvcc command-line options for steering GPU code generation can be found in the nvcc documentation.
2.5.3.2 Host Code Compilation Notes
Compilation units, namely a source file and its headers, that do not contain device code or symbols can be compiled directly with a host compiler. If any compilation unit uses CUDA runtime API functions, the application must be linked with the CUDA runtime library. The CUDA runtime is available as both a static and a shared library, libcudart_static and libcudart, respectively. By default, nvcc links against the static CUDA runtime library. To use the shared library version of the CUDA runtime, pass the flag --cudart=shared to nvcc on the compile or link command.
nvcc allows the host compiler used for host functions to be specified via the -ccbin <compiler> argument. The environment variable NVCC_CCBIN can also be defined to specify the host compiler used by nvcc. The -Xcompiler argument to nvcc passes through arguments to the host compiler. For example, in the example below, the -O3 argument is passed to the host compiler by nvcc.

nvcc example.cu -ccbin=clang++
export NVCC_CCBIN='gcc'
nvcc example.cu -Xcompiler=-O3
2.5.3.3 Separate Compilation of GPU Code
nvcc defaults to whole-program compilation, which expects all GPU code and symbols to be present in the compilation unit that uses them. CUDA device functions may call device functions or access device variables defined in other compilation units, but either the -rdc=true flag or its alias -dc must be specified on the nvcc command line to enable linking of device code from different compilation units. The ability to link device code and symbols from different compilation units is called separate compilation.
Separate compilation allows more flexible code organization, can improve compile time, and can lead to smaller binaries. Separate compilation may involve some build-time complexity compared to whole-program compilation. Performance can be affected by the use of device code linking, which is why it is not used by default. Link-Time Optimization (LTO) can help reduce the performance overhead of separate compilation.
Separate compilation requires the following conditions:
▶ Non-const device variables defined in one compilation unit must be referred to with the extern keyword in other compilation units.
▶ All const device variables must be defined and referred to with the extern keyword.
▶ All CUDA source files .cu must be compiled with the -dc or -rdc=true flags.
Host and device functions have external linkage by default and do not require the extern keyword. Note that starting from CUDA 13, __global__ functions and __managed__/__device__/__constant__ variables have internal linkage by default.
In the following example, definition.cu defines a variable and a function, while example.cu refers to them. Both files are compiled separately and linked into the final binary.

// ----- definition.cu -----
extern __device__ int device_variable = 5;
__device__ int device_function() { return 10; }

// ----- example.cu -----
extern __device__ int device_variable;
__device__ int device_function();

__global__ void kernel(int* ptr) {
    device_variable = 0;
    *ptr = device_function();
}

nvcc -dc definition.cu -o definition.o
nvcc -dc example.cu -o example.o
nvcc definition.o example.o -o program
2.5.4. Common Compiler Options
This section presents the most relevant compiler options that can be used with nvcc, covering language features, optimization, debugging, profiling, and build aspects. The full description of all options can be found in the nvcc documentation.
2.5.4.1 Language Features
nvcc supports the C++ core language features, from C++03 to C++20. The -std flag can be used to specify the language standard to use:

--std={c++03|c++11|c++14|c++17|c++20}

In addition, nvcc supports the following language extensions:
▶ -restrict: Assert that all kernel pointer parameters are restrict pointers.
▶ -extended-lambda: Allow __host__, __device__ annotations in lambda declarations.
▶ -expt-relaxed-constexpr: (Experimental flag) Allow host code to invoke __device__ constexpr functions, and device code to invoke __host__ constexpr functions.
More detail on these features can be found in the extended lambda and constexpr sections.
2.5.4.2 Debugging Options
nvcc supports the following options to generate debug information:
▶ -g: Generate debug information for host code. gdb/lldb and similar tools rely on such information for host code debugging.
▶ -G: Generate debug information for device code. cuda-gdb relies on such information for device-code debugging. The flag also defines the __CUDACC_DEBUG__ macro.
▶ -lineinfo: Generate line-number information for device code. This option does not affect execution performance and is useful in conjunction with the compute-sanitizer tool to trace the kernel execution.
nvcc uses the highest optimization level -O3 for GPU code by default. The -G debug flag prevents some compiler optimizations, and so debug code is expected to have lower performance than non-debug code. The -DNDEBUG flag can be defined to disable runtime assertions, as these can also slow down execution.
2.5.4.3 Optimization Options
nvcc provides many options for optimizing performance. This section aims to provide a brief survey of some of the options available that developers may find useful, as well as links to further information. Complete coverage can be found in the nvcc documentation.
▶ -Xptxas passes arguments to the PTX assembler tool ptxas. The nvcc documentation provides a list of useful arguments for ptxas. For example, -Xptxas=-maxrregcount=N specifies the maximum number of registers to use, per thread.
▶ -extra-device-vectorization: Enables more aggressive device code vectorization.
▶ Additional flags which provide fine-grained control over floating point behavior are covered in the Floating-Point Computation section and in the nvcc documentation.
The following flags get output from the compiler which can be useful in more advanced code optimization:
▶ -res-usage: Print a resource usage report after compilation. It includes the number of registers, shared memory, constant memory, and local memory allocated for each kernel function.
▶ -opt-info=inline: Print information about inlined functions.
▶ -Xptxas=-warn-lmem-usage: Warn if local memory is used.
▶ -Xptxas=-warn-spills: Warn if registers are spilled to local memory.
2.5.4.4 Link-Time Optimization (LTO)
Separate compilation can result in lower performance than whole-program compilation due to limited cross-file optimization opportunities. Link-Time Optimization (LTO) addresses this by performing optimizations across separately compiled files at link time, at the cost of increased compilation time. LTO can recover much of the performance of whole-program compilation while maintaining the flexibility of separate compilation.
nvcc requires the -dlto flag or lto_<SM version> link-time optimization targets to enable LTO:

nvcc -dc -dlto -arch=sm_100 definition.cu -o definition.o
nvcc -dc -dlto -arch=sm_100 example.cu -o example.o
nvcc -dlto definition.o example.o -o program

nvcc -dc -arch=lto_100 definition.cu -o definition.o
nvcc -dc -arch=lto_100 example.cu -o example.o
nvcc -dlto definition.o example.o -o program
2.5.4.5 Profiling Options
It is possible to directly profile a CUDA application using the Nsight Compute and Nsight Systems tools without the need for additional flags during the compilation process. However, additional information which can be generated by nvcc can assist profiling by correlating source files with the generated code:
▶ -lineinfo: Generate line-number information for device code; this allows viewing the source code in the profiling tools. Profiling tools require the original source code to be available in the same location where the code was compiled.
▶ -src-in-ptx: Keep the original source code in the PTX, avoiding the limitations of -lineinfo mentioned above. Requires -lineinfo.
2.5.4.6 Fatbin Compression
nvcc compresses the fatbins stored in application or library binaries by default. Fatbin compression can be controlled using the following options:
▶ -no-compress: Disable the compression of the fatbin.
▶ --compress-mode={default|size|speed|balance|none}: Set the compression mode. speed focuses on fast decompression time, while size aims at reducing the fatbin size. balance provides a trade-off between speed and size. The default mode is speed. none disables compression.
2.5.4.7 Compiler Performance Controls
nvcc provides options to analyze and accelerate the compilation process itself:
▶ -t <N>: The number of CPU threads used to parallelize the compilation of a single compilation unit for multiple GPU architectures.
▶ -split-compile <N>: The number of CPU threads used to parallelize the optimization phase.
▶ -split-compile-extended <N>: More aggressive form of split compilation. Requires link-time optimization.
▶ -Ofc <N>: Level of device code compilation speed.
▶ -time <filename>: Generate a comma-separated value (CSV) table with the time taken by each compilation phase.
▶ -fdevice-time-trace: Generate a time trace for device code compilation.
Chapter 3. Advanced CUDA
3.1. Advanced CUDA APIs and Features
This section covers the use of more advanced CUDA APIs and features. These topics cover techniques or features that do not usually require CUDA kernel modifications, but can still influence, from the host side, application-level behavior, both in terms of GPU work execution and performance as well as CPU-side performance.
3.1.1. cudaLaunchKernelEx
When the triple chevron notation was introduced in the first versions of CUDA, the kernel configuration of a kernel launch had only four programmable parameters:
▶ thread block dimensions
▶ grid dimensions
▶ dynamic shared memory (optional, 0 if unspecified)
▶ stream (default stream used if unspecified)
Some CUDA features can benefit from additional attributes and hints provided with a kernel launch. The cudaLaunchKernelEx API enables a program to set the above mentioned execution configuration parameters via the cudaLaunchConfig_t structure. In addition, the cudaLaunchConfig_t structure allows the program to pass in zero or more cudaLaunchAttributes to control or suggest other parameters for the kernel launch. For example, the cudaLaunchAttributePreferredSharedMemoryCarveout attribute discussed later in this chapter (see Configuring L1/Shared Memory Balance) is specified using cudaLaunchKernelEx. The cudaLaunchAttributeClusterDimension attribute, discussed later in this chapter, is used to specify the desired cluster size for the kernel launch.
The complete list of supported attributes and their meaning is captured in the CUDA Runtime API Reference Documentation.
3.1.2. Launching Clusters
Thread block clusters, introduced in previous sections, are an optional level of thread block organization available in compute capability 9.0 and higher which enables applications to guarantee that thread blocks of a cluster are simultaneously executed on a single GPC. This enables larger groups of threads than those that fit in a single SM to exchange data and synchronize with each other.
Section 2.1.10.1 showed how a kernel which uses clusters can be specified and launched using triple chevron notation. In that section, the __cluster_dims__ annotation was used to specify the dimensions of the cluster which must be used to launch the kernel. When using triple chevron notation, the size of the clusters is determined implicitly.
3.1.2.1 Launching with Clusters using cudaLaunchKernelEx
Unlike launching kernels using clusters with triple chevron notation, the size of the thread block cluster can be configured on a per-launch basis. The code example below shows how to launch a cluster kernel using cudaLaunchKernelEx.
// Kernel definition
// No compile time attribute attached to the kernel
__global__ void cluster_kernel(float *input, float* output)
{
}

int main()
{
    float *input, *output;
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);

    // Kernel invocation with runtime cluster size
    {
        cudaLaunchConfig_t config = {0};
        // The grid dimension is not affected by cluster launch, and is still
        // enumerated using number of blocks.
        // The grid dimension should be a multiple of cluster size.
        config.gridDim = numBlocks;
        config.blockDim = threadsPerBlock;

        cudaLaunchAttribute attribute[1];
        attribute[0].id = cudaLaunchAttributeClusterDimension;
        attribute[0].val.clusterDim.x = 2; // Cluster size in X-dimension
        attribute[0].val.clusterDim.y = 1;
        attribute[0].val.clusterDim.z = 1;
        config.attrs = attribute;
        config.numAttrs = 1;

        cudaLaunchKernelEx(&config, cluster_kernel, input, output);
    }
}
There are two cudaLaunchAttribute types which are relevant to thread block clusters: cudaLaunchAttributeClusterDimension and cudaLaunchAttributePreferredClusterDimension.
The attribute id cudaLaunchAttributeClusterDimension specifies the required dimensions with which to execute the cluster. The value for this attribute, clusterDim, is a 3-dimensional value. The corresponding dimensions of the grid (x, y, and z) must be divisible by the respective dimensions of the specified cluster dimension. Setting this is similar to using the __cluster_dims__ attribute on the kernel definition at compile time as shown in Launching with Clusters in Triple Chevron Notation, but can be changed at runtime for different launches of the same kernel.
On GPUs with compute capability 10.0 and higher, another attribute id, cudaLaunchAttributePreferredClusterDimension, allows the application to additionally specify a preferred dimension for the cluster. The preferred dimension must be an integer multiple of the minimum cluster dimensions specified by the __cluster_dims__ attribute on the kernel or the cudaLaunchAttributeClusterDimension attribute to cudaLaunchKernelEx. That is, a minimum cluster dimension must be specified in addition to the preferred cluster dimension. The corresponding dimensions of the grid (x, y, and z) must be divisible by the respective dimension of the specified preferred cluster dimension.
All thread blocks will execute in clusters of at least the minimum cluster dimension. Where possible, clusters of the preferred dimension will be used, but not all clusters are guaranteed to execute with the preferred dimensions. All thread blocks will execute in clusters with either the minimum or preferred cluster dimension. Kernels which use a preferred cluster dimension must be written to operate correctly in either the minimum or the preferred cluster dimension.
3.1.2.2 Blocks as Clusters
When a kernel is defined with the __cluster_dims__ annotation, the number of clusters in the grid is implicit and can be calculated from the size of the grid divided by the specified cluster size.

__cluster_dims__((2, 2, 2)) __global__ void foo();

// 8x8x8 clusters each with 2x2x2 thread blocks.
foo<<<dim3(16, 16, 16), dim3(1024, 1, 1)>>>();

In the above example, the kernel is launched as a grid of 16x16x16 thread blocks, which means a grid of 8x8x8 clusters is used.
A kernel can alternatively use the __block_size__ annotation, which specifies both the required block size and cluster size at the time the kernel is defined. When this annotation is used, the grid dimension in the triple chevron launch is expressed in terms of clusters rather than thread blocks, as shown below.

// Implementation detail of how many threads per block and blocks per cluster
// is handled as an attribute of the kernel.
__block_size__((1024, 1, 1), (2, 2, 2)) __global__ void foo();

// 8x8x8 clusters.
foo<<<dim3(8, 8, 8)>>>();

__block_size__ requires two fields, each being a tuple of 3 elements. The first tuple denotes the block dimension and the second the cluster size. The second tuple is assumed to be (1, 1, 1) if it is not passed. To specify the stream, one must pass 1 and 0 as the second and third arguments within <<<>>> and lastly the stream. Passing other values would lead to undefined behavior.
Note that it is illegal for the second tuple of __block_size__ and __cluster_dims__ to be specified at the same time. It is also illegal to use __block_size__ with an empty __cluster_dims__. When the second tuple of __block_size__ is specified, it implies that "Blocks as Clusters" is enabled, and the compiler recognizes the first argument inside <<<>>> as the number of clusters instead of thread blocks.
3.1.3. More on Streams and Events
CUDA Streams introduced the basics of CUDA streams. By default, operations submitted on a given CUDA stream are serialized: one cannot start executing until the previous one has completed. The only exception is the recently added Programmatic Dependent Launch and Synchronization feature. Having multiple CUDA streams is a way to enable concurrent execution; another way is using CUDA Graphs. The two approaches can also be combined.
CUDA Programming Guide, Release 13.1
Work submitted on different CUDA streams may execute concurrently under specific circumstances, e.g., if there are no event dependencies, if there is no implicit synchronization, if there are sufficient resources, etc.
Independent operations from different CUDA streams cannot run concurrently if any CUDA operation on the NULL stream is submitted in between them, unless the streams are non-blocking CUDA streams. These are streams created with the cudaStreamCreateWithFlags() runtime API with the cudaStreamNonBlocking flag. To improve the potential for concurrent GPU work execution, it is recommended that the user creates non-blocking CUDA streams.
It is also recommended that the user selects the least general synchronization option that is sufficient for their problem. For example, if the requirement is for the CPU to wait (block) for all work on a specific CUDA stream to complete, using cudaStreamSynchronize() for that stream would be preferable to cudaDeviceSynchronize(), as the latter would unnecessarily wait for GPU work on all CUDA streams of the device to complete. And if the requirement is for the CPU to wait without blocking, then using cudaStreamQuery() and checking its return value, in a polling loop, may be preferable.
A similar synchronization effect can also be achieved with CUDA events (CUDA Events), e.g., by recording an event on that stream and calling cudaEventSynchronize() to wait, in a blocking manner, for the work captured in that event to complete. Again, this would be preferable and more focused than using cudaDeviceSynchronize(). Calling cudaEventQuery() and checking its return value, e.g., in a polling loop, would be a non-blocking alternative.
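As an illustration of the non-blocking, polling style described above, a sketch might look like the following (stream, event, and do_other_cpu_work() are placeholders; error handling omitted):

```cpp
cudaEventRecord(event, stream);
// Poll instead of blocking: keep the CPU busy until the GPU work completes.
while (cudaEventQuery(event) == cudaErrorNotReady) {
    do_other_cpu_work();  // hypothetical host-side function
}
```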
The choice of the explicit synchronization method is particularly important if this operation happens in the application's critical path. Table 4 provides a high-level summary of various synchronization options with the host.
Table 4: Summary of explicit synchronization options with the host

| | Wait for specific stream | Wait for specific event | Wait for everything on the device |
| --- | --- | --- | --- |
| Non-blocking (would need a polling loop) | cudaStreamQuery() | cudaEventQuery() | N/A |
| Blocking | cudaStreamSynchronize() | cudaEventSynchronize() | cudaDeviceSynchronize() |
For synchronization, i.e., to express dependencies, between CUDA streams, use of non-timing CUDA events is recommended, as described in CUDA Events. A user can call cudaStreamWaitEvent() to force future submitted operations on a specific stream to wait for the completion of a previously recorded event (e.g., on another stream). Note that for any CUDA API waiting on or querying an event, it is the responsibility of the user to ensure the cudaEventRecord API has already been called, as a non-recorded event will always return success.
CUDA events carry, by default, timing information, as they can be used in cudaEventElapsedTime() API calls. However, a CUDA event that is solely used to express dependencies across streams does not need timing information. For such cases, it is recommended to create events with timing information disabled for improved performance. This is possible using the cudaEventCreateWithFlags() API with the cudaEventDisableTiming flag.
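Putting the two recommendations together, a minimal sketch is shown below. It assumes stream1 and stream2 already exist, and kernelA/kernelB are placeholder kernels:

```cpp
cudaEvent_t ev;
cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);  // no timing overhead

kernelA<<<grid, block, 0, stream1>>>();  // produce data on stream1
cudaEventRecord(ev, stream1);            // capture the point after kernelA
cudaStreamWaitEvent(stream2, ev, 0);     // stream2 work now waits for kernelA
kernelB<<<grid, block, 0, stream2>>>();  // consume data on stream2
```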
3.1.3.1 Stream Priorities
The relative priorities of streams can be specified at creation time using cudaStreamCreateWithPriority(). The range of allowable priorities, ordered as [greatest priority, least priority], can be obtained using the cudaDeviceGetStreamPriorityRange() function. At runtime, the GPU scheduler utilizes stream priorities to determine task execution order, but these priorities serve as hints rather than guarantees. When selecting work to launch, pending tasks in higher-priority streams take precedence over those in lower-priority streams. Higher-priority tasks do not preempt already running lower-priority tasks. The GPU does not reassess work queues during task execution, and increasing a stream's priority will not interrupt ongoing work. In short, users can leverage stream priorities to influence task execution order without relying on strict ordering guarantees.
The following code sample obtains the allowable range of priorities for the current device, and creates two non-blocking CUDA streams with the highest and lowest available priorities.
// get the range of stream priorities for this device
int leastPriority, greatestPriority;
cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

// create streams with highest and lowest available priorities
cudaStream_t st_high, st_low;
cudaStreamCreateWithPriority(&st_high, cudaStreamNonBlocking, greatestPriority);
cudaStreamCreateWithPriority(&st_low, cudaStreamNonBlocking, leastPriority);
3.1.3.2 Explicit Synchronization
As previously outlined, there are a number of ways that streams can synchronize with other streams. The following provides common methods at different levels of granularity:
▶ cudaDeviceSynchronize() waits until all preceding commands in all streams of all host threads have completed.
▶ cudaStreamSynchronize() takes a stream as a parameter and waits until all preceding commands in the given stream have completed. It can be used to synchronize the host with a specific stream, allowing other streams to continue executing on the device.
▶ cudaStreamWaitEvent() takes a stream and an event as parameters (see CUDA Events for a description of events) and makes all the commands added to the given stream after the call to cudaStreamWaitEvent() delay their execution until the given event has completed.
▶ cudaStreamQuery() provides applications with a way to know if all preceding commands in a stream have completed.
3.1.3.3 Implicit Synchronization
Two commands from different streams cannot run concurrently if any one of the following operations is issued in between them by the host thread:
▶ a page-locked host memory allocation
▶ a device memory allocation
▶ a device memory set
▶ a memory copy between two addresses to the same device memory
▶ any CUDA command to the NULL stream
▶ a switch between the L1/shared memory configurations
Operations that require a dependency check include any other commands within the same stream as the launch being checked and any call to cudaStreamQuery() on that stream. Therefore, applications should follow these guidelines to improve their potential for concurrent kernel execution:
▶ All independent operations should be issued before dependent operations,
▶ Synchronization of any kind should be delayed as long as possible.
3.1.4. Programmatic Dependent Kernel Launch
As we have discussed earlier, the semantics of CUDA streams are such that kernels execute in order. This is so that if we have two successive kernels, where the second kernel depends on results from the first one, the programmer can be safe in the knowledge that by the time the second kernel starts executing, the dependent data will be available. However, it may be the case that the first kernel has already written the data on which a subsequent kernel depends to global memory while it still has more work to do. Likewise, the dependent second kernel may have some independent work to do before it needs the data from the first kernel. In such a situation it is possible to partially overlap the execution of the two kernels (assuming that hardware resources are available). The overlapping can also hide the launch overheads of the second kernel. Other than the availability of hardware resources, the degree of overlap which can be achieved depends on the specific structure of the kernels, such as:
▶ when in its execution does the first kernel finish the work on which the second kernel depends?
▶ when in its execution does the second kernel start working on the data from the first kernel?
Since this is very much dependent on the specific kernels in question, it is difficult to automate completely, and hence CUDA provides a mechanism to allow the application developer to specify the synchronization point between the two kernels. This is done via a technique known as Programmatic Dependent Kernel Launch (PDL). The situation is depicted in the figure below.
PDL has three main components.
i) The first kernel (the so-called primary kernel) needs to call a special function to indicate that it is done with everything that the subsequent dependent kernels (also called secondary kernels) will need. This is done by calling the function cudaTriggerProgrammaticLaunchCompletion().
ii) In turn, the dependent secondary kernel needs to indicate that it has reached the portion of its work which is independent of the primary kernel and that it is now waiting on the primary kernel to finish the work on which it depends. This is done with the function cudaGridDependencySynchronize().
iii) The secondary kernel needs to be launched with a special attribute cudaLaunchAttributeProgrammaticStreamSerialization with its programmaticStreamSerializationAllowed field set to '1'.
The following code snippet shows an example of how this can be done.
Listing 3: Example of Programmatic Dependent Kernel Launch with two Kernels

__global__ void primary_kernel() {
    // Initial work that should finish before starting secondary kernel

    // Trigger the secondary kernel
    cudaTriggerProgrammaticLaunchCompletion();

    // Work that can coincide with the secondary kernel
}

__global__ void secondary_kernel()
{
    // Initialization, independent work, etc.

    // Will block until all primary kernels the secondary kernel is dependent on
    // have completed and flushed results to global memory
    cudaGridDependencySynchronize();

    // Dependent work
}

// Launch the secondary kernel with the special attribute
// Set up the attribute
cudaLaunchAttribute attribute[1];
attribute[0].id = cudaLaunchAttributeProgrammaticStreamSerialization;
attribute[0].val.programmaticStreamSerializationAllowed = 1;

// Set the attribute in a kernel launch configuration
cudaLaunchConfig_t config = {0};
// Base launch configuration
config.gridDim = grid_dim;
config.blockDim = block_dim;
config.dynamicSmemBytes = 0;
config.stream = stream;

// Add special attribute for PDL
config.attrs = attribute;
config.numAttrs = 1;

// Launch primary kernel
primary_kernel<<<grid_dim, block_dim, 0, stream>>>();

// Launch secondary (dependent) kernel using the configuration with
// the attribute
cudaLaunchKernelEx(&config, secondary_kernel);
3.1.5. Batched Memory Transfers
A common pattern in CUDA development is to use a technique of batching. By batching we loosely mean that we have several (typically small) tasks grouped together into a single (typically bigger) operation. The components of the batch do not necessarily all have to be identical, although they often are. An example of this idea is the batch matrix multiplication operation provided by cuBLAS.
Generally, as with CUDA Graphs and PDL, the purpose of batching is to reduce overheads associated with dispatching the individual batch tasks separately. In terms of memory transfers, launching a memory transfer can incur some CPU and driver overheads. Further, the regular cudaMemcpyAsync() function in its current form does not necessarily provide enough information for the driver to optimize the transfer, for example, in terms of hints about the source and destination. On Tegra platforms one has the choice of using SMs or Copy Engines (CEs) to perform transfers. The choice of which is currently specified by a heuristic in the driver. This can be important because using the SMs may result in a faster transfer, however it ties down some of the available compute power. On the other hand, using the CEs may result in a slower transfer but overall higher application performance, since it leaves the SMs free to perform other work.
These considerations motivated the design of the cudaMemcpyBatchAsync() function (and its relative cudaMemcpyBatch3DAsync()). These functions allow batched memory transfers to be optimized. Apart from the lists of source and destination pointers, the API uses memory copy attributes to specify expectations of orderings, with hints for source and destination locations, as well as for whether one prefers to overlap the transfer with compute (something that is currently only supported on Tegra platforms with CEs).
Let us first consider the simplest case of a simple batch transfer of data from pinned host memory to pinned device memory.
Listing 4: Example of Homogeneous Batched Memory Transfer from Pinned Host Memory to Pinned Device Memory

std::vector<void *> srcs(batch_size);
std::vector<void *> dsts(batch_size);
std::vector<size_t> sizes(batch_size);

// Allocate the source and destination buffers and
// initialize the source buffers
for (size_t i = 0; i < batch_size; i++) {
    cudaMallocHost(&srcs[i], sizes[i]);
    cudaMalloc(&dsts[i], sizes[i]);
    cudaMemsetAsync(srcs[i], 0, sizes[i], stream);
}

// Setup attributes for this batch of copies
cudaMemcpyAttributes attrs = {};
attrs.srcAccessOrder = cudaMemcpySrcAccessOrderStream;

// All copies in the batch have the same copy attributes.
size_t attrsIdxs = 0; // Index of the attributes

// Launch the batched memory transfer
cudaMemcpyBatchAsync(&dsts[0], &srcs[0], &sizes[0], batch_size,
    &attrs, &attrsIdxs, 1 /*numAttrs*/, nullptr /*failIdx*/, stream);
The first few parameters to the cudaMemcpyBatchAsync() function are immediately sensible. They comprise arrays containing the source and destination pointers, as well as the transfer sizes.
Each array has to have batch_size elements. The new information comes from the attributes. The function needs a pointer to an array of attributes, and a corresponding array of attribute indices. In principle it is also possible to pass an array of size_t in which the indices of any failed transfers can be recorded; however, it is safe to pass a nullptr here, in which case the indices of failures will simply not be recorded.
Turning to the attributes, in this instance the transfers are homogeneous, so we use only one attribute, which will apply to all the transfers. This is controlled by the attrsIdxs parameter. In principle this can be an array. Element i of the array contains the index of the first transfer to which the i-th element of the attribute array applies. In this case, attrsIdxs is treated as a single-element array, with the value '0' meaning that attribute[0] will apply to all transfers with index 0 and up, in other words all the transfers.
Finally, we note that we have set the srcAccessOrder attribute to cudaMemcpySrcAccessOrderStream. This means that the source data will be accessed in regular stream order. In other words, the memcpy will block until previous kernels dealing with the data from any of these source and destination pointers are completed.
In the next example we will consider a more complex case of a heterogeneous batch transfer.
Listing 5: Example of Heterogeneous Batched Memory Transfer using some Ephemeral Host Memory to Pinned Device Memory

std::vector<void *> srcs(batch_size);
std::vector<void *> dsts(batch_size);
std::vector<size_t> sizes(batch_size);

// Allocate the src and dst buffers
for (size_t i = 0; i < batch_size - 10; i++) {
    cudaMallocHost(&srcs[i], sizes[i]);
    cudaMalloc(&dsts[i], sizes[i]);
}

int buffer[10];
for (size_t i = batch_size - 10; i < batch_size; i++) {
    srcs[i] = &buffer[10 - (batch_size - i)];
    cudaMalloc(&dsts[i], sizes[i]);
}

// Setup attributes for this batch of copies
cudaMemcpyAttributes attrs[2] = {};
attrs[0].srcAccessOrder = cudaMemcpySrcAccessOrderStream;
attrs[1].srcAccessOrder = cudaMemcpySrcAccessOrderDuringApiCall;

size_t attrsIdxs[2];
attrsIdxs[0] = 0;
attrsIdxs[1] = batch_size - 10;

// Launch the batched memory transfer
cudaMemcpyBatchAsync(&dsts[0], &srcs[0], &sizes[0], batch_size,
    attrs, attrsIdxs, 2 /*numAttrs*/, nullptr /*failIdx*/, stream);
Here we have two kinds of transfers: batch_size - 10 transfers from pinned host memory to pinned device memory, and 10 transfers from a host array to pinned device memory. Further, the buffer array is not only on the host but is only in existence in the current scope; its address is what is known as an ephemeral pointer. This pointer may not be valid after the API call completes (it is asynchronous). To perform the copies with such ephemeral pointers, the srcAccessOrder in the attribute must be set to cudaMemcpySrcAccessOrderDuringApiCall.
We now have two attributes: the first one applies to all transfers with indices starting at 0 and less than batch_size - 10. The second one applies to all transfers with indices starting at batch_size - 10 and less than batch_size.
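To make the index mapping concrete, the attribute lookup implied by the attribute-index array can be sketched as a small host-side function (illustrative only, not a CUDA API):

```cpp
#include <cstddef>

// Given the attrsIdxs array passed to cudaMemcpyBatchAsync, attribute k
// applies to every transfer whose index is at least attrsIdxs[k] and, for
// k < numAttrs - 1, less than attrsIdxs[k + 1].
size_t attributeFor(size_t transfer, const size_t *attrsIdxs, size_t numAttrs) {
    size_t k = 0;
    while (k + 1 < numAttrs && transfer >= attrsIdxs[k + 1])
        ++k;
    return k;
}
```

For Listing 5 with batch_size = 32, attrsIdxs is {0, 22}: transfers 0 through 21 use attribute 0 and transfers 22 through 31 use attribute 1.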
If instead of allocating the buffer array from the stack, we had allocated it from the heap using malloc, the data would not be ephemeral anymore: it would be valid until the pointer was explicitly freed. In such a case, the best option for staging the copies depends on the system. If the system has hardware-managed memory or coherent GPU access to host memory via address translation, it would be best to use stream ordering; if it does not, staging the transfers immediately would make the most sense. In this situation, one should use the value cudaMemcpySrcAccessOrderAny for the srcAccessOrder of the attribute.
The cudaMemcpyBatchAsync function also allows the programmer to provide hints about the source and destination locations. This is done by setting the srcLocHint and dstLocHint fields of the cudaMemcpyAttributes structure. These fields are both of type cudaMemLocation, which is a structure that contains the type of the location and the ID of the location. This is the same cudaMemLocation structure that can be used to give prefetching hints to the runtime when using cudaMemPrefetchAsync(). We illustrate how to set up the hints for a transfer from the device to a specific NUMA node of the host in the code example below:
Listing 6: Example of Setting Source and Destination Location Hints

// Allocate the source and destination buffers
std::vector<void *> srcs(batch_size);
std::vector<void *> dsts(batch_size);
std::vector<size_t> sizes(batch_size);

// cudaMemLocation structures we will use to provide location hints
// Device device_id
cudaMemLocation srcLoc = {cudaMemLocationTypeDevice, dev_id};
// Host with NUMA node numa_id
cudaMemLocation dstLoc = {cudaMemLocationTypeHostNuma, numa_id};

// Allocate the src and dst buffers
for (size_t i = 0; i < batch_size; i++) {
    cudaMallocManaged(&srcs[i], sizes[i]);
    cudaMallocManaged(&dsts[i], sizes[i]);
    cudaMemPrefetchAsync(srcs[i], sizes[i], srcLoc, 0, stream);
    cudaMemPrefetchAsync(dsts[i], sizes[i], dstLoc, 0, stream);
    cudaMemsetAsync(srcs[i], 0, sizes[i], stream);
}

// Setup attributes for this batch of copies
cudaMemcpyAttributes attrs = {};
// These are managed memory pointers so Stream Order is appropriate
attrs.srcAccessOrder = cudaMemcpySrcAccessOrderStream;
// Now we can specify the location hints here.
attrs.srcLocHint = srcLoc;
attrs.dstLocHint = dstLoc;

// All copies in the batch have same copy attributes.
size_t attrsIdxs = 0;

// Launch the batched memory transfer
cudaMemcpyBatchAsync(&dsts[0], &srcs[0], &sizes[0], batch_size,
    &attrs, &attrsIdxs, 1 /*numAttrs*/, nullptr /*failIdx*/, stream);
The last thing to cover is the flag for hinting whether we want to use SMs or CEs for the transfers. The field for this is the flags field of the cudaMemcpyAttributes structure, and the possible values are:
▶ cudaMemcpyFlagDefault – default behavior
▶ cudaMemcpyFlagPreferOverlapWithCompute – this hints that the system should prefer to use CEs for the transfers, overlapping the transfer with computations. This flag is ignored on non-Tegra platforms.
In summary, the main points regarding cudaMemcpyBatchAsync are as follows:
▶ The cudaMemcpyBatchAsync function (and its 3D variant) allows the programmer to specify a batch of memory transfers, allowing the amortization of transfer setup overheads.
▶ Other than the source and destination pointers and the transfer sizes, the function can take one or more memory copy attributes providing information about the kind of memory being transferred and the corresponding stream ordering behavior of the source pointers, hints about the source and destination locations, and hints as to whether to prefer to overlap the transfer with compute (if possible) or whether to use SMs for the transfer.
▶ Given the above information the runtime can attempt to optimize the transfer to the maximum degree possible.
3.1.6. Environment Variables
CUDA provides various environment variables (see Section 5.2), which can affect execution and performance. If they are not explicitly set, CUDA uses reasonable default values for them, but special handling may be required on a per-case basis, e.g., for debugging purposes or to get improved performance.
For example, increasing the value of the CUDA_DEVICE_MAX_CONNECTIONS environment variable may be necessary to reduce the possibility that independent work from different CUDA streams gets serialized due to false dependencies. Such false dependencies may be introduced when the same underlying resource(s) are used. It is recommended to start by using the default value and only explore the impact of this environment variable in case of performance issues (e.g., unexpected serialization of independent work across CUDA streams that cannot be attributed to other factors like lack of available SM resources). It is worth noting that this environment variable has a different (lower) default value in case of MPS.
Similarly, setting the CUDA_MODULE_LOADING environment variable to EAGER may be preferable for latency-sensitive applications, in order to move all overhead due to module loading to the application initialization phase and outside its critical phase. The current default mode is lazy module loading. In this default mode, a similar effect to eager module loading could be achieved by adding "warm-up" calls of the various kernels during the application's initialization phase, to force module loading to happen sooner.
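For instance, eager module loading can be enabled from the shell when launching the application (./my_app is a placeholder for your binary):

```shell
# Load all CUDA modules eagerly at startup instead of lazily on first use.
CUDA_MODULE_LOADING=EAGER ./my_app
```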
Please refer to CUDA Environment Variables for more details about the various CUDA environment variables. It is recommended that you set the environment variables to new values before you launch the application; attempting to set them within your application may have no effect.
3.2. Advanced Kernel Programming
This chapter will first take a deeper dive into the hardware model of NVIDIA GPUs, and then introduce some of the more advanced features available in CUDA kernel code aimed at improving kernel performance. This chapter will introduce some concepts related to thread scopes, asynchronous execution, and the associated synchronization primitives. These conceptual discussions provide a necessary foundation for some of the advanced performance features available within kernel code.
Detailed descriptions for some of these features are contained in chapters dedicated to the features in the next part of this programming guide.
▶ Advanced synchronization primitives introduced in this chapter are covered completely in Section 4.9 and Section 4.10.
▶ Asynchronous data copies, including the tensor memory accelerator (TMA), are introduced in this chapter and covered completely in Section 4.11.
3.2.1. Using PTX
Parallel Thread Execution (PTX), the virtual machine instruction set architecture (ISA) that CUDA uses to abstract hardware ISAs, was introduced in Section 1.3.3. Writing code in PTX directly is a highly advanced optimization technique that is not necessary for most developers and should be considered a tool of last resort. Nevertheless, there are situations where the fine-grained control enabled by writing PTX directly enables performance improvements in specific applications. These situations are typically in very performance-sensitive portions of an application where every fraction of a percent of performance improvement has significant benefits. All of the available PTX instructions are in the PTX ISA document.
cuda::ptx namespace
One way to use PTX directly in your code is to use the cuda::ptx namespace from libcu++. This namespace provides C++ functions that map directly to PTX instructions, simplifying their use within a C++ application. For more information, please refer to the cuda::ptx namespace documentation.
Inline PTX
Another way to include PTX in your code is to use inline PTX. This method is described in detail in the corresponding documentation. This is very similar to writing assembly code on a CPU.
3.2.2. Hardware Implementation
A streaming multiprocessor or SM (see GPU Hardware Model) is designed to execute hundreds of threads concurrently. To manage such a large number of threads, it employs a unique parallel computing model called Single-Instruction, Multiple-Thread, or SIMT, that is described in SIMT Execution Model. The instructions are pipelined, leveraging instruction-level parallelism within a single thread, as well as extensive thread-level parallelism through simultaneous hardware multithreading as detailed in Hardware Multithreading. Unlike CPU cores, SMs issue instructions in order and do not perform branch prediction or speculative execution.
Sections SIMT Execution Model and Hardware Multithreading describe the architectural features of the SM that are common to all devices. Section Compute Capabilities provides the specifics for devices of different compute capabilities.
The NVIDIA GPU architecture uses a little-endian representation.
3.2.2.1 SIMT Execution Model
Each SM creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps. Individual threads composing a warp start together at the same program address, but they have their own instruction address counter and register state and are therefore free to branch and execute independently. The term warp originates from weaving, the first parallel thread technology. A half-warp is either the first or second half of a warp. A quarter-warp is either the first, second, third, or fourth quarter of a warp.
A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp executes each branch path taken, disabling threads that are not on that path. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths.
The SIMT architecture is akin to SIMD (Single Instruction, Multiple Data) vector organizations in that a single instruction controls multiple processing elements. A key difference is that SIMD vector organizations expose the SIMD width to the software, whereas SIMT instructions specify the execution and branching behavior of a single thread. In contrast with SIMD vector machines, SIMT enables programmers to write thread-level parallel code for independent, scalar threads, as well as data-parallel code for coordinated threads. For the purposes of correctness, the programmer can essentially ignore the SIMT behavior; however, substantial performance improvements can be realized by taking care that the code seldom requires threads in a warp to diverge. In practice, this is analogous to the role of cache lines: the cache line size can be safely ignored when designing for correctness but must be considered in the code structure when designing for peak performance. Vector architectures, on the other hand, require the software to coalesce loads into vectors and manage divergence manually.
3.2.2.1.1 Independent Thread Scheduling
On GPUs with compute capability lower than 7.0, warps used a single program counter shared amongst all 32 threads in the warp together with an active mask specifying the active threads of the warp. As a result, threads from the same warp in divergent regions or different states of execution cannot signal each other or exchange data, and algorithms requiring fine-grained sharing of data guarded by locks or mutexes can lead to deadlock, depending on which warp the contending threads come from.
In GPUs of compute capability 7.0 and later, independent thread scheduling allows full concurrency between threads, regardless of warp. With independent thread scheduling, the GPU maintains execution state per thread, including a program counter and call stack, and can yield execution at a per-thread granularity, either to make better use of execution resources or to allow one thread to wait for data to be produced by another. A schedule optimizer determines how to group active threads from the same warp together into SIMT units. This retains the high throughput of SIMT execution as in prior NVIDIA GPUs, but with much more flexibility: threads can now diverge and reconverge at sub-warp granularity.
Independent thread scheduling can break code that relies on implicit warp-synchronous behavior from previous GPU architectures. Warp-synchronous code assumes that threads in the same warp execute in lockstep at every instruction, but the ability for threads to diverge and reconverge at sub-warp granularity makes such assumptions invalid. This can lead to a different set of threads participating in the executed code than intended. Any warp-synchronous code developed for GPUs prior to CC 7.0 (such as synchronization-free intra-warp reductions) should be revisited to ensure compatibility. Developers should explicitly synchronize such code using __syncwarp() to ensure correct behavior across all GPU generations.
Note

The threads of a warp that are participating in the current instruction are called the active threads, whereas threads not on the current instruction are inactive (disabled). Threads can be inactive for a variety of reasons including having exited earlier than other threads of their warp, having taken a different branch path than the branch path currently executed by the warp, or being the last threads of a block whose number of threads is not a multiple of the warp size.

If a non-atomic instruction executed by a warp writes to the same location in global or shared memory from more than one of the threads of the warp, the number of serialized writes that occur to that location may vary depending on the compute capability of the device. However, for all compute capabilities, which thread performs the final write is undefined.

If an atomic instruction executed by a warp reads, modifies, and writes to the same location in global memory for more than one of the threads of the warp, each read/modify/write to that location occurs and they are all serialized, but the order in which they occur is undefined.
3.2.2.2 Hardware Multithreading

When an SM is given one or more thread blocks to execute, it partitions them into warps and each warp gets scheduled for execution by a warp scheduler. The way a block is partitioned into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0. Thread Hierarchy describes how thread IDs relate to thread indices in the block.
The total number of warps in a block is defined as follows:

    ceil(T / Wsize, 1)

where
▶ T is the number of threads per block,
▶ Wsize is the warp size, which is equal to 32,
▶ ceil(x, y) is equal to x rounded up to the nearest multiple of y.
Figure 19: A thread block is partitioned into warps of 32 threads.
The execution context (program counters, registers, etc.) for each warp processed by an SM is maintained on-chip throughout the warp's lifetime. Therefore, switching between warps incurs no cost. At each instruction issue cycle, a warp scheduler selects a warp with threads ready to execute its next instruction (the active threads of the warp) and issues the instruction to those threads.

Each SM has a set of 32-bit registers that are partitioned among the warps, and a shared memory that is partitioned among the thread blocks. The number of blocks and warps that can reside and be processed concurrently on the SM for a given kernel depends on the amount of registers and shared memory used by the kernel, as well as the amount of registers and shared memory available on the SM. There are also a maximum number of resident blocks and warps per SM. These limits, as well as the amount of registers and shared memory available on the SM, depend on the compute capability of the device and are specified in Compute Capabilities. If there are not enough resources available per SM to process at least one block, the kernel will fail to launch. The total number of registers and shared memory allocated for a block can be determined in several ways documented in the Occupancy section.
3.2.2.3 Asynchronous Execution Features

Recent NVIDIA GPU generations have included asynchronous execution capabilities to allow more overlap of data movement, computation, and synchronization within the GPU. These capabilities enable certain operations invoked from GPU code to execute asynchronously to other GPU code in the same thread block. This asynchronous execution should not be confused with asynchronous CUDA APIs discussed in Section 2.3, which enable GPU kernel launches or memory operations to operate asynchronously to each other or to the CPU.

Compute capability 8.0 (the NVIDIA Ampere GPU architecture) introduced hardware-accelerated asynchronous data copies from global to shared memory and asynchronous barriers (see NVIDIA A100 Tensor Core GPU Architecture).

Compute capability 9.0 (the NVIDIA Hopper GPU architecture) extended the asynchronous execution features with the Tensor Memory Accelerator (TMA) unit, which can transfer large blocks of data and multidimensional tensors from global memory to shared memory and vice versa, asynchronous transaction barriers, and asynchronous matrix multiply-accumulate operations (see the Hopper Architecture in Depth blog post for details).

CUDA provides APIs which can be called by threads from device code to use these features. The asynchronous programming model defines the behavior of asynchronous operations with respect to CUDA threads.
An asynchronous operation is an operation initiated by a CUDA thread, but executed asynchronously as if by another thread, which we will refer to as an async thread. In a well-formed program, one or more CUDA threads synchronize with the asynchronous operation. The CUDA thread that initiated the asynchronous operation is not required to be among the synchronizing threads. The async thread is always associated with the CUDA thread that initiated the operation.

An asynchronous operation uses a synchronization object to signal its completion, which could be a barrier or a pipeline. These synchronization objects are explained in detail in Advanced Synchronization Primitives, and their role in performing asynchronous memory operations is demonstrated in Asynchronous Data Copies.
3.2.2.3.1 Async Thread and Async Proxy

Asynchronous operations may access memory differently than regular operations. To distinguish between these different memory access methods, CUDA introduces the concepts of an async thread, a generic proxy, and an async proxy. Normal operations (loads and stores) go through the generic proxy. Some asynchronous instructions, such as LDGSTS and STAS/REDAS, are modeled using an async thread operating in the generic proxy. Other asynchronous instructions, such as bulk-asynchronous copies with TMA and some tensor core operations (tcgen05.*, wgmma.mma_async.*), are modeled using an async thread operating in the async proxy.

Async thread operating in the generic proxy. When an asynchronous operation is initiated, it is associated with an async thread, which is different from the CUDA thread that initiated the operation. Preceding generic proxy (normal) loads and stores to the same address are guaranteed to be ordered before the asynchronous operation. However, subsequent normal loads and stores to the same address are not guaranteed to maintain their ordering, potentially incurring a race condition until the async thread completes.

Async thread operating in the async proxy. When an asynchronous operation is initiated, it is associated with an async thread, which is different from the CUDA thread that initiated the operation. Prior and subsequent normal loads and stores to the same address are not guaranteed to maintain their ordering. A proxy fence is required to synchronize them across the different proxies to ensure proper memory ordering. Section Using the Tensor Memory Accelerator (TMA) demonstrates use of proxy fences to ensure correctness when performing asynchronous copies with TMA.

For more details on these concepts, see the PTX ISA documentation.
3.2.3. Thread Scopes

CUDA threads form a Thread Hierarchy, and using this hierarchy is essential for writing both correct and performant CUDA kernels. Within this hierarchy, the visibility and synchronization scope of memory operations can vary. To account for this non-uniformity, the CUDA programming model introduces the concept of thread scopes. A thread scope defines which threads can observe a thread's loads and stores and specifies which threads can synchronize with each other using synchronization primitives such as atomic operations and barriers. Each scope has an associated point of coherency in the memory hierarchy.

Thread scopes are exposed in CUDA PTX and are also available as extensions in the libcu++ library. The following table defines the thread scopes available:
| CUDA C++ Thread Scope | CUDA PTX Scope | Description | Point of Coherency in Memory Hierarchy |
| --- | --- | --- | --- |
| cuda::thread_scope_thread | | Memory operations are visible only to the local thread. | |
| cuda::thread_scope_block | .cta | Memory operations are visible to other threads in the same thread block. | L1 |
| cuda::thread_scope_cluster | .cluster | Memory operations are visible to other threads in the same thread block cluster. | L2 |
| cuda::thread_scope_device | .gpu | Memory operations are visible to other threads in the same GPU device. | L2 |
| cuda::thread_scope_system | .sys | Memory operations are visible to other threads in the same system (CPU, other GPUs). | L2 + connected caches |
Sections Advanced Synchronization Primitives and Asynchronous Data Copies demonstrate use of thread scopes.
3.2.4. Advanced Synchronization Primitives

This section introduces three families of synchronization primitives:
▶ Scoped Atomics, which pair C++ memory ordering with CUDA thread scopes to safely communicate across threads at block, cluster, device, or system scope (see Thread Scopes).
▶ Asynchronous Barriers, which split synchronization into arrival and wait phases, and can be used to track the progress of asynchronous operations.
▶ Pipelines, which stage work and coordinate multi-buffer producer-consumer patterns, commonly used to overlap compute with asynchronous data copies.
3.2.4.1 Scoped Atomics

Section 5.4.5 gives an overview of atomic functions available in CUDA. In this section, we will focus on scoped atomics that support C++ standard atomic memory semantics, available through the libcu++ library or through compiler built-in functions. Scoped atomics provide the tools for efficient synchronization at the appropriate level of the CUDA thread hierarchy, enabling both correctness and performance in complex parallel algorithms.
3.2.4.1.1 Thread Scope and Memory Ordering

Scoped atomics combine two key concepts:
▶ Thread Scope: defines which threads can observe the effect of the atomic operation (see Thread Scopes).
▶ Memory Ordering: defines the ordering constraints relative to other memory operations (see C++ standard atomic memory semantics).
CUDA C++ cuda::atomic

#include <cuda/atomic>

__global__ void block_scoped_counter() {
    // Shared atomic counter visible only within this block
    __shared__ cuda::atomic<int, cuda::thread_scope_block> counter;

    // Initialize counter (only one thread should do this)
    if (threadIdx.x == 0) {
        counter.store(0, cuda::memory_order_relaxed);
    }
    __syncthreads();

    // All threads in block atomically increment
    int old_value = counter.fetch_add(1, cuda::memory_order_relaxed);
    // Use old_value...
}
Built-in Atomic Functions

__global__ void block_scoped_counter() {
    // Shared counter visible only within this block
    __shared__ int counter;

    // Initialize counter (only one thread should do this)
    if (threadIdx.x == 0) {
        __nv_atomic_store_n(&counter, 0,
                            __NV_ATOMIC_RELAXED,
                            __NV_THREAD_SCOPE_BLOCK);
    }
    __syncthreads();

    // All threads in block atomically increment
    int old_value = __nv_atomic_fetch_add(&counter, 1,
                                          __NV_ATOMIC_RELAXED,
                                          __NV_THREAD_SCOPE_BLOCK);
    // Use old_value...
}
This example implements a block-scoped atomic counter that demonstrates the fundamental concepts of scoped atomics:
▶ Shared Variable: a single counter is shared among all threads in the block using __shared__ memory.
▶ Atomic Type Declaration: cuda::atomic<int, cuda::thread_scope_block> creates an atomic integer with block-level visibility.
▶ Single Initialization: only thread 0 initializes the counter to prevent race conditions during setup.
▶ Block Synchronization: __syncthreads() ensures all threads see the initialized counter before proceeding.
▶ Atomic Increment: each thread atomically increments the counter and receives the previous value.

cuda::memory_order_relaxed is chosen here because we only need atomicity (indivisible read-modify-write) without ordering constraints between different memory locations. Since this is a straightforward counting operation, the order of increments doesn't matter for correctness.
For producer-consumer patterns, acquire-release semantics ensure proper ordering:

CUDA C++ cuda::atomic
__global__ void producer_consumer() {
    __shared__ int data;
    __shared__ cuda::atomic<bool, cuda::thread_scope_block> ready;

    if (threadIdx.x == 0) {
        // Producer: write data then signal ready
        data = 42;
        ready.store(true, cuda::memory_order_release); // Release ensures data write is visible
    } else {
        // Consumer: wait for ready signal then read data
        while (!ready.load(cuda::memory_order_acquire)) { // Acquire ensures data read sees the write
            // spin wait
        }
        int value = data;
        // Process value...
    }
}
Built-in Atomic Functions

__global__ void producer_consumer() {
    __shared__ int data;
    __shared__ bool ready; // Only ready flag needs atomic operations

    if (threadIdx.x == 0) {
        // Producer: write data then signal ready
        data = 42;
        __nv_atomic_store_n(&ready, true,
                            __NV_ATOMIC_RELEASE,
                            __NV_THREAD_SCOPE_BLOCK); // Release ensures data write is visible
    } else {
        // Consumer: wait for ready signal then read data
        while (!__nv_atomic_load_n(&ready,
                                   __NV_ATOMIC_ACQUIRE,
                                   __NV_THREAD_SCOPE_BLOCK)) { // Acquire ensures data read sees the write
            // spin wait
        }
        int value = data;
        // Process value...
    }
}
3.2.4.1.2 Performance Considerations

▶ Use the narrowest scope possible: block-scoped atomics are much faster than system-scoped atomics.
▶ Prefer weaker orderings: use stronger orderings only when necessary for correctness.
▶ Consider memory location: shared memory atomics are faster than global memory atomics.
3.2.4.2 Asynchronous Barriers

An asynchronous barrier differs from a typical single-stage barrier (__syncthreads()) in that the notification by a thread that it has reached the barrier (the "arrival") is separated from the operation of waiting for other threads to arrive at the barrier (the "wait"). This separation increases execution efficiency by allowing a thread to perform additional operations unrelated to the barrier, making more efficient use of the wait time. Asynchronous barriers can be used to implement producer-consumer patterns with CUDA threads or enable asynchronous data copies within the memory hierarchy by having the copy operation signal ("arrive on") a barrier upon completion.

Asynchronous barriers are available on devices of compute capability 7.0 or higher. Devices of compute capability 8.0 or higher provide hardware acceleration for asynchronous barriers in shared memory and a significant advancement in synchronization granularity, by allowing hardware-accelerated synchronization of any subset of CUDA threads within the block. Previous architectures only accelerate synchronization at a whole-warp (__syncwarp()) or whole-block (__syncthreads()) level.

The CUDA programming model provides asynchronous barriers via cuda::std::barrier, an ISO C++-conforming barrier available in the libcu++ library. In addition to implementing std::barrier, the library offers CUDA-specific extensions to select a barrier's thread scope to improve performance and exposes a lower-level cuda::ptx API. A cuda::barrier can interoperate with cuda::ptx by using the friend function cuda::device::barrier_native_handle() to retrieve the barrier's native handle and pass it to cuda::ptx functions. CUDA also provides a primitives API for asynchronous barriers in shared memory at thread-block scope.
The following table gives an overview of asynchronous barriers available for synchronizing at different thread scopes.

| Thread Scope | Memory Location | Arrive on Barrier | Wait on Barrier | Hardware-accelerated | CUDA APIs |
| --- | --- | --- | --- | --- | --- |
| block | local shared memory | allowed | allowed | yes (8.0+) | cuda::barrier, cuda::ptx, primitives |
| cluster | local shared memory | allowed | allowed | yes (9.0+) | cuda::barrier, cuda::ptx |
| cluster | remote shared memory | allowed | not allowed | yes (9.0+) | cuda::barrier, cuda::ptx |
| device | global memory | allowed | allowed | no | cuda::barrier |
| system | global/unified memory | allowed | allowed | no | cuda::barrier |
Temporal Splitting of Synchronization

Without the asynchronous arrive-wait barriers, synchronization within a thread block is achieved using __syncthreads() or block.sync() when using Cooperative Groups.

#include <cooperative_groups.h>

__global__ void simple_sync(int iteration_count) {
    auto block = cooperative_groups::this_thread_block();
    for (int i = 0; i < iteration_count; ++i) {
        /* code before arrive */
        // Wait for all threads to arrive here.
        block.sync();
        /* code after wait */
    }
}
Threads are blocked at the synchronization point (block.sync()) until all threads have reached the synchronization point. In addition, memory updates that happened before the synchronization point are guaranteed to be visible to all threads in the block after the synchronization point.

This pattern has three stages:
▶ Code before the sync performs memory updates that will be read after the sync.
▶ Synchronization point.
▶ Code after the sync, with visibility of memory updates that happened before the sync.

Using asynchronous barriers instead, the temporally-split synchronization pattern is as follows.
CUDA C++ cuda::barrier

#include <cuda/barrier>
#include <cooperative_groups.h>

__device__ void compute(float *data, int iteration);

__global__ void split_arrive_wait(int iteration_count, float *data)
{
    using barrier_t = cuda::barrier<cuda::thread_scope_block>;
    __shared__ barrier_t bar;
    auto block = cooperative_groups::this_thread_block();
    if (block.thread_rank() == 0)
    {
        // Initialize barrier with expected arrival count.
        init(&bar, block.size());
    }
    block.sync();

    for (int i = 0; i < iteration_count; ++i)
    {
        /* code before arrive */
        // This thread arrives. Arrival does not block a thread.
        barrier_t::arrival_token token = bar.arrive();
        compute(data, i);
        // Wait for all threads participating in the barrier to complete bar.arrive().
        bar.wait(std::move(token));
        /* code after wait */
    }
}
CUDA C++ cuda::ptx

#include <cuda/ptx>
#include <cooperative_groups.h>

__device__ void compute(float *data, int iteration);

__global__ void split_arrive_wait(int iteration_count, float *data)
{
    __shared__ uint64_t bar;
    auto block = cooperative_groups::this_thread_block();
    if (block.thread_rank() == 0)
    {
        // Initialize barrier with expected arrival count.
        cuda::ptx::mbarrier_init(&bar, block.size());
    }
    block.sync();

    for (int i = 0; i < iteration_count; ++i)
    {
        /* code before arrive */
        // This thread arrives. Arrival does not block a thread.
        uint64_t token = cuda::ptx::mbarrier_arrive(&bar);
        compute(data, i);
        // Wait for all threads participating in the barrier to complete mbarrier_arrive().
        while (!cuda::ptx::mbarrier_try_wait(&bar, token)) {}
        /* code after wait */
    }
}
CUDA C Primitives

#include <cuda_awbarrier_primitives.h>
#include <cooperative_groups.h>

__device__ void compute(float *data, int iteration);

__global__ void split_arrive_wait(int iteration_count, float *data)
{
    __shared__ __mbarrier_t bar;
    auto block = cooperative_groups::this_thread_block();
    if (block.thread_rank() == 0)
    {
        // Initialize barrier with expected arrival count.
        __mbarrier_init(&bar, block.size());
    }
    block.sync();

    for (int i = 0; i < iteration_count; ++i)
    {
        /* code before arrive */
        // This thread arrives. Arrival does not block a thread.
        __mbarrier_token_t token = __mbarrier_arrive(&bar);
        compute(data, i);
        // Wait for all threads participating in the barrier to complete __mbarrier_arrive().
        while (!__mbarrier_try_wait(&bar, token, 1000)) {}
        /* code after wait */
    }
}
In this pattern, the synchronization point is split into an arrive point (bar.arrive()) and a wait point (bar.wait(std::move(token))). A thread begins participating in a cuda::barrier with its first call to bar.arrive(). When a thread calls bar.wait(std::move(token)) it will be blocked until participating threads have completed bar.arrive() the expected number of times, which is the expected arrival count argument passed to init(). Memory updates that happen before participating threads' call to bar.arrive() are guaranteed to be visible to participating threads after their call to bar.wait(std::move(token)). Note that the call to bar.arrive() does not block a thread; it can proceed with other work that does not depend upon memory updates that happen before other participating threads' call to bar.arrive().

The arrive and wait pattern has five stages:
▶ Code before the arrive performs memory updates that will be read after the wait.
▶ Arrive point with implicit memory fence (i.e., equivalent to cuda::atomic_thread_fence(cuda::memory_order_seq_cst, cuda::thread_scope_block)).
▶ Code between arrive and wait.
▶ Wait point.
▶ Code after the wait, with visibility of updates that were performed before the arrive.
For a comprehensive guide on how to use asynchronous barriers, see Asynchronous Barriers.
3.2.4.3 Pipelines

The CUDA programming model provides the pipeline synchronization object as a coordination mechanism to sequence asynchronous memory copies into multiple stages, facilitating the implementation of double- or multi-buffering producer-consumer patterns. A pipeline is a double-ended queue with a head and a tail that processes work in a first-in first-out (FIFO) order. Producer threads commit work to the pipeline's head, while consumer threads pull work from the pipeline's tail.

Pipelines are exposed through the cuda::pipeline API in the libcu++ library, as well as through a primitives API. The following tables describe the main functionality of the two APIs.
| cuda::pipeline API | Description |
| --- | --- |
| producer_acquire | Acquires an available stage in the pipeline's internal queue. |
| producer_commit | Commits the asynchronous operations issued after the producer_acquire call on the currently acquired stage of the pipeline. |
| consumer_wait | Waits for completion of asynchronous operations in the oldest stage of the pipeline. |
| consumer_release | Releases the oldest stage of the pipeline to the pipeline object for reuse. The released stage can then be acquired by a producer. |

| Primitives API | Description |
| --- | --- |
| __pipeline_memcpy_async | Requests a memory copy from global to shared memory to be submitted for asynchronous evaluation. |
| __pipeline_commit | Commits the asynchronous operations issued before the call on the current stage of the pipeline. |
| __pipeline_wait_prior(N) | Waits for completion of asynchronous operations in all but the last N commits to the pipeline. |
The cuda::pipeline API has a richer interface with fewer restrictions, while the primitives API only supports tracking asynchronous copies from global memory to shared memory with specific size and alignment requirements. The primitives API provides equivalent functionality to a cuda::pipeline object with cuda::thread_scope_thread.

For detailed usage patterns and examples, see Pipelines.
3.2.5. Asynchronous Data Copies

Efficient data movement within the memory hierarchy is fundamental to achieving high performance in GPU computing. Traditional synchronous memory operations force threads to wait idle during data transfers. GPUs inherently hide memory latency through parallelism. That is, the SM switches to execute another warp while memory operations complete. Even with this latency hiding through parallelism, it is still possible for memory latency to be a bottleneck on both memory bandwidth utilization and compute resource efficiency. To address these bottlenecks, modern GPU architectures provide hardware-accelerated asynchronous data copy mechanisms that allow memory transfers to proceed independently while threads continue executing other work.

Asynchronous data copies enable overlapping of computation with data movement, by decoupling the initiation of a memory transfer from waiting for its completion. This way, threads can perform useful work during memory latency periods, leading to improved overall throughput and resource utilization.
Note

While concepts and principles underlying this section are similar to those discussed in the earlier chapter on Asynchronous Execution, that chapter covered asynchronous execution of kernels and memory transfers such as those invoked by cudaMemcpyAsync. That can be considered asynchrony of different components of the application.

The asynchrony described in this section refers to enabling transfer of data between the GPU's DRAM, i.e. global memory, and on-SM memory such as shared memory or tensor memory without blocking the GPU threads. This is asynchrony within the execution of a single kernel launch.
To understand how asynchronous copies can improve performance, it is helpful to examine a common GPU computing pattern. CUDA applications often employ a copy and compute pattern that:
▶ fetches data from global memory,
▶ stores data to shared memory, and
▶ performs computations on shared memory data, and potentially writes results back to global memory.

The copy phase of this pattern is typically expressed as shared[local_idx] = global[global_idx]. This global to shared memory copy is expanded by the compiler to a read from global memory into a register followed by a write to shared memory from the register.

When this pattern occurs within an iterative algorithm, each thread block needs to synchronize after the shared[local_idx] = global[global_idx] assignment, to ensure all writes to shared memory have completed before the compute phase can begin. The thread block also needs to synchronize again after the compute phase, to prevent overwriting shared memory before all threads have completed their computations. This pattern is illustrated in the following code snippet.
#include <cooperative_groups.h>

__device__ void compute(int* global_out, int const* shared_in) {
    // Computes using all values of current batch from shared memory.
    // Stores this thread's result back to global memory.
}

__global__ void without_async_copy(int* global_out, int const* global_in, size_t size, size_t batch_sz) {
    auto grid = cooperative_groups::this_grid();
    auto block = cooperative_groups::this_thread_block();
    assert(size == batch_sz * grid.size()); // Exposition: input size fits batch_sz * grid_size

    extern __shared__ int shared[]; // block.size() * sizeof(int) bytes
    size_t local_idx = block.thread_rank();

    for (size_t batch = 0; batch < batch_sz; ++batch) {
        // Compute the index of the current batch for this block in global memory.
        size_t block_batch_idx = block.group_index().x * block.size() + grid.size() * batch;
        size_t global_idx = block_batch_idx + threadIdx.x;
        shared[local_idx] = global_in[global_idx];

        // Wait for all copies to complete.
        block.sync();

        // Compute and write result to global memory.
        compute(global_out + block_batch_idx, shared);

        // Wait for compute using shared memory to finish.
        block.sync();
    }
}
With asynchronous data copies, data movement from global memory to shared memory can be done asynchronously to enable more efficient use of the SM while waiting for data to arrive.
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

__device__ void compute(int* global_out, int const* shared_in) {
    // Computes using all values of current batch from shared memory.
    // Stores this thread's result back to global memory.
}

__global__ void with_async_copy(int* global_out, int const* global_in, size_t size, size_t batch_sz) {
    auto grid = cooperative_groups::this_grid();
    auto block = cooperative_groups::this_thread_block();
    assert(size == batch_sz * grid.size()); // Exposition: input size fits batch_sz * grid_size

    extern __shared__ int shared[]; // block.size() * sizeof(int) bytes
    size_t local_idx = block.thread_rank();

    for (size_t batch = 0; batch < batch_sz; ++batch) {
        // Compute the index of the current batch for this block in global memory.
        size_t block_batch_idx = block.group_index().x * block.size() + grid.size() * batch;

        // Whole thread-group cooperatively copies whole batch to shared memory.
        cooperative_groups::memcpy_async(block, shared, global_in + block_batch_idx, block.size());

        // Compute on different data while waiting.

        // Wait for all copies to complete.
        cooperative_groups::wait(block);

        // Compute and write result to global memory.
        compute(global_out + block_batch_idx, shared);

        // Wait for compute using shared memory to finish.
        block.sync();
    }
}
The cooperative_groups::memcpy_async function copies block.size() elements from global memory to the shared data. This operation happens as-if performed by another thread, which synchronizes with the current thread's call to cooperative_groups::wait after the copy has completed. Until the copy operation completes, modifying the global data or reading or writing the shared data introduces a data race.

This example illustrates the fundamental concept behind all asynchronous copy operations: they decouple memory transfer initiation from completion, allowing threads to perform other work while data moves in the background. The CUDA programming model provides several APIs to access these capabilities, including memcpy_async functions available in Cooperative Groups and the libcu++ library, as well as lower-level cuda::ptx and primitives APIs. These APIs share similar semantics: they copy objects from source to destination as-if performed by another thread which, on completion of the copy, can be synchronized using different completion mechanisms.
Modern GPU architectures provide multiple hardware mechanisms for asynchronous data movement.

▶ LDGSTS (compute capability 8.0+) allows for efficient small-scale asynchronous transfers from global to shared memory.
▶ The tensor memory accelerator (TMA, compute capability 9.0+) extends these capabilities, providing bulk-asynchronous copy operations optimized for large multi-dimensional data transfers.
▶ STAS instructions (compute capability 9.0+) enable small-scale asynchronous transfers from registers to distributed shared memory within a cluster.

These mechanisms support different data paths, transfer sizes, and alignment requirements, allowing developers to choose the most appropriate approach for their specific data access patterns. The following table gives an overview of the supported data paths for asynchronous copies within the GPU.
Table 5: Asynchronous copies with possible source and destination memory spaces. An empty cell indicates that a source-destination pair is not supported.

| Source          | Destination     | Asynchronous Copy        | Bulk-Asynchronous Copy |
| --------------- | --------------- | ------------------------ | ---------------------- |
| global          | global          |                          |                        |
| shared::cta     | global          |                          | supported (TMA, 9.0+)  |
| global          | shared::cta     | supported (LDGSTS, 8.0+) | supported (TMA, 9.0+)  |
| global          | shared::cluster |                          | supported (TMA, 9.0+)  |
| shared::cluster | shared::cta     |                          | supported (TMA, 9.0+)  |
| shared::cta     | shared::cta     |                          |                        |
| registers       | shared::cluster | supported (STAS, 9.0+)   |                        |
Sections Using LDGSTS, Using the Tensor Memory Accelerator (TMA), and Using STAS will go into more detail about each mechanism.
3.2.6. Configuring L1/Shared Memory Balance
As mentioned in L1 data cache, the L1 and shared memory on an SM use the same physical resource, known as the unified data cache. On most architectures, if a kernel uses little or no shared memory, the unified data cache can be configured to provide the maximal amount of L1 cache allowed by the architecture.

The unified data cache reserved for shared memory is configurable on a per-kernel basis. An application can set the carveout, or preferred shared memory capacity, with the cudaFuncSetAttribute function called before the kernel is launched.
cudaFuncSetAttribute(kernel_name, cudaFuncAttributePreferredSharedMemoryCarveout, carveout);
The application can set the carveout as an integer percentage of the maximum supported shared memory capacity of that architecture. In addition to an integer percentage, three convenience enums are provided as carveout values.

▶ cudaSharedmemCarveoutDefault
▶ cudaSharedmemCarveoutMaxL1
▶ cudaSharedmemCarveoutMaxShared

The maximum supported shared memory and the supported carveout sizes vary by architecture; see Shared Memory Capacity per Compute Capability for details.

Where a chosen integer percentage carveout does not map exactly to a supported shared memory capacity, the next larger capacity is used. For example, for devices of compute capability 12.0, which have a maximum shared memory capacity of 100 KB, setting the carveout to 50% will result in 64 KB of shared memory, not 50 KB, because devices of compute capability 12.0 support shared memory sizes of 0, 8, 16, 32, 64, and 100 KB.
The function passed to cudaFuncSetAttribute must be declared with the __global__ specifier. cudaFuncSetAttribute is interpreted by the driver as a hint, and the driver may choose a different carveout size if required to execute the kernel.

Note

Another CUDA API, cudaFuncSetCacheConfig, also allows an application to adjust the balance between L1 and shared memory for a kernel. However, this API sets a hard requirement on the shared/L1 balance for kernel launch. As a result, interleaving kernels with different shared memory configurations would needlessly serialize launches behind shared memory reconfigurations. cudaFuncSetAttribute is preferred because the driver may choose a different configuration if required to execute the function or to avoid thrashing.
Kernels relying on shared memory allocations over 48 KB per block are architecture-specific. As such they must use dynamic shared memory rather than statically-sized arrays and require an explicit opt-in using cudaFuncSetAttribute as follows.
// Device code
__global__ void MyKernel(...)
{
    extern __shared__ float buffer[];
    ...
}

// Host code
int maxbytes = 98304; // 96 KB
cudaFuncSetAttribute(MyKernel, cudaFuncAttributeMaxDynamicSharedMemorySize, maxbytes);
MyKernel<<<gridDim, blockDim, maxbytes>>>(...);
3.3. The CUDA Driver API
Previous sections of this guide have covered the CUDA runtime. As mentioned in CUDA Runtime API and CUDA Driver API, the CUDA runtime is written on top of the lower-level CUDA driver API. This section covers some of the differences between the CUDA runtime and the driver APIs, as well as how to intermix them. Most applications can operate at full performance without ever needing to interact with the CUDA driver API. However, new interfaces are sometimes available in the driver API earlier than in the runtime API, and some advanced interfaces, such as Virtual Memory Management, are only exposed in the driver API.

The driver API is implemented in the cuda dynamic library (cuda.dll or cuda.so) which is copied on the system during the installation of the device driver. All its entry points are prefixed with cu.

It is a handle-based, imperative API: most objects are referenced by opaque handles that may be specified to functions to manipulate the objects.
The objects available in the driver API are summarized in Table 6.
Table 6: Objects Available in the CUDA Driver API

| Object            | Handle      | Description |
| ----------------- | ----------- | ----------- |
| Device            | CUdevice    | CUDA-enabled device |
| Context           | CUcontext   | Roughly equivalent to a CPU process |
| Module            | CUmodule    | Roughly equivalent to a dynamic library |
| Function          | CUfunction  | Kernel |
| Heap memory       | CUdeviceptr | Pointer to device memory |
| CUDA array        | CUarray     | Opaque container for one-dimensional or two-dimensional data on the device, readable via texture or surface references |
| Texture reference | CUtexref    | Object that describes how to interpret texture memory data |
| Surface reference | CUsurfref   | Object that describes how to read or write CUDA arrays |
| Stream            | CUstream    | Object that describes a CUDA stream |
| Event             | CUevent     | Object that describes a CUDA event |
The driver API must be initialized with cuInit() before any function from the driver API is called. A CUDA context must then be created that is attached to a specific device and made current to the calling host thread as detailed in Context.

Within a CUDA context, kernels are explicitly loaded as PTX or binary objects by the host code as described in Module. Kernels written in C++ must therefore be compiled separately into PTX or binary objects. Kernels are launched using API entry points as described in Kernel Execution.

Any application that wants to run on future device architectures must load PTX, not binary code. This is because binary code is architecture-specific and therefore incompatible with future architectures, whereas PTX code is compiled to binary code at load time by the device driver.

Here is the host code of the sample from Kernels written using the driver API:
int main()
{
    int N = ...;
    size_t size = N * sizeof(float);

    // Allocate input vectors h_A and h_B in host memory
    float* h_A = (float*)malloc(size);
    float* h_B = (float*)malloc(size);

    // Initialize input vectors
    ...

    // Initialize
    cuInit(0);

    // Get number of devices supporting CUDA
    int deviceCount = 0;
    cuDeviceGetCount(&deviceCount);
    if (deviceCount == 0) {
        printf("There is no device supporting CUDA.\n");
        exit (0);
    }

    // Get handle for device 0
    CUdevice cuDevice;
    cuDeviceGet(&cuDevice, 0);

    // Create context
    CUcontext cuContext;
    cuCtxCreate(&cuContext, 0, cuDevice);

    // Create module from binary file
    CUmodule cuModule;
    cuModuleLoad(&cuModule, "VecAdd.ptx");

    // Allocate vectors in device memory
    CUdeviceptr d_A;
    cuMemAlloc(&d_A, size);
    CUdeviceptr d_B;
    cuMemAlloc(&d_B, size);
    CUdeviceptr d_C;
    cuMemAlloc(&d_C, size);

    // Copy vectors from host memory to device memory
    cuMemcpyHtoD(d_A, h_A, size);
    cuMemcpyHtoD(d_B, h_B, size);

    // Get function handle from module
    CUfunction vecAdd;
    cuModuleGetFunction(&vecAdd, cuModule, "VecAdd");

    // Invoke kernel
    int threadsPerBlock = 256;
    int blocksPerGrid =
            (N + threadsPerBlock - 1) / threadsPerBlock;
    void* args[] = { &d_A, &d_B, &d_C, &N };
    cuLaunchKernel(vecAdd,
                   blocksPerGrid, 1, 1, threadsPerBlock, 1, 1,
                   0, 0, args, 0);

    ...
}
Full code can be found in the vectorAddDrv CUDA sample.
3.3.1. Context
A CUDA context is analogous to a CPU process. All resources and actions performed within the driver API are encapsulated inside a CUDA context, and the system automatically cleans up these resources when the context is destroyed. Besides objects such as modules and texture or surface references, each context has its own distinct address space. As a result, CUdeviceptr values from different contexts reference different memory locations.

A host thread may have only one device context current at a time. When a context is created with cuCtxCreate(), it is made current to the calling host thread. CUDA functions that operate in a context (most functions that do not involve device enumeration or context management) will return CUDA_ERROR_INVALID_CONTEXT if a valid context is not current to the thread.

Each host thread has a stack of current contexts. cuCtxCreate() pushes the new context onto the top of the stack. cuCtxPopCurrent() may be called to detach the context from the host thread. The context is then "floating" and may be pushed as the current context for any host thread. cuCtxPopCurrent() also restores the previous current context, if any.

A usage count is also maintained for each context. cuCtxCreate() creates a context with a usage count of 1. cuCtxAttach() increments the usage count and cuCtxDetach() decrements it. A context is destroyed when the usage count goes to 0 when calling cuCtxDetach() or cuCtxDestroy().
The driver API is interoperable with the runtime and it is possible to access the primary context (see Runtime Initialization) managed by the runtime from the driver API via cuDevicePrimaryCtxRetain().

The usage count facilitates interoperability between third-party authored code operating in the same context. For example, if three libraries are loaded to use the same context, each library would call cuCtxAttach() to increment the usage count and cuCtxDetach() to decrement the usage count when the library is done using the context. For most libraries, it is expected that the application will have created a context before loading or initializing the library; that way, the application can create the context using its own heuristics, and the library simply operates on the context handed to it. Libraries that wish to create their own contexts - unbeknownst to their API clients who may or may not have created contexts of their own - would use cuCtxPushCurrent() and cuCtxPopCurrent() as illustrated in the following figure.

Figure 20: Library Context Management
3.3.2. Module
Modules are dynamically loadable packages of device code and data, akin to DLLs in Windows, that are output by nvcc (see Compilation with NVCC). The names for all symbols, including functions, global variables, and texture or surface references, are maintained at module scope so that modules written by independent third parties may interoperate in the same CUDA context.

This code sample loads a module and retrieves a handle to some kernel:
CUmodule cuModule;
cuModuleLoad(&cuModule, "myModule.ptx");
CUfunction myKernel;
cuModuleGetFunction(&myKernel, cuModule, "MyKernel");
This code sample compiles and loads a new module from PTX code and parses compilation errors:
#define BUFFER_SIZE 8192
CUmodule cuModule;
CUjit_option options[3];
void* values[3];
char* PTXCode = "some PTX code";
char error_log[BUFFER_SIZE];
int err;
options[0] = CU_JIT_ERROR_LOG_BUFFER;
values[0] = (void*)error_log;
options[1] = CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES;
values[1] = (void*)BUFFER_SIZE;
options[2] = CU_JIT_TARGET_FROM_CUCONTEXT;
values[2] = 0;
err = cuModuleLoadDataEx(&cuModule, PTXCode, 3, options, values);
if (err != CUDA_SUCCESS)
    printf("Link error:\n%s\n", error_log);
This code sample compiles, links, and loads a new module from multiple PTX codes and parses link and compilation errors:
#define BUFFER_SIZE 8192
CUmodule cuModule;
CUjit_option options[6];
void* values[6];
float walltime;
char error_log[BUFFER_SIZE], info_log[BUFFER_SIZE];
char* PTXCode0 = "some PTX code";
char* PTXCode1 = "some other PTX code";
CUlinkState linkState;
int err;
void* cubin;
size_t cubinSize;
options[0] = CU_JIT_WALL_TIME;
values[0] = (void*)&walltime;
options[1] = CU_JIT_INFO_LOG_BUFFER;
values[1] = (void*)info_log;
options[2] = CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES;
values[2] = (void*)BUFFER_SIZE;
options[3] = CU_JIT_ERROR_LOG_BUFFER;
values[3] = (void*)error_log;
options[4] = CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES;
values[4] = (void*)BUFFER_SIZE;
options[5] = CU_JIT_LOG_VERBOSE;
values[5] = (void*)1;
cuLinkCreate(6, options, values, &linkState);
err = cuLinkAddData(linkState, CU_JIT_INPUT_PTX,
                    (void*)PTXCode0, strlen(PTXCode0) + 1, 0, 0, 0, 0);
if (err != CUDA_SUCCESS)
    printf("Link error:\n%s\n", error_log);
err = cuLinkAddData(linkState, CU_JIT_INPUT_PTX,
                    (void*)PTXCode1, strlen(PTXCode1) + 1, 0, 0, 0, 0);
if (err != CUDA_SUCCESS)
    printf("Link error:\n%s\n", error_log);
cuLinkComplete(linkState, &cubin, &cubinSize);
printf("Link completed in %fms. Linker Output:\n%s\n", walltime, info_log);
cuModuleLoadData(&cuModule, cubin);
cuLinkDestroy(linkState);
It's possible to accelerate some parts of the module linking/loading process by using multiple threads, including when loading a cubin. This code sample uses CU_JIT_BINARY_LOADER_THREAD_COUNT to speed up module loading.
#define BUFFER_SIZE 8192
CUmodule cuModule;
CUjit_option options[3];
void* values[3];
char* cubinCode = "some cubin code";
char error_log[BUFFER_SIZE];
int err;
options[0] = CU_JIT_ERROR_LOG_BUFFER;
values[0] = (void*)error_log;
options[1] = CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES;
values[1] = (void*)BUFFER_SIZE;
options[2] = CU_JIT_BINARY_LOADER_THREAD_COUNT;
values[2] = 0; // Use as many threads as CPUs on the machine
err = cuModuleLoadDataEx(&cuModule, cubinCode, 3, options, values);
if (err != CUDA_SUCCESS)
    printf("Link error:\n%s\n", error_log);
Full code can be found in the ptxjit CUDA sample.
3.3.3. Kernel Execution
cuLaunchKernel() launches a kernel with a given execution configuration.

Parameters are passed either as an array of pointers (next to last parameter of cuLaunchKernel()) where the nth pointer corresponds to the nth parameter and points to a region of memory from which the parameter is copied, or as one of the extra options (last parameter of cuLaunchKernel()).

When parameters are passed as an extra option (the CU_LAUNCH_PARAM_BUFFER_POINTER option), they are passed as a pointer to a single buffer where parameters are assumed to be properly offset with respect to each other by matching the alignment requirement for each parameter type in device code.

Alignment requirements in device code for the built-in vector types are listed in Table 42. For all other basic types, the alignment requirement in device code matches the alignment requirement in host code and can therefore be obtained using __alignof(). The only exception is when the host compiler aligns double and long long (and long on a 64-bit system) on a one-word boundary instead of a two-word boundary (for example, using gcc's compilation flag -mno-align-double) since in device code these types are always aligned on a two-word boundary.

CUdeviceptr is an integer, but represents a pointer, so its alignment requirement is __alignof(void*).

The following code sample uses a macro (ALIGN_UP()) to adjust the offset of each parameter to meet its alignment requirement and another macro (ADD_TO_PARAM_BUFFER()) to add each parameter to the parameter buffer passed to the CU_LAUNCH_PARAM_BUFFER_POINTER option.
#define ALIGN_UP(offset, alignment) \
    (offset) = ((offset) + (alignment) - 1) & ~((alignment) - 1)

char paramBuffer[1024];
size_t paramBufferSize = 0;

#define ADD_TO_PARAM_BUFFER(value, alignment)                   \
    do {                                                        \
        paramBufferSize = ALIGN_UP(paramBufferSize, alignment); \
        memcpy(paramBuffer + paramBufferSize,                   \
               &(value), sizeof(value));                        \
        paramBufferSize += sizeof(value);                       \
    } while (0)

int i;
ADD_TO_PARAM_BUFFER(i, __alignof(i));
float4 f4;
ADD_TO_PARAM_BUFFER(f4, 16); // float4's alignment is 16
char c;
ADD_TO_PARAM_BUFFER(c, __alignof(c));
float f;
ADD_TO_PARAM_BUFFER(f, __alignof(f));
CUdeviceptr devPtr;
ADD_TO_PARAM_BUFFER(devPtr, __alignof(devPtr));
float2 f2;
ADD_TO_PARAM_BUFFER(f2, 8); // float2's alignment is 8

void* extra[] = {
    CU_LAUNCH_PARAM_BUFFER_POINTER, paramBuffer,
    CU_LAUNCH_PARAM_BUFFER_SIZE,    &paramBufferSize,
    CU_LAUNCH_PARAM_END
};
cuLaunchKernel(cuFunction,
               blockWidth, blockHeight, blockDepth,
               gridWidth, gridHeight, gridDepth,
               0, 0, 0, extra);
The alignment requirement of a structure is equal to the maximum of the alignment requirements of its fields. The alignment requirement of a structure that contains built-in vector types, CUdeviceptr, or non-aligned double and long long, might therefore differ between device code and host code. Such a structure might also be padded differently. The following structure, for example, is not padded at all in host code, but it is padded in device code with 12 bytes after field f since the alignment requirement for field f4 is 16.
typedef struct {
    float f;
    float4 f4;
} myStruct;
3.3.4. Interoperability between Runtime and Driver APIs
An application can mix runtime API code with driver API code.

If a context is created and made current via the driver API, subsequent runtime calls will use this context instead of creating a new one.

If the runtime is initialized, cuCtxGetCurrent() can be used to retrieve the context created during initialization. This context can be used by subsequent driver API calls.

The implicitly created context from the runtime is called the primary context (see Runtime Initialization). It can be managed from the driver API with the Primary Context Management functions.

Device memory can be allocated and freed using either API. CUdeviceptr can be cast to regular pointers and vice-versa:
CUdeviceptr devPtr;
float* d_data;

// Allocation using driver API
cuMemAlloc(&devPtr, size);
d_data = (float*)devPtr;

// Allocation using runtime API
cudaMalloc(&d_data, size);
devPtr = (CUdeviceptr)d_data;
In particular, this means that applications written using the driver API can invoke libraries written using the runtime API (such as cuFFT, cuBLAS, ...).

All functions from the device and version management sections of the reference manual can be used interchangeably.
3.4. Programming Systems with Multiple GPUs
Multi-GPU programming allows an application to address problem sizes and achieve performance levels beyond what is possible with a single GPU by exploiting the larger aggregate arithmetic performance, memory capacity, and memory bandwidth provided by multi-GPU systems.

CUDA enables multi-GPU programming through host APIs, driver infrastructure, and supporting GPU hardware technologies:

▶ Host thread CUDA context management
▶ Unified memory addressing for all processors in the system
▶ Peer-to-peer bulk memory transfers between GPUs
▶ Fine-grained peer-to-peer GPU load/store memory access
▶ Higher level abstractions and supporting system software such as CUDA interprocess communication, parallel reductions using NCCL, and communication using NVLink and/or GPU-Direct RDMA with APIs such as NVSHMEM and MPI

At the most basic level, multi-GPU programming requires the application to manage multiple active CUDA contexts concurrently, distribute data to the GPUs, launch kernels on the GPUs to complete their work, and to communicate or collect the results so that they can be acted upon by the application. The details of how this is done differ depending on the most effective mapping of an application's algorithms, available parallelism, and existing code structure to a suitable multi-GPU programming approach. Some of the most common multi-GPU programming approaches include:

▶ A single host thread driving multiple GPUs
▶ Multiple host threads, each driving their own GPU
▶ Multiple single-threaded host processes, each driving their own GPU
▶ Multiple host processes containing multiple threads, each driving their own GPU
▶ Multi-node NVLink-connected clusters, with GPUs driven by threads and processes running within multiple operating system instances across the cluster nodes

GPUs can communicate with each other through memory transfers and peer accesses between device memories, covering each of the multi-device work distribution approaches listed above. High performance, low-latency GPU communications are supported by querying for and enabling the use of peer-to-peer GPU memory access, and leveraging NVLink to achieve high bandwidth transfers and finer-grained load/store operations between devices.

CUDA unified virtual addressing permits communication between multiple GPUs within the same host process with minimal additional steps to query and enable the use of high performance peer-to-peer memory access and transfers, e.g., via NVLink.

Communication between multiple GPUs managed by different host processes is supported through the use of interprocess communication (IPC) and Virtual Memory Management (VMM) APIs. An introduction to high level IPC concepts and intra-node CUDA IPC APIs are discussed in the Interprocess Communication section. Advanced Virtual Memory Management (VMM) APIs support both intra-node and multi-node IPC, are usable on both Linux and Windows operating systems, and allow per-allocation granularity control over IPC sharing of memory buffers as described in Virtual Memory Management.

CUDA itself provides the APIs needed to implement collective operations within a group of GPUs, potentially including the host, but it does not provide high level multi-GPU collective APIs itself. Multi-GPU collectives are provided by higher abstraction CUDA communication libraries such as NCCL and NVSHMEM.
3.4.1. Multi-Device Context and Execution Management
The first steps required for an application to use multiple GPUs are to enumerate the available
GPU devices, select among the available devices as appropriate based on their hardware properties,
CPU affinity, and connectivity to peers, and to create CUDA contexts for each device that the
application will use.
126 Chapter 3. Advanced CUDA
CUDA Programming Guide, Release 13.1
3.4.1.1 Device Enumeration
The following code sample shows how to query the number of CUDA-enabled devices, enumerate each of
the devices, and query their properties.
int deviceCount;
cudaGetDeviceCount(&deviceCount);
int device;
for (device = 0; device < deviceCount; ++device) {
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, device);
    printf("Device %d has compute capability %d.%d.\n",
           device, deviceProp.major, deviceProp.minor);
}
3.4.1.2 Device Selection
A host thread can set the device it is currently operating on at any time by calling cudaSetDevice().
Device memory allocations and kernel launches are made on the current device; streams and events
are created in association with the currently set device. Until a call to cudaSetDevice() is made by
the host thread, the current device defaults to device 0.
The following code sample illustrates how setting the current device affects subsequent memory
allocation and kernel execution operations.
size_t size = 1024 * sizeof(float);
cudaSetDevice(0);             // Set device 0 as current
float* p0;
cudaMalloc(&p0, size);        // Allocate memory on device 0
MyKernel<<<1000, 128>>>(p0);  // Launch kernel on device 0
cudaSetDevice(1);             // Set device 1 as current
float* p1;
cudaMalloc(&p1, size);        // Allocate memory on device 1
MyKernel<<<1000, 128>>>(p1);  // Launch kernel on device 1
3.4.1.3 Multi-Device Stream, Event, and Memory Copy Behavior
A kernel launch will fail if it is issued to a stream that is not associated to the current device as
illustrated in the following code sample.
cudaSetDevice(0);               // Set device 0 as current
cudaStream_t s0;
cudaStreamCreate(&s0);          // Create stream s0 on device 0
MyKernel<<<100, 64, 0, s0>>>(); // Launch kernel on device 0 in s0
cudaSetDevice(1);               // Set device 1 as current
cudaStream_t s1;
cudaStreamCreate(&s1);          // Create stream s1 on device 1
MyKernel<<<100, 64, 0, s1>>>(); // Launch kernel on device 1 in s1

// This kernel launch will fail, since stream s0 is not associated to device 1:
MyKernel<<<100, 64, 0, s0>>>(); // Launch kernel on device 1 in s0
A memory copy will succeed even if it is issued to a stream that is not associated to the current device.
cudaEventRecord() will fail if the input event and input stream are associated to different devices.
cudaEventElapsedTime() will fail if the two input events are associated to different devices.
cudaEventSynchronize() and cudaEventQuery() will succeed even if the input event is associated
to a device that is different from the current device.
cudaStreamWaitEvent() will succeed even if the input stream and input event are associated to
different devices. cudaStreamWaitEvent() can therefore be used to synchronize multiple devices
with each other.
Each device has its own default stream, so commands issued to the default stream of a device may
execute out of order or concurrently with respect to commands issued to the default stream of any
other device.
3.4.2. Multi-Device Peer-to-Peer Transfers and Memory Access
3.4.2.1 Peer-to-Peer Memory Transfers
CUDA can perform memory transfers between devices and will take advantage of dedicated copy
engines and NVLink hardware to maximize performance when peer-to-peer memory access is possible.
cudaMemcpy can be used with the copy type cudaMemcpyDeviceToDevice or cudaMemcpyDefault.
Otherwise, copies must be performed using cudaMemcpyPeer(), cudaMemcpyPeerAsync(),
cudaMemcpy3DPeer(), or cudaMemcpy3DPeerAsync() as illustrated in the following code sample.
cudaSetDevice(0);                    // Set device 0 as current
float* p0;
size_t size = 1024 * sizeof(float);
cudaMalloc(&p0, size);               // Allocate memory on device 0
cudaSetDevice(1);                    // Set device 1 as current
float* p1;
cudaMalloc(&p1, size);               // Allocate memory on device 1
cudaSetDevice(0);                    // Set device 0 as current
MyKernel<<<1000, 128>>>(p0);         // Launch kernel on device 0
cudaSetDevice(1);                    // Set device 1 as current
cudaMemcpyPeer(p1, 1, p0, 0, size);  // Copy p0 to p1
MyKernel<<<1000, 128>>>(p1);         // Launch kernel on device 1
A copy (in the implicit NULL stream) between the memories of two different devices:
▶ does not start until all commands previously issued to either device have completed, and
▶ runs to completion before any commands (see Asynchronous Execution) issued after the copy to
either device can start.
Consistent with the normal behavior of streams, an asynchronous copy between the memories of two
devices may overlap with copies or kernels in another stream.
If peer-to-peer access is enabled between two devices, e.g., as described in Peer-to-Peer Memory
Access, peer-to-peer memory copies between these two devices no longer need to be staged through
the host and are therefore faster.
3.4.2.2 Peer-to-Peer Memory Access
Depending on the system properties, specifically the PCIe and/or NVLink topology, devices are able
to address each other's memory (i.e., a kernel executing on one device can dereference a pointer to
the memory of the other device). Peer-to-peer memory access is supported between two devices if
cudaDeviceCanAccessPeer() returns true for the specified devices.
Peer-to-peer memory access must be enabled between two devices by calling
cudaDeviceEnablePeerAccess() as illustrated in the following code sample. On non-NVSwitch
enabled systems, each device can support a system-wide maximum of eight peer connections.
A unified virtual address space is used for both devices (see Unified Virtual Address Space), so the
same pointer can be used to address memory from both devices as shown in the code sample below.
cudaSetDevice(0);                  // Set device 0 as current
float* p0;
size_t size = 1024 * sizeof(float);
cudaMalloc(&p0, size);             // Allocate memory on device 0
MyKernel<<<1000, 128>>>(p0);       // Launch kernel on device 0
cudaSetDevice(1);                  // Set device 1 as current
cudaDeviceEnablePeerAccess(0, 0);  // Enable peer-to-peer access
                                   // with device 0

// Launch kernel on device 1
// This kernel launch can access memory on device 0 at address p0
MyKernel<<<1000, 128>>>(p0);
Note
The use of cudaDeviceEnablePeerAccess() to enable peer memory access operates globally
on all previous and subsequent GPU memory allocations on the peer device. Enabling peer access
to a device via cudaDeviceEnablePeerAccess() adds runtime cost to device memory allocation
operations on that peer due to the need to make the allocations immediately accessible to the current
device and any other peers that also have access, adding multiplicative overhead that scales with
the number of peer devices.
A more scalable alternative to enabling peer memory access for all device memory allocations is to
make use of CUDA Virtual Memory Management APIs to explicitly allocate peer-accessible memory
regions only as needed, at allocation time. By requesting peer-accessibility explicitly during
memory allocation, the runtime cost of memory allocations is unaffected for allocations not accessible
to peers, and peer-accessible data structures are correctly scoped for improved software
debugging and reliability (see Virtual Memory Management).
3.4.2.3 Peer-to-Peer Memory Consistency
Synchronization operations must be used to enforce the ordering and correctness of memory accesses
by concurrently executing threads in grids distributed across multiple devices. Threads synchronizing
across devices operate at the thread_scope_system synchronization scope. Similarly,
memory operations fall within the thread_scope_system memory synchronization domain.
CUDA atomic functions (see Atomic Functions) can perform read-modify-write operations on an
object in peer device memory when only a single GPU is accessing that object. The requirements and
limitations for peer atomicity are described in the CUDA memory model atomicity requirements
discussion.
3.4.2.4 Multi-Device Managed Memory
Managed memory can be used on multi-GPU systems with peer-to-peer support. The detailed
requirements for concurrent multi-device managed memory access and APIs for GPU-exclusive access
to managed memory are described in Multi-GPU.
3.4.2.5 Host IOMMU Hardware, PCI Access Control Services, and VMs
On Linux specifically, CUDA and the display driver do not support IOMMU-enabled bare-metal PCIe
peer-to-peer memory transfer. However, CUDA and the display driver do support IOMMU via virtual
machine pass through. The IOMMU must be disabled when running Linux on a bare metal system
to prevent silent device memory corruption. Conversely, the IOMMU should be enabled and the VFIO
driver be used for PCIe pass through for virtual machines.
On Windows the IOMMU limitation above does not exist.
See also Allocating DMA Buffers on 64-bit Platforms.
Additionally, PCI Access Control Services (ACS) can be enabled on systems that support IOMMU. The
PCI ACS feature redirects all PCI point-to-point traffic through the CPU root complex, which can cause
significant performance loss due to the reduction in overall bisection bandwidth.
3.5. A Tour of CUDA Features
Sections 1-3 of this programming guide have introduced CUDA and GPU programming, covering
foundational topics both conceptually and in simple code examples. The sections describing specific
CUDA features in part 4 of this guide assume knowledge of the concepts covered in sections 1-3 of
this guide.
CUDA has many features which apply to different problems. Not all of them will be applicable to every
use case. This chapter serves to introduce each of these features and describe its intended use and
the problems it may help solve. Features are coarsely sorted into categories by the type of problem
they are intended to solve. Some features, such as CUDA graphs, could fit into more than one category.
Section 4 covers these CUDA features in more complete detail.
3.5.1. Improving Kernel Performance
The features outlined in this section are all intended to aid kernel developers to maximize the
performance of their kernels.
3.5.1.1 Asynchronous Barriers
Asynchronous barriers were introduced in Section 3.2.4.2 and allow for more nuanced control over
synchronization between threads. Asynchronous barriers separate the arrival and the wait of a barrier.
This allows applications to perform work that does not depend on the barrier while waiting for other
threads to arrive. Asynchronous barriers can be specified for different thread scopes. Full details of
asynchronous barriers are found in Section 4.9.
3.5.1.2 Asynchronous Data Copies and the Tensor Memory Accelerator (TMA)
Asynchronous data copies in the context of CUDA kernel code refers to the ability to move data
between shared memory and GPU DRAM while still carrying out computations. This should not be
confused with asynchronous memory copies between the CPU and GPU. This feature makes use of
asynchronous barriers. Section 4.11 covers the use of asynchronous copies in detail.
3.5.1.3 Pipelines
Pipelines are a mechanism for staging work and coordinating multi-buffer producer–consumer
patterns, commonly used to overlap compute with asynchronous data copies. Section 4.10 has details
and examples of using pipelines in CUDA.
3.5.1.4 Work Stealing with Cluster Launch Control
Work stealing is a technique for maintaining utilization in uneven workloads where workers that have
completed their work can 'steal' tasks from other workers. Cluster launch control, a feature introduced
in compute capability 10.0 (Blackwell), gives kernels direct control over in-flight block scheduling so
they can adapt to uneven workloads in real time. A thread block can cancel the launch of another
thread block or cluster that has not yet started, claim its index, and immediately begin executing the
stolen work. This work-stealing flow keeps SMs busy and cuts idle time under irregular data or runtime
variation, delivering finer-grained load balancing without relying on the hardware scheduler alone.
Section 4.12 provides details on how to use this feature.
3.5.2. Improving Latencies
The features outlined in this section share a common theme of aiming to reduce some type of latency,
though the type of latency being addressed differs between the different features. By and large they
are focused on latencies at the kernel launch level or higher. GPU memory access latency within a
kernel is not one of the latencies considered here.
3.5.2.1 Green Contexts
Green contexts, also called execution contexts, is the name given to a CUDA feature which enables a
program to create CUDA contexts which will execute work only on a subset of the SMs of a GPU. By
default, the thread blocks of a kernel launch are dispatched to any SM within the GPU which can fulfill
the resource requirements of the kernel. There are a large number of factors which can affect which
SMs can execute a thread block, including but not necessarily limited to: shared memory use, register
use, use of clusters, and total number of threads in the thread block.
Execution contexts allow a kernel to be launched into a specially created context which further limits
the number of SMs available to execute the kernel. Importantly, when a program creates a green
context which uses some set of SMs, other contexts on the GPU will not schedule thread blocks onto
the SMs allocated to the green context. This includes the primary context, which is the default context
used by the CUDA runtime. This allows these SMs to be reserved for workloads which are high priority
or latency-sensitive.
Section 4.6 gives full details on the use of green contexts. Green contexts are available in the CUDA
runtime in CUDA 13.1 and later.
3.5.2.2 Stream-Ordered Memory Allocation
The stream-ordered memory allocator allows programs to sequence allocation and freeing of GPU
memory into a CUDA stream. Unlike cudaMalloc and cudaFree which execute immediately,
cudaMallocAsync and cudaFreeAsync insert a memory allocation or free operation into a CUDA
stream. Section 4.3 covers all the details of these APIs.
3.5.2.3 CUDA Graphs
CUDA graphs enable an application to specify a sequence of CUDA operations such as kernel launches
or memory copies and the dependencies between these operations so that they can be executed
efficiently on the GPU. Similar behavior can be attained by using CUDA streams, and indeed one of the
mechanisms for creating a graph is called stream capture, which enables the operations on a stream
to be recorded into a CUDA graph. Graphs can also be created using the CUDA graphs API.
Once a graph has been created, it can be instantiated and executed many times. This is useful for
specifying workloads that will be repeated. Graphs offer some performance benefits in reducing CPU
launch costs associated with invoking CUDA operations as well as enabling optimizations only available
when the whole workload is specified in advance.
Section 4.2 describes and demonstrates how to use CUDA graphs.
3.5.2.4 Programmatic Dependent Launch
Programmatic dependent launch is a CUDA feature which allows a dependent kernel, i.e. a kernel which
depends on the output of a prior kernel, to begin execution before the primary kernel on which it
depends has completed. The dependent kernel can execute setup code and unrelated work up until
it requires data from the primary kernel and block there. The primary kernel can signal when the
data required by the dependent kernel is ready, which will release the dependent kernel to continue
executing. This enables some overlap between the kernels which can help keep GPU utilization high
while minimizing the latency of the critical data path. Section 4.5 covers programmatic dependent
launch.
3.5.2.5 Lazy Loading
Lazy loading is a feature which allows control over how the JIT compiler operates at application startup.
Applications which have many kernels which need to be JIT compiled from PTX to cubin may experience
long startup times if all kernels are JIT compiled as part of application startup. The default behavior is
that modules are not compiled until they are needed. This can be changed by the use of environment
variables, as detailed in Section 4.7.
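As a sketch, the loading mode is typically selected at launch time through an environment variable; CUDA_MODULE_LOADING is the variable used by recent CUDA releases, and the application name below is hypothetical (Section 4.7 documents the exact semantics):

```shell
# Defer loading of each module until one of its kernels is first used
CUDA_MODULE_LOADING=LAZY ./my_app

# Load (and JIT compile, where needed) all modules eagerly at startup
CUDA_MODULE_LOADING=EAGER ./my_app
```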
3.5.3. Functionality Features
The features described here share a common trait that they are meant to enable additional capabilities
or functionality.
3.5.3.1 Extended GPU Memory
Extended GPU memory is a feature available in NVLink-C2C connected systems that enables efficient
access to all memory within the system from within a GPU. EGM is covered in detail in Section 4.17.
3.5.3.2 Dynamic Parallelism
CUDA applications most commonly launch kernels from code running on the CPU. It is also possible to
create new kernel invocations from a kernel running on the GPU. This feature is referred to as CUDA
dynamic parallelism. Section 4.18 covers the details of creating new GPU kernel launches from code
running on the GPU.
3.5.4. CUDA Interoperability
3.5.4.1 CUDA Interoperability with other APIs
There are other mechanisms than CUDA for running code on GPUs. The application that GPUs were
originally built to accelerate, computer graphics, uses its own set of APIs such as Direct3D and Vulkan.
Applications may wish to use one of the graphics APIs for 3D rendering while performing computations
in CUDA. CUDA provides mechanisms for exchanging data stored on the GPU between the CUDA
contexts and the GPU contexts used by the 3D APIs. For example, an application may perform a
simulation using CUDA, and then use a 3D API to create visualizations of the results. This is achieved by
making some buffers readable and/or writeable from both CUDA and the graphics API.
The same mechanisms which allow sharing of buffers with graphics APIs are also used to share buffers
with communications mechanisms which can enable rapid, direct GPU-to-GPU communication within
multi-node environments.
Section 4.19 describes how CUDA interoperates with other GPU APIs and how to share data between
CUDA and other APIs, providing specific examples for a number of different APIs.
3.5.4.2 Interprocess Communication
For very large computations, it is common to use multiple GPUs together to make use of more memory
and more compute resources working together on a problem. Within a single system, or node in cluster
computing terminology, multiple GPUs can be used in a single host process. This is described in Section
3.4.
It is also common to use separate host processes spanning either a single computer or multiple
computers. When multiple processes are working together, communication between them is known as
interprocess communication. CUDA interprocess communication (CUDA IPC) provides mechanisms to
share GPU buffers between different processes. Section 4.15 explains and demonstrates how CUDA
IPC can be used to coordinate and communicate between different host processes.
3.5.5. Fine-Grained Control
3.5.5.1 Virtual Memory Management
As mentioned in Section 2.4.1, all GPUs in a system, along with the CPU memory, share a single unified
virtual address space. Most applications can use the default memory management provided by CUDA
without the need to change its behavior. However, the CUDA driver API provides advanced and detailed
controls over the layout of this virtual memory space for those that need it. This is mostly applicable
for controlling the behavior of buffers when sharing between GPUs both within and across multiple
systems.
Section 4.16 covers the controls offered by the CUDA driver API, how they work and when a developer
may find them advantageous.
3.5.5.2 Driver Entry Point Access
Driver entry point access refers to the ability, starting in CUDA 11.3, to retrieve function pointers to
the CUDA Driver and CUDA Runtime APIs. It also allows developers to retrieve function pointers for
specific variants of driver functions, and to access driver functions from drivers newer than those
available in the CUDA toolkit. Section 4.20 covers driver entry point access.
3.5.5.3 Error Log Management
Error log management provides utilities for handling and logging errors from CUDA APIs. Setting a
single environment variable CUDA_LOG_FILE enables capturing CUDA errors directly to stderr, stdout,
or a file. Error log management also enables applications to register a callback which is triggered when
CUDA encounters an error. Section 4.8 provides more details on error log management.
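As a minimal sketch of the environment-variable approach described above (the application name and log path are hypothetical; Section 4.8 documents the exact behavior):

```shell
# Capture CUDA errors on the standard error stream
CUDA_LOG_FILE=stderr ./my_app

# Or capture them to a file instead
CUDA_LOG_FILE=/tmp/cuda_errors.log ./my_app
```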
Chapter 4. CUDA Features
4.1. Unified Memory
This section explains the detailed behavior and use of each of the different paradigms of unified
memory available. The earlier section on unified memory showed how to determine which unified
memory paradigm applies and briefly introduced each.
As discussed previously, there are four paradigms of unified memory programming:
▶ Full support for explicit managed memory allocations
▶ Full support for all allocations with software coherence
▶ Full support for all allocations with hardware coherence
▶ Limited unified memory support
The first three paradigms involving full unified memory support have very similar behavior and
programming model and are covered in Unified Memory on Devices with Full CUDA Unified Memory
Support with any differences highlighted.
The last paradigm, where unified memory support is limited, is discussed in detail in Unified Memory
on Windows, WSL, and Tegra.
4.1.1. Unified Memory on Devices with Full CUDA Unified
Memory Support
These systems include hardware-coherent memory systems, such as NVIDIA Grace Hopper, and
modern Linux systems with Heterogeneous Memory Management (HMM) enabled. HMM is a
software-based memory management system, providing the same programming model as
hardware-coherent memory systems.
Linux HMM requires Linux kernel version 6.1.24+, 6.2.11+ or 6.3+, devices with compute capability 7.5
or higher and a CUDA driver version 535+ installed with Open Kernel Modules.
Note
We refer to systems with a combined page table for both CPUs and GPUs as hardware-coherent
systems. Systems with separate page tables for CPUs and GPUs are referred to as software-coherent.
Hardware-coherent systems such as NVIDIA Grace Hopper offer a logically combined page table for
both CPUs and GPUs, see CPU and GPU Page Tables: Hardware Coherency vs. Software Coherency. The
following section only applies to hardware-coherent systems:
▶ Access Counter Migration
4.1.1.1 Unified Memory: In-Depth Examples
Systems with full CUDA unified memory support, see table Overview of Unified Memory Paradigms,
allow the device to access any memory owned by the host process interacting with the device.
This section shows a few advanced use-cases, using a kernel that simply prints the first 8 characters
of an input character array to the standard output stream:
__global__ void kernel(const char* type, const char* data) {
    static const int n_char = 8;
    printf("%s - first %d characters: '", type, n_char);
    for (int i = 0; i < n_char; ++i) printf("%c", data[i]);
    printf("'\n");
}
The following tabs show various ways this kernel may be called with system-allocated memory:
Malloc
void test_malloc() {
    const char test_string[] = "Hello World";
    char* heap_data = (char*)malloc(sizeof(test_string));
    strncpy(heap_data, test_string, sizeof(test_string));
    kernel<<<1, 1>>>("malloc", heap_data);
    ASSERT(cudaDeviceSynchronize() == cudaSuccess,
           "CUDA failed with '%s'", cudaGetErrorString(cudaGetLastError()));
    free(heap_data);
}
Managed
void test_managed() {
    const char test_string[] = "Hello World";
    char* data;
    cudaMallocManaged(&data, sizeof(test_string));
    strncpy(data, test_string, sizeof(test_string));
    kernel<<<1, 1>>>("managed", data);
    ASSERT(cudaDeviceSynchronize() == cudaSuccess,
           "CUDA failed with '%s'", cudaGetErrorString(cudaGetLastError()));
    cudaFree(data);
}
Stack variable
void test_stack() {
    const char test_string[] = "Hello World";
    kernel<<<1, 1>>>("stack", test_string);
    ASSERT(cudaDeviceSynchronize() == cudaSuccess,
           "CUDA failed with '%s'", cudaGetErrorString(cudaGetLastError()));
}
File-scope static variable
void test_static() {
    static const char test_string[] = "Hello World";
    kernel<<<1, 1>>>("static", test_string);
    ASSERT(cudaDeviceSynchronize() == cudaSuccess,
           "CUDA failed with '%s'", cudaGetErrorString(cudaGetLastError()));
}
Global-scope variable
const char global_string[] = "Hello World";

void test_global() {
    kernel<<<1, 1>>>("global", global_string);
    ASSERT(cudaDeviceSynchronize() == cudaSuccess,
           "CUDA failed with '%s'", cudaGetErrorString(cudaGetLastError()));
}
Global-scope extern variable
// declared in separate file, see below
extern char* ext_data;

void test_extern() {
    kernel<<<1, 1>>>("extern", ext_data);
    ASSERT(cudaDeviceSynchronize() == cudaSuccess,
           "CUDA failed with '%s'", cudaGetErrorString(cudaGetLastError()));
}

/** This may be a non-CUDA file */
char* ext_data;

static const char global_string[] = "Hello World";

void __attribute__ ((constructor)) setup(void) {
    ext_data = (char*)malloc(sizeof(global_string));
    strncpy(ext_data, global_string, sizeof(global_string));
}

void __attribute__ ((destructor)) tear_down(void) {
    free(ext_data);
}
Note that for the extern variable, it could be declared and its memory owned and managed by a
third-party library, which does not interact with CUDA at all.
Also note that stack variables as well as file-scope and global-scope variables can only be accessed
through a pointer by the GPU. In this specific example, this is convenient because the character array
is already declared as a pointer: const char*. However, consider the following example with a
global-scope integer:
// this variable is declared at global scope
int global_variable;

__global__ void kernel_uncompilable() {
    // this causes a compilation error: global (__host__) variables must not
    // be accessed from __device__ / __global__ code
    printf("%d\n", global_variable);
}

// On systems with pageableMemoryAccess set to 1, we can access the address
// of a global variable. The below kernel takes that address as an argument
__global__ void kernel(int* global_variable_addr) {
    printf("%d\n", *global_variable_addr);
}

int main() {
    kernel<<<1, 1>>>(&global_variable);
    ...
    return 0;
}
In the example above, we must pass a pointer to the global variable to the kernel instead
of directly accessing the global variable in the kernel. This is because global variables without the
__managed__ specifier are declared as __host__-only by default, thus most compilers won't allow
using these variables directly in device code as of now.
4.1.1.1.1 File-backed Unified Memory
Since systems with full CUDA unified memory support allow the device to access any memory owned
by the host process, they can directly access file-backed memory.
Here, we show a modified version of the initial example shown in the previous section to use
file-backed memory in order to print a string from the GPU, read directly from an input file. In the
following example, the memory is backed by a physical file, but the example applies to memory-backed
files too.
__global__ void kernel(const char* type, const char* data) {
    static const int n_char = 8;
    printf("%s - first %d characters: '", type, n_char);
    for (int i = 0; i < n_char; ++i) printf("%c", data[i]);
    printf("'\n");
}

void test_file_backed() {
    int fd = open(INPUT_FILE_NAME, O_RDONLY);
    ASSERT(fd >= 0, "Invalid file handle");
    struct stat file_stat;
    int status = fstat(fd, &file_stat);
    ASSERT(status >= 0, "Invalid file stats");
    char* mapped = (char*)mmap(0, file_stat.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    ASSERT(mapped != MAP_FAILED, "Cannot map file into memory");
    kernel<<<1, 1>>>("file-backed", mapped);
    ASSERT(cudaDeviceSynchronize() == cudaSuccess,
           "CUDA failed with '%s'", cudaGetErrorString(cudaGetLastError()));
    ASSERT(munmap(mapped, file_stat.st_size) == 0, "Cannot unmap file");
    ASSERT(close(fd) == 0, "Cannot close file");
}
Note that on systems without the hostNativeAtomicSupported property (see Host Native Atomics), including systems with Linux HMM enabled, atomic accesses to file-backed memory are not supported.
4.1.1.1.2 Inter-Process Communication (IPC) with Unified Memory
Note
As of now, using IPC with unified memory can have significant performance implications.
Many applications prefer to manage one GPU per process, but still need to use unified memory, for example for over-subscription, and access it from multiple GPUs.
CUDA IPC (see Interprocess Communication) does not support managed memory: handles to this type of memory may not be shared through any of the mechanisms discussed in this section. On systems with full CUDA unified memory support, system-allocated memory is IPC capable. Once access to system-allocated memory has been shared with other processes, the same programming model applies, similar to File-backed Unified Memory.
See the following references for more information on various ways of creating IPC-capable system-allocated memory under Linux:
▶ mmap with MAP_SHARED
▶ POSIX IPC APIs
▶ Linux memfd_create.
Note that it is not possible to share memory between different hosts and their devices using this
technique.
4.1.1.2 Performance Tuning
In order to achieve good performance with unified memory, it is important to:
▶ understand how paging works on your system, and how to avoid unnecessary page faults
▶ understand the various mechanisms allowing you to keep data local to the accessing processor
▶ consider tuning your application for the granularity of memory transfers of your system.
As general advice, performance hints (see Performance Hints) might provide improved performance, but using them incorrectly might degrade performance compared to the default behavior. Also note that any hint has a performance cost associated with it on the host, thus useful hints must at the very least improve performance enough to overcome this cost.
4.1.1.2.1 Memory Paging and Page Sizes
To better understand the performance implication of unified memory, it is important to understand virtual addressing, memory pages and page sizes. This sub-section attempts to define all necessary terms and explain why paging matters for performance.
All currently supported systems for unified memory use a virtual address space: this means that memory addresses used by an application represent a virtual location which might be mapped to a physical location where the memory actually resides.
All currently supported processors, including both CPUs and GPUs, additionally use memory paging. Because all systems use a virtual address space, there are two types of memory pages:
▶ Virtual pages: This represents a fixed-size contiguous chunk of virtual memory per process tracked by the operating system, which can be mapped into physical memory. Note that the virtual page is linked to the mapping: for example, a single virtual address might be mapped into physical memory using different page sizes.
▶ Physical pages: This represents a fixed-size contiguous chunk of memory the processor's main Memory Management Unit (MMU) supports and into which a virtual page can be mapped.
Currently, all x86_64 CPUs use a default physical page size of 4KiB. Arm CPUs support multiple physical page sizes - 4KiB, 16KiB, 32KiB and 64KiB - depending on the exact CPU. Finally, NVIDIA GPUs support multiple physical page sizes, but prefer 2MiB physical pages or larger. Note that these sizes are subject to change in future hardware.
The default page size of virtual pages usually corresponds to the physical page size, but an application may use different page sizes as long as they are supported by the operating system and the hardware. Typically, supported virtual page sizes must be powers of 2 and multiples of the physical page size.
The logical entity tracking the mapping of virtual pages into physical pages will be referred to as a page table, and each mapping of a given virtual page with a given virtual size to physical pages is called a Page Table Entry (PTE). All supported processors provide specific caches for the page table to speed up the translation of virtual addresses to physical addresses. These caches are called Translation Lookaside Buffers (TLBs).
There are two important aspects for performance tuning of applications:
▶ the choice of virtual page size,
▶ whether the system offers a combined page table used by both CPUs and GPUs, or separate page tables for each CPU and GPU individually.
4.1.1.2.1.1 Choosing the Right Page Size
In general, small page sizes lead to less (virtual) memory fragmentation but more TLB misses, whereas larger page sizes lead to more memory fragmentation but fewer TLB misses. Additionally, memory migration is generally more expensive with larger page sizes compared to smaller page sizes, because we typically migrate full memory pages. This can cause larger latency spikes in an application using large page sizes. See also the next section for more details on page faults.
One important aspect for performance tuning is that TLB misses are generally significantly more expensive on the GPU compared to the CPU. This means that if a GPU thread frequently accesses random locations of unified memory mapped using a small enough page size, it might be significantly slower compared to the same accesses to unified memory mapped using a large enough page size. While a similar effect might occur for a CPU thread randomly accessing a large area of memory mapped using a small page size, the slowdown is less pronounced, meaning that the application might want to trade off this slowdown against having less memory fragmentation.
Note that in general, applications should not tune their performance to the physical page size of a given processor, since physical page sizes are subject to change depending on the hardware. The advice above only applies to virtual page sizes.
4.1.1.2.1.2 CPU and GPU Page Tables: Hardware Coherency vs. Software Coherency
Hardware-coherent systems such as NVIDIA Grace Hopper offer a logically combined page table for both CPUs and GPUs. This is important because in order to access system-allocated memory from the GPU, the GPU uses whichever page table entry was created by the CPU for the requested memory. If that page table entry uses the default CPU page size of 4KiB or 64KiB, accesses to large virtual memory areas will cause significant TLB misses, thus significant slowdowns.
On the other hand, on software-coherent systems where the CPUs and GPUs each have their own logical page table, different performance tuning aspects should be considered: in order to guarantee coherency, these systems usually use page faults in case a processor accesses a memory address mapped into the physical memory of a different processor. Such a page fault means that:
▶ It needs to be ensured that the currently owning processor (where the physical page currently resides) cannot access this page anymore, either by deleting the page table entry or updating it.
▶ It needs to be ensured that the processor requesting access can access this page, either by creating a new page table entry or updating an existing entry, such that it becomes valid/active.
▶ The physical page backing this virtual page must be moved/migrated to the processor requesting access: this can be an expensive operation, and the amount of work is proportional to the page size.
Overall, hardware-coherent systems provide significant performance benefits compared to software-coherent systems in cases where frequent concurrent accesses to the same memory page are made by both CPU and GPU threads:
▶ fewer page faults: these systems do not need to use page faults for emulating coherency or migrating memory,
▶ less contention: these systems are coherent at cache-line granularity instead of page-size granularity, that is, when there is contention from multiple processors within a cache line, only the cache line is exchanged, which is much smaller than the smallest page size, and when the different processors access different cache lines within a page, then there is no contention.
This impacts the performance of the following scenarios:
▶ atomic updates to the same address concurrently from both CPUs and GPUs
▶ signaling a GPU thread from a CPU thread or vice-versa.
4.1.1.2.2 Direct Unified Memory Access from the Host
Some devices have hardware support for coherent reads, stores and atomic accesses from the host on GPU-resident unified memory. These devices have the attribute cudaDevAttrDirectManagedMemAccessFromHost set to 1. Note that all hardware-coherent systems have this attribute set for NVLink-connected devices. On these systems, the host has direct access to GPU-resident memory without page faults and data migration. Note that with CUDA managed memory, the cudaMemAdviseSetAccessedBy hint with location type cudaMemLocationTypeHost is necessary to enable this direct access without page faults, see example below.
System Allocator
__global__ void write(int *ret, int a, int b) {
    ret[threadIdx.x] = a + b + threadIdx.x;
}

__global__ void append(int *ret, int a, int b) {
    ret[threadIdx.x] += a + b + threadIdx.x;
}

void test_malloc() {
    int *ret = (int*)malloc(1000 * sizeof(int));
    // for shared page table systems, the following hint is not necessary
    cudaMemLocation location = {.type = cudaMemLocationTypeHost};
    cudaMemAdvise(ret, 1000 * sizeof(int), cudaMemAdviseSetAccessedBy, location);
    write<<< 1, 1000 >>>(ret, 10, 100);  // pages populated in GPU memory
    cudaDeviceSynchronize();
    for (int i = 0; i < 1000; i++)
        printf("%d: A+B = %d\n", i, ret[i]);
        // directManagedMemAccessFromHost=1: CPU accesses GPU memory directly without migrations
        // directManagedMemAccessFromHost=0: CPU faults and triggers device-to-host migrations
    append<<< 1, 1000 >>>(ret, 10, 100);
        // directManagedMemAccessFromHost=1: GPU accesses GPU memory without migrations
    cudaDeviceSynchronize();
        // directManagedMemAccessFromHost=0: GPU faults and triggers host-to-device migrations
    free(ret);
}
Managed
__global__ void write(int *ret, int a, int b) {
    ret[threadIdx.x] = a + b + threadIdx.x;
}

__global__ void append(int *ret, int a, int b) {
    ret[threadIdx.x] += a + b + threadIdx.x;
}

void test_managed() {
    int *ret;
    cudaMallocManaged(&ret, 1000 * sizeof(int));
    cudaMemLocation location = {.type = cudaMemLocationTypeHost};
    cudaMemAdvise(ret, 1000 * sizeof(int), cudaMemAdviseSetAccessedBy, location);  // set direct access hint
    write<<< 1, 1000 >>>(ret, 10, 100);  // pages populated in GPU memory
    cudaDeviceSynchronize();
    for (int i = 0; i < 1000; i++)
        printf("%d: A+B = %d\n", i, ret[i]);
        // directManagedMemAccessFromHost=1: CPU accesses GPU memory directly without migrations
        // directManagedMemAccessFromHost=0: CPU faults and triggers device-to-host migrations
    append<<< 1, 1000 >>>(ret, 10, 100);
        // directManagedMemAccessFromHost=1: GPU accesses GPU memory without migrations
    cudaDeviceSynchronize();
        // directManagedMemAccessFromHost=0: GPU faults and triggers host-to-device migrations
    cudaFree(ret);
}
After the write kernel has completed, ret will be created and initialized in GPU memory. Next, the CPU will access ret, followed by the append kernel using the same ret memory again. This code will show different behavior depending on the system architecture and support of hardware coherency:
▶ on systems with directManagedMemAccessFromHost=1: CPU accesses to the managed buffer will not trigger any migrations; the data will remain resident in GPU memory and any subsequent GPU kernels can continue to access it directly without incurring faults or migrations
▶ on systems with directManagedMemAccessFromHost=0: CPU accesses to the managed buffer will page fault and initiate data migration; any GPU kernel trying to access the same data for the first time will page fault and migrate pages back to GPU memory.
4.1.1.2.3 Host Native Atomics
Some devices, including NVLink-connected devices of hardware-coherent systems, support hardware-accelerated atomic accesses to CPU-resident memory. This implies that atomic accesses to host memory do not have to be emulated with a page fault. For these devices, the attribute cudaDevAttrHostNativeAtomicSupported is set to 1.
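A minimal sketch of checking this attribute at runtime before relying on native host atomics (device 0 is assumed; error checking is omitted for brevity):

```cuda
#include <cstdio>

int main() {
    int native_atomics = 0;
    // Query whether device 0 supports hardware-accelerated atomics
    // on CPU-resident memory.
    cudaDeviceGetAttribute(&native_atomics,
                           cudaDevAttrHostNativeAtomicSupported, 0);
    printf("host native atomics supported: %d\n", native_atomics);
    return 0;
}
```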
4.1.1.2.4 Atomic Accesses and Synchronization Primitives
CUDA unified memory supports all atomic operations available to host and device threads, enabling all threads to cooperate by concurrently accessing the same shared memory location. The libcu++ library provides many heterogeneous synchronization primitives tuned for concurrent use between host and device threads, including cuda::atomic, cuda::atomic_ref, cuda::barrier, cuda::semaphore, among many others.
On software-coherent systems, atomic accesses from the device to file-backed host memory are not supported. The following example code is valid on hardware-coherent systems but exhibits undefined behavior on other systems:
#include <cuda/atomic>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>

#define ERR(msg, ...) { fprintf(stderr, msg, ##__VA_ARGS__); return EXIT_FAILURE; }

__global__ void kernel(int* ptr) {
    cuda::atomic_ref{*ptr}.store(2);
}

int main() {
    // this will be closed/deleted by default on exit
    FILE* tmp_file = tmpfile64();
    // need to allocate space in the file, we do this with posix_fallocate here
    int status = posix_fallocate(fileno(tmp_file), 0, 4096);
    if (status != 0) ERR("Failed to allocate space in temp file\n");
    int* ptr = (int*)mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fileno(tmp_file), 0);
    if (ptr == MAP_FAILED) ERR("Failed to map temp file\n");
    // initialize the value in our file-backed memory
    *ptr = 1;
    printf("Atom value: %d\n", *ptr);
    // device and host thread access ptr concurrently, using cuda::atomic_ref
    kernel<<<1, 1>>>(ptr);
    while (cuda::atomic_ref{*ptr}.load() != 2);
    // this will always be 2
    printf("Atom value: %d\n", *ptr);
    return EXIT_SUCCESS;
}
On software-coherent systems, atomic accesses to unified memory may incur page faults which can lead to significant latencies. Note that this is not the case for all GPU atomics to CPU memory on these systems: operations listed by nvidia-smi -q | grep "Atomic Caps Outbound" may avoid page faults.
On hardware-coherent systems, atomics between host and device do not require page faults, but may still fault for other reasons that can cause any memory access to fault.
4.1.1.2.5 Memcpy()/Memset() Behavior With Unified Memory
cudaMemcpy*() and cudaMemset*() accept any unified memory pointer as arguments.
For cudaMemcpy*(), the direction specified as cudaMemcpyKind is a performance hint, which can have a higher performance impact if any of the arguments is a unified memory pointer.
Thus, it is recommended to follow the following performance advice:
▶ When the physical location of unified memory is known, use an accurate cudaMemcpyKind hint.
▶ Prefer cudaMemcpyDefault over an inaccurate cudaMemcpyKind hint.
▶ Always use populated (initialized) buffers: avoid using these APIs to initialize memory.
▶ Avoid using cudaMemcpy*() if both pointers point to system-allocated memory: launch a kernel or use a CPU memory copy algorithm such as std::memcpy instead.
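A hedged sketch of the second recommendation: when the current residency of a managed buffer is not known at the call site, cudaMemcpyDefault lets the driver infer the transfer direction from the pointer values instead of risking an inaccurate hint (buffer names and sizes here are illustrative):

```cuda
#include <cstdlib>
#include <cstring>

int main() {
    size_t n_bytes = 16 * sizeof(int);
    int *managed;
    cudaMallocManaged(&managed, n_bytes);
    int *host = (int*)malloc(n_bytes);
    memset(host, 0, n_bytes);  // populated source buffer, as recommended
    // Residency of 'managed' may be CPU or GPU at this point:
    // cudaMemcpyDefault lets the driver pick the right direction.
    cudaMemcpy(managed, host, n_bytes, cudaMemcpyDefault);
    free(host);
    cudaFree(managed);
    return 0;
}
```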
4.1.1.2.6 Overview of Memory Allocators for Unified Memory
For systems with full CUDA unified memory support, various different allocators may be used to allocate unified memory. The following table shows an overview of a selection of allocators with their respective features. Note that all information in this section is subject to change in future CUDA versions.
Table 7: Overview of unified memory support of different allocators

| API | Placement Policy | Accessible From | Migrate Based On Access [2] | Page Sizes [4][5] |
| --- | --- | --- | --- | --- |
| malloc, new, mmap | First touch/hint [1] | CPU, GPU | Yes [3] | System or huge page size [6] |
| cudaMallocManaged | First touch/hint | CPU, GPU | Yes | CPU resident: system page size; GPU resident: 2MB |
| cudaMalloc | GPU | GPU | No | GPU page size: 2MB |
| cudaMallocHost, cudaHostAlloc, cudaHostRegister | CPU | CPU, GPU | No | Mapped by CPU: system page size; Mapped by GPU: 2MB |
| Memory pools, location type host: cuMemCreate, cudaMemPoolCreate | CPU | CPU, GPU | No | Mapped by CPU: system page size; Mapped by GPU: 2MB |
| Memory pools, location type device: cuMemCreate, cudaMemPoolCreate, cudaMallocAsync | GPU | GPU | No | 2MB |
The table Overview of unified memory support of different allocators shows the difference in semantics of several allocators that may be considered to allocate data accessible from multiple processors at a time, including host and device. For additional details about cudaMemPoolCreate, see the Memory Pools section; for additional details about cuMemCreate, see the Virtual Memory Management section.
On hardware-coherent systems where device memory is exposed as a NUMA domain to the system, special allocators such as numa_alloc_on_node may be used to pin memory to the given NUMA node, either host or device. This memory is accessible from both host and device and does not migrate. Similarly, mbind can be used to pin memory to the given NUMA node(s), and can cause file-backed memory to be placed on the given NUMA node(s) before it is first accessed.
The following applies to allocators of memory that is shared:
[1] For mmap, file-backed memory is placed on the CPU by default, unless specified otherwise through cudaMemAdviseSetPreferredLocation (or mbind, see bullet points below).
[2] This feature can be overridden with cudaMemAdvise. Even if access-based migrations are disabled, if the backing memory space is full, memory might migrate.
[3] File-backed memory will not migrate based on access.
[4] The default system page size is 4KiB or 64KiB on most systems, unless a huge page size was explicitly specified (for example, with mmap MAP_HUGETLB/MAP_HUGE_SHIFT). In this case, any huge page size configured on the system is supported.
[5] Page sizes for GPU-resident memory may evolve in future CUDA versions.
[6] Currently huge page sizes may not be kept when migrating memory to the GPU or placing it through first-touch on the GPU.
▶ System allocators such as mmap allow sharing the memory between processes using the MAP_SHARED flag. This is supported in CUDA and can be used to share memory between different devices connected to the same host. However, this is currently not supported for sharing memory between multiple hosts and their respective devices. See Inter-Process Communication (IPC) with Unified Memory for details.
▶ For access to unified memory or other CUDA memory through a network on multiple hosts, consult the documentation of the communication library used, for example NCCL, NVSHMEM, OpenMPI, UCX, etc.
4.1.1.2.7 Access Counter Migration
On hardware-coherent systems, the access counters feature keeps track of the frequency of access that a GPU makes to memory located on other processors. This is needed to ensure memory pages are moved to the physical memory of the processor that is accessing the pages most frequently. It can guide migrations between CPU and GPU, as well as between peer GPUs, a process called access counter migration.
Starting with CUDA 12.4, access counters are supported for system-allocated memory. Note that file-backed memory does not migrate based on access. For system-allocated memory, access counter migration can be switched on by using the cudaMemAdviseSetAccessedBy hint to a device with the corresponding device id. If access counters are on, one can use cudaMemAdviseSetPreferredLocation set to host to prevent migrations. By default, cudaMallocManaged migrates based on a fault-and-migrate mechanism.7
The driver may also use access counters for more efficient thrashing mitigation or memory oversubscription scenarios.
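A hedged sketch of switching on access-counter migration for a system-allocated buffer, following the hint described above (device id 0 and the buffer size are illustrative; error checking is omitted):

```cuda
#include <cstdlib>

int main() {
    size_t n_bytes = 1 << 20;
    char *buf = (char*)malloc(n_bytes);  // system-allocated memory
    // Opt this range into access-counter migration toward device 0.
    cudaMemLocation dev_loc = {.type = cudaMemLocationTypeDevice, .id = 0};
    cudaMemAdvise(buf, n_bytes, cudaMemAdviseSetAccessedBy, dev_loc);
    // ... launch kernels on device 0 that access buf ...
    free(buf);
    return 0;
}
```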
4.1.1.2.8 Avoid Frequent Writes to GPU-Resident Memory from the CPU
If the host accesses unified memory, cache misses may introduce more traffic than expected between host and device. Many CPU architectures require all memory operations to go through the cache hierarchy, including writes. If system memory is resident on the GPU, this means that frequent writes by the CPU to this memory can cause cache misses, thus transferring the data first from the GPU to CPU before writing the actual value into the requested memory range. On software-coherent systems, this may introduce additional page faults, while on hardware-coherent systems, it may cause higher latencies between CPU operations. Thus, in order to share data produced by the host with the device, consider writing to CPU-resident memory and reading the values directly from the device. The code below shows how to achieve this with unified memory.
System Allocator
size_t data_size = sizeof(int);
int* data = (int*)malloc(data_size);
// ensure that data stays local to the host and avoid faults
cudaMemLocation location = {.type = cudaMemLocationTypeHost};
cudaMemAdvise(data, data_size, cudaMemAdviseSetPreferredLocation, location);
cudaMemAdvise(data, data_size, cudaMemAdviseSetAccessedBy, location);
// frequent exchanges of small data: if the CPU writes to CPU-resident memory,
// and GPU directly accesses that data, we can avoid the CPU caches re-loading
// data if it was evicted in between writes
for (int i = 0; i < 10; ++i) {
    *data = 42 + i;
    kernel<<<1, 1>>>(data);
    cudaDeviceSynchronize();
    // CPU cache potentially evicted data here
}
free(data);

7 Current systems allow the use of access-counter migration with managed memory when the accessed-by device hint is set. This is an implementation detail and should not be relied on for future compatibility.
Managed
int* data;
size_t data_size = sizeof(int);
cudaMallocManaged(&data, data_size);
// ensure that data stays local to the host and avoid faults
cudaMemLocation location = {.type = cudaMemLocationTypeHost};
cudaMemAdvise(data, data_size, cudaMemAdviseSetPreferredLocation, location);
cudaMemAdvise(data, data_size, cudaMemAdviseSetAccessedBy, location);
// frequent exchanges of small data: if the CPU writes to CPU-resident memory,
// and GPU directly accesses that data, we can avoid the CPU caches re-loading
// data if it was evicted in between writes
for (int i = 0; i < 10; ++i) {
    *data = 42 + i;
    kernel<<<1, 1>>>(data);
    cudaDeviceSynchronize();
    // CPU cache potentially evicted data here
}
cudaFree(data);
4.1.1.2.9 Exploiting Asynchronous Access to System Memory
If an application needs to share results from work on the device with the host, there are several possible options:
1. The device writes its result to GPU-resident memory, the result is transferred using cudaMemcpy*, and the host reads the transferred data.
2. The device directly writes its result to CPU-resident memory, and the host reads that data.
3. The device writes to GPU-resident memory, and the host directly accesses that data.
If independent work can be scheduled on the device while the result is transferred/accessed by the host, options 1 or 3 are preferred. If the device is starved until the host has accessed the result, option 2 might be preferred. This is because the device can generally write at a higher bandwidth than the host can read, unless many host threads are used to read the data.
1. Explicit Copy

void exchange_explicit_copy(cudaStream_t stream) {
    int* data, *host_data;
    size_t n_bytes = sizeof(int) * 16;
    // allocate receiving buffer
    host_data = (int*)malloc(n_bytes);
    // allocate, since we touch on the device first, will be GPU-resident
    cudaMallocManaged(&data, n_bytes);
    kernel<<<1, 16, 0, stream>>>(data);
    // launch independent work on the device
    // other_kernel<<<1024, 256, 0, stream>>>(other_data, ...);
    // transfer to host
    cudaMemcpyAsync(host_data, data, n_bytes, cudaMemcpyDeviceToHost, stream);
    // sync stream to ensure data has been transferred
    cudaStreamSynchronize(stream);
    // read transferred data
    printf("Got values %d - %d from GPU\n", host_data[0], host_data[15]);
    cudaFree(data);
    free(host_data);
}
2. Device Direct Write

void exchange_device_direct_write(cudaStream_t stream) {
    int* data;
    size_t n_bytes = sizeof(int) * 16;
    // allocate receiving buffer
    cudaMallocManaged(&data, n_bytes);
    // ensure that data is mapped and resident on the host
    cudaMemLocation location = {.type = cudaMemLocationTypeHost};
    cudaMemAdvise(data, n_bytes, cudaMemAdviseSetPreferredLocation, location);
    cudaMemAdvise(data, n_bytes, cudaMemAdviseSetAccessedBy, location);
    kernel<<<1, 16, 0, stream>>>(data);
    // sync stream to ensure data has been transferred
    cudaStreamSynchronize(stream);
    // read transferred data
    printf("Got values %d - %d from GPU\n", data[0], data[15]);
    cudaFree(data);
}
3. Host Direct Read

void exchange_host_direct_read(cudaStream_t stream) {
    int* data;
    size_t n_bytes = sizeof(int) * 16;
    // allocate receiving buffer
    cudaMallocManaged(&data, n_bytes);
    // ensure that data is mapped and resident on the device
    cudaMemLocation device_loc = {};
    cudaGetDevice(&device_loc.id);
    device_loc.type = cudaMemLocationTypeDevice;
    cudaMemAdvise(data, n_bytes, cudaMemAdviseSetPreferredLocation, device_loc);
    cudaMemAdvise(data, n_bytes, cudaMemAdviseSetAccessedBy, device_loc);
    kernel<<<1, 16, 0, stream>>>(data);
    // launch independent work on the GPU
    // other_kernel<<<1024, 256, 0, stream>>>(other_data, ...);
    // sync stream to ensure data may be accessed (has been written by device)
    cudaStreamSynchronize(stream);
    // read data directly from host
    printf("Got values %d - %d from GPU\n", data[0], data[15]);
    cudaFree(data);
}
Finally, in the Explicit Copy example above, instead of using cudaMemcpy* to transfer data, one could use a host or device kernel to perform this transfer explicitly. For contiguous data, using the CUDA copy-engines is preferred because operations performed by copy-engines can be overlapped with work on both the host and device. Copy-engines might be used in cudaMemcpy* and cudaMemPrefetchAsync APIs, but there is no guarantee that copy-engines are used with cudaMemcpy* API calls. For the same reason, explicit copy is preferred over direct host read for large enough data: if both host and device perform work that does not saturate their respective memory systems, the transfer can be performed by the copy-engines concurrently with the work performed by both host and device.
Copy-engines are generally used for both transfers between host and device as well as between peer devices within an NVLink-connected system. Due to the limited total number of copy-engines, some systems may have a lower bandwidth of cudaMemcpy* compared to using the device to explicitly perform the transfer. In such a case, if the transfer is in the critical path of the application, it may be preferred to use an explicit device-based transfer.
4.1.2. Unified Memory on Devices with only CUDA Managed Memory Support
For devices with compute capability 6.x or higher but without pageable memory access (see table Overview of Unified Memory Paradigms), CUDA managed memory is fully supported and coherent, but the GPU cannot access system-allocated memory. The programming model and performance tuning of unified memory is largely similar to the model described in the section Unified Memory on Devices with Full CUDA Unified Memory Support, with the notable exception that system allocators cannot be used to allocate memory. Thus, the following sub-sections do not apply:
▶ Unified Memory: In-Depth Examples
▶ CPU and GPU Page Tables: Hardware Coherency vs. Software Coherency
▶ Atomic Accesses and Synchronization Primitives
▶ Access Counter Migration
▶ Avoid Frequent Writes to GPU-Resident Memory from the CPU
▶ Exploiting Asynchronous Access to System Memory
4.1.3. Unified Memory on Windows, WSL, and Tegra
Note
This section only applies to devices with compute capability lower than 6.0, or to Windows platforms, i.e., devices with the concurrentManagedAccess property set to 0.

Devices with compute capability lower than 6.0, or on Windows platforms, devices with the concurrentManagedAccess property set to 0 (see Overview of Unified Memory Paradigms), support CUDA managed memory with the following limitations:
▶ Data Migration and Coherency: Fine-grained movement of the managed data to the GPU on demand is not supported. Whenever a GPU kernel is launched, all managed memory generally has to be transferred to GPU memory to avoid faulting on memory access. Page faulting is only supported from the CPU side.
▶ GPU Memory Oversubscription: These devices cannot allocate more managed memory than the physical size of GPU memory.
▶ Coherency and Concurrency: Simultaneous access to managed memory is not possible, because coherence could not be guaranteed if the CPU accessed a unified memory allocation while a GPU kernel is active, given the missing GPU page-faulting mechanism.
4.1.3.1 Multi-GPU
On systems with devices of compute capabilities lower than 6.0, or on Windows platforms, managed allocations are automatically visible to all GPUs in a system via the peer-to-peer capabilities of the GPUs.

Managed memory allocations behave similarly to unmanaged memory allocated using cudaMalloc(): the current active device is the home for the physical allocation, but other GPUs in the system will access the memory at reduced bandwidth over the PCIe bus.

On Linux the managed memory is allocated in GPU memory as long as all GPUs that are actively being used by a program have peer-to-peer support. If at any time the application starts using a GPU that doesn't have peer-to-peer support with any of the other GPUs that have managed allocations on them, then the driver will migrate all managed allocations to system memory. In this case, all GPUs experience PCIe bandwidth restrictions.

On Windows, if peer mappings are not available (for example, between GPUs of different architectures), then the system will automatically fall back to using mapped memory, regardless of whether both GPUs are actually used by a program. If only one GPU is actually going to be used, it is necessary to set the CUDA_VISIBLE_DEVICES environment variable before launching the program. This constrains which GPUs are visible and allows managed memory to be allocated in GPU memory.

Alternatively, on Windows users can also set CUDA_MANAGED_FORCE_DEVICE_ALLOC to a non-zero value to force the driver to always use device memory for physical storage. When this environment variable is set to a non-zero value, all devices used in that process that support managed memory have to be peer-to-peer compatible with each other. The error cudaErrorInvalidDevice will be returned if a device that supports managed memory is used and it is not peer-to-peer compatible with any of the other managed-memory-supporting devices that were previously used in that process, even if cudaDeviceReset has been called on those devices. These environment variables are described in CUDA Environment Variables.
4.1.3.2 Coherency and Concurrency
To ensure coherency, the unified memory programming model puts constraints on data accesses while both the CPU and GPU are executing concurrently. In effect, the GPU has exclusive access to all managed data while any kernel operation is executing, and the CPU is not permitted to access it, regardless of whether the specific kernel is actively using the data. Concurrent CPU/GPU accesses, even to different managed memory allocations, will cause a segmentation fault because the page is considered inaccessible to the CPU.

For example, the following code runs successfully on devices of compute capability 6.x, due to the GPU page-faulting capability which lifts all restrictions on simultaneous access, but fails on pre-6.x architectures and Windows platforms because the GPU kernel is still active when the CPU touches y:
__device__ __managed__ int x, y=2;

__global__ void kernel() {
    x = 10;
}

int main() {
    kernel<<< 1, 1 >>>();
    y = 20;            // Error on GPUs not supporting concurrent access
    cudaDeviceSynchronize();
    return 0;
}
The program must explicitly synchronize with the GPU before accessing y, regardless of whether the GPU kernel actually touches y (or any managed data at all):
__device__ __managed__ int x, y=2;

__global__ void kernel() {
    x = 10;
}

int main() {
    kernel<<< 1, 1 >>>();
    cudaDeviceSynchronize();
    y = 20;            // Success on GPUs not supporting concurrent access
    return 0;
}
Note that any function call that logically guarantees that the GPU completes its work can be used for this synchronization; see Explicit Synchronization.

Note that if memory is dynamically allocated with cudaMallocManaged() or cuMemAllocManaged() while the GPU is active, the behavior of the memory is unspecified until additional work is launched or the GPU is synchronized. Attempting to access the memory on the CPU during this time may or may not cause a segmentation fault. This does not apply to memory allocated using the flag cudaMemAttachHost or CU_MEM_ATTACH_HOST.
4.1.3.3 Stream Associated Unified Memory
The CUDA programming model provides streams as a mechanism for programs to indicate dependence and independence among kernel launches. Kernels launched into the same stream are guaranteed to execute consecutively, while kernels launched into different streams are permitted to execute concurrently. See section CUDA Streams.
4.1.3.3.1 Stream Callbacks
It is legal for the CPU to access managed data from within a stream callback, provided no other stream that could potentially be accessing managed data is active on the GPU. In addition, a callback that is not followed by any device work can be used for synchronization: for example, by signaling a condition variable from inside the callback; otherwise, CPU access is valid only for the duration of the callback(s).

There are several important points of note:
1. It is always permitted for the CPU to access non-managed mapped memory data while the GPU is active.
2. The GPU is considered active when it is running any kernel, even if that kernel does not make use of managed data. If a kernel might use data, then access is forbidden.
3. There are no constraints on concurrent inter-GPU access of managed memory, other than those that apply to multi-GPU access of non-managed memory.
4. There are no constraints on concurrent GPU kernels accessing managed data.

Note how the last point allows for races between GPU kernels, as is currently the case for non-managed GPU memory. From the perspective of the GPU, managed memory functions identically to non-managed memory. The following code example illustrates these points:
int main() {
    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    int *non_managed, *managed, *also_managed;
    cudaMallocHost(&non_managed, 4);  // Non-managed, CPU-accessible memory
    cudaMallocManaged(&managed, 4);
    cudaMallocManaged(&also_managed, 4);
    // Point 1: CPU can access non-managed data.
    kernel<<< 1, 1, 0, stream1 >>>(managed);
    *non_managed = 1;
    // Point 2: CPU cannot access any managed data while GPU is busy,
    //          unless concurrentManagedAccess = 1
    // Note we have not yet synchronized, so "kernel" is still active.
    *also_managed = 2;      // Will issue segmentation fault
    // Point 3: Concurrent GPU kernels can access the same data.
    kernel<<< 1, 1, 0, stream2 >>>(managed);
    // Point 4: Multi-GPU concurrent access is also permitted.
    cudaSetDevice(1);
    kernel<<< 1, 1 >>>(managed);
    return 0;
}
4.1.3.3.2 Managed memory associated to streams allows for finer-grained control
Unified memory builds upon the stream-independence model by allowing a CUDA program to explicitly associate managed allocations with a CUDA stream. In this way, the programmer indicates the use of data by kernels based on whether they are launched into a specified stream or not. This enables opportunities for concurrency based on program-specific data access patterns. The function to control this behavior is:
cudaError_t cudaStreamAttachMemAsync(cudaStream_t stream,
                                     void *ptr,
                                     size_t length=0,
                                     unsigned int flags=0);
The cudaStreamAttachMemAsync() function associates length bytes of memory starting from ptr with the specified stream. This allows CPU access to that memory region as long as all operations in stream have completed, regardless of whether other streams are active. In effect, this constrains exclusive ownership of the managed memory region by an active GPU to per-stream activity instead of whole-GPU activity. Most importantly, if an allocation is not associated with a specific stream, it is visible to all running kernels regardless of their stream. This is the default visibility for a cudaMallocManaged() allocation or a __managed__ variable; hence, the simple-case rule that the CPU may not touch the data while any kernel is running.
Note
By associating an allocation with a specific stream, the program makes a guarantee that only kernels launched into that stream will touch that data. No error checking is performed by the unified memory system.

Note
In addition to allowing greater concurrency, the use of cudaStreamAttachMemAsync() can enable data transfer optimizations within the unified memory system that may affect latencies and other overhead.
The following example shows how to explicitly associate y with host accessibility, thus enabling access at all times from the CPU. (Note the absence of cudaDeviceSynchronize() after the kernel call.) Accesses to y by the GPU running kernel will now produce undefined results.
__device__ __managed__ int x, y=2;

__global__ void kernel() {
    x = 10;
}

int main() {
    cudaStream_t stream1;
    cudaStreamCreate(&stream1);
    cudaStreamAttachMemAsync(stream1, &y, 0, cudaMemAttachHost);
    cudaDeviceSynchronize();          // Wait for Host attachment to occur.
    kernel<<< 1, 1, 0, stream1 >>>(); // Note: Launches into stream1.
    y = 20;                           // Success - a kernel is running but "y"
                                      // has been associated with no stream.
    return 0;
}
4.1.3.3.3 A more elaborate example on multithreaded host programs
The primary use for cudaStreamAttachMemAsync() is to enable independent task parallelism using CPU threads. Typically in such a program, a CPU thread creates its own stream for all work that it generates because using CUDA's NULL stream would cause dependencies between threads. The default global visibility of managed data to any GPU stream can make it difficult to avoid interactions between CPU threads in a multi-threaded program. Function cudaStreamAttachMemAsync() is therefore used to associate a thread's managed allocations with that thread's own stream, and the association is typically not changed for the life of the thread. Such a program would simply add a single call to cudaStreamAttachMemAsync() to use unified memory for its data accesses:
// This function performs some task, in its own private stream,
// and can be run in parallel
void run_task(int *in, int *out, int length) {
    // Create a stream for us to use.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    // Allocate some managed data and associate with our stream.
    // Note the use of the host-attach flag to cudaMallocManaged();
    // we then associate the allocation with our stream so that
    // our GPU kernel launches can access it.
    int *data;
    cudaMallocManaged((void **)&data, length, cudaMemAttachHost);
    cudaStreamAttachMemAsync(stream, data);
    cudaStreamSynchronize(stream);
    // Iterate on the data in some way, using both Host & Device.
    for(int i=0; i<N; i++) {
        transform<<< 100, 256, 0, stream >>>(in, data, length);
        cudaStreamSynchronize(stream);
        host_process(data, length);    // CPU uses managed data.
        convert<<< 100, 256, 0, stream >>>(out, data, length);
    }
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(data);
}
In this example, the allocation-stream association is established just once, and then data is used repeatedly by both the host and device. The result is much simpler code than occurs with explicitly copying data between host and device, although the result is the same.

The function cudaMallocManaged() specifies the cudaMemAttachHost flag, which creates an allocation that is initially invisible to device-side execution. (The default allocation would be visible to all GPU kernels on all streams.) This ensures that there is no accidental interaction with another thread's execution in the interval between the data allocation and when the data is acquired for a specific stream.

Without this flag, a new allocation would be considered in use on the GPU if a kernel launched by another thread happens to be running. This might impact the thread's ability to access the newly allocated data from the CPU before it is able to explicitly attach it to a private stream. To enable safe independence between threads, therefore, allocations should be made specifying this flag.

An alternative would be to place a process-wide barrier across all threads after the allocation has been attached to the stream. This would ensure that all threads complete their data/stream associations before any kernels are launched, avoiding the hazard. A second barrier would be needed before the stream is destroyed because stream destruction causes allocations to revert to their default visibility. The cudaMemAttachHost flag exists both to simplify this process, and because it is not always possible to insert global barriers where required.
4.1.3.3.4 Data Movement of Stream Associated Unified Memory
Memcpy()/Memset() with stream-associated unified memory behaves differently on devices where concurrentManagedAccess is not set; the following rules apply:

If cudaMemcpyHostTo* is specified and the source data is unified memory, then it will be accessed from the host if it is coherently accessible from the host in the copy stream (1); otherwise it will be accessed from the device. Similar rules apply to the destination when cudaMemcpy*ToHost is specified and the destination is unified memory.

If cudaMemcpyDeviceTo* is specified and the source data is unified memory, then it will be accessed from the device. The source must be coherently accessible from the device in the copy stream (2); otherwise, an error is returned. Similar rules apply to the destination when cudaMemcpy*ToDevice is specified and the destination is unified memory.

If cudaMemcpyDefault is specified, then unified memory will be accessed from the host either if it cannot be coherently accessed from the device in the copy stream (2) or if the preferred location for the data is cudaCpuDeviceId and it can be coherently accessed from the host in the copy stream (1); otherwise, it will be accessed from the device.

When using cudaMemset*() with unified memory, the data must be coherently accessible from the device in the stream being used for the cudaMemset() operation (2); otherwise, an error is returned.

When data is accessed from the device either by cudaMemcpy* or cudaMemset*, the stream of operation is considered to be active on the GPU. During this time, any CPU access of data that is associated with that stream, or of data that has global visibility, will result in a segmentation fault if the GPU has a zero value for the device attribute concurrentManagedAccess. The program must synchronize appropriately to ensure the operation has completed before accessing any associated data from the CPU.

1. Coherently accessible from the host in a given stream means that the memory neither has global visibility nor is it associated with the given stream.
2. Coherently accessible from the device in a given stream means that the memory either has global visibility or is associated with the given stream.
4.1.4. Performance Hints
Performance hints allow programmers to provide CUDA with more information about unified memory usage. CUDA uses performance hints to manage memory more efficiently and improve application performance. Performance hints never impact the correctness of an application; they only affect performance.
Note
Applications should only use unified memory performance hints if they improve performance.
Performance hints may be used on any unified memory allocation, including CUDA managed memory. On systems with full CUDA unified memory support, performance hints can be applied to all system-allocated memory.
4.1.4.1 Data Prefetching
The cudaMemPrefetchAsync API is an asynchronous stream-ordered API that may migrate data to reside closer to the specified processor. The data may be accessed while it is being prefetched. The migration does not begin until all prior operations in the stream have completed, and completes before any subsequent operation in the stream.
cudaError_t cudaMemPrefetchAsync(const void *devPtr,
                                 size_t count,
                                 struct cudaMemLocation location,
                                 unsigned int flags,
                                 cudaStream_t stream=0);

A memory region containing [devPtr, devPtr + count) may be migrated to the destination device location.id if location.type is cudaMemLocationTypeDevice, or to the CPU if location.type is cudaMemLocationTypeHost, when the prefetch task is executed in the given stream. For details on flags, see the current CUDA Runtime API documentation.
Consider the simple code example below:

System Allocator
void test_prefetch_sam(const cudaStream_t& s) {
    // initialize data on CPU
    char *data = (char*)malloc(dataSizeBytes);
    init_data(data, dataSizeBytes);
    cudaMemLocation location = {.type = cudaMemLocationTypeDevice, .id = myGpuId};
    // encourage data to move to GPU before use
    const unsigned int flags = 0;
    cudaMemPrefetchAsync(data, dataSizeBytes, location, flags, s);
    // use data on GPU
    const unsigned num_blocks = (dataSizeBytes + threadsPerBlock - 1) / threadsPerBlock;
    mykernel<<<num_blocks, threadsPerBlock, 0, s>>>(data, dataSizeBytes);
    // encourage data to move back to CPU
    location = {.type = cudaMemLocationTypeHost};
    cudaMemPrefetchAsync(data, dataSizeBytes, location, flags, s);
    cudaStreamSynchronize(s);
    // use data on CPU
    use_data(data, dataSizeBytes);
    free(data);
}
Managed
void test_prefetch_managed(const cudaStream_t& s) {
    // initialize data on CPU
    char *data;
    cudaMallocManaged(&data, dataSizeBytes);
    init_data(data, dataSizeBytes);
    cudaMemLocation location = {.type = cudaMemLocationTypeDevice, .id = myGpuId};
    // encourage data to move to GPU before use
    const unsigned int flags = 0;
    cudaMemPrefetchAsync(data, dataSizeBytes, location, flags, s);
    // use data on GPU
    const unsigned num_blocks = (dataSizeBytes + threadsPerBlock - 1) / threadsPerBlock;
    mykernel<<<num_blocks, threadsPerBlock, 0, s>>>(data, dataSizeBytes);
    // encourage data to move back to CPU
    location = {.type = cudaMemLocationTypeHost};
    cudaMemPrefetchAsync(data, dataSizeBytes, location, flags, s);
    cudaStreamSynchronize(s);
    // use data on CPU
    use_data(data, dataSizeBytes);
    cudaFree(data);
}
4.1.4.2 Data Usage Hints
When multiple processors simultaneously access the same data, cudaMemAdvise may be used to hint how the data at [devPtr, devPtr + count) will be accessed:

cudaError_t cudaMemAdvise(const void *devPtr,
                          size_t count,
                          enum cudaMemoryAdvise advice,
                          struct cudaMemLocation location);
The following constants are used by the examples below:

static const int maxDevices       = 1;
static const int maxOuterLoopIter = 3;
static const int maxInnerLoopIter = 4;
Where advice may take the following values:
▶ cudaMemAdviseSetReadMostly: This implies that the data is mostly going to be read from and only occasionally written to. In general, it allows trading off read bandwidth for write bandwidth on this region.
▶ cudaMemAdviseSetPreferredLocation: This hint sets the preferred location for the data to be the specified device's physical memory. The hint encourages the system to keep the data at the preferred location, but does not guarantee it. Passing in a value of cudaMemLocationTypeHost for location.type sets the preferred location as CPU memory. Other hints, like cudaMemPrefetchAsync, may override this hint and allow the memory to migrate away from its preferred location.
▶ cudaMemAdviseSetAccessedBy: In some systems, it may be beneficial for performance to establish a mapping into memory before accessing the data from a given processor. This hint tells the system that the data will be frequently accessed by location.id when location.type is cudaMemLocationTypeDevice, enabling the system to assume that creating these mappings pays off. This hint does not imply where the data should reside, but it can be combined with cudaMemAdviseSetPreferredLocation to specify that. On hardware-coherent systems, this hint switches on access counter migration; see Access Counter Migration.

Each advice can also be unset by using one of the following values: cudaMemAdviseUnsetReadMostly, cudaMemAdviseUnsetPreferredLocation, and cudaMemAdviseUnsetAccessedBy.

The example shows how to use cudaMemAdvise:

System Allocator
void test_advise_sam(cudaStream_t stream) {
    char *dataPtr;
    size_t dataSize = 64 * threadsPerBlock;  // 16 KiB
    // Allocate memory using malloc or cudaMallocManaged
    dataPtr = (char*)malloc(dataSize);
    // Set the advice on the memory region
    cudaMemLocation loc = {.type = cudaMemLocationTypeDevice, .id = myGpuId};
    cudaMemAdvise(dataPtr, dataSize, cudaMemAdviseSetReadMostly, loc);
    int outerLoopIter = 0;
    while (outerLoopIter < maxOuterLoopIter) {
        // The data is written by the CPU each outer loop iteration
        init_data(dataPtr, dataSize);
        // The data is made available to all GPUs by prefetching.
        // Prefetching here causes read duplication of data instead
        // of data migration
        cudaMemLocation location;
        location.type = cudaMemLocationTypeDevice;
        for (int device = 0; device < maxDevices; device++) {
            location.id = device;
            const unsigned int flags = 0;
            cudaMemPrefetchAsync(dataPtr, dataSize, location, flags, stream);
        }
        // The kernel only reads this data in the inner loop
        int innerLoopIter = 0;
        while (innerLoopIter < maxInnerLoopIter) {
            mykernel<<<32, threadsPerBlock, 0, stream>>>((const char *)dataPtr, dataSize);
            innerLoopIter++;
        }
        outerLoopIter++;
    }
    free(dataPtr);
}
Managed
void test_advise_managed(cudaStream_t stream) {
    char *dataPtr;
    size_t dataSize = 64 * threadsPerBlock;  // 16 KiB
    // Allocate memory using cudaMallocManaged
    // (malloc may be used on systems with full CUDA Unified memory support)
    cudaMallocManaged(&dataPtr, dataSize);
    // Set the advice on the memory region
    cudaMemLocation loc = {.type = cudaMemLocationTypeDevice, .id = myGpuId};
    cudaMemAdvise(dataPtr, dataSize, cudaMemAdviseSetReadMostly, loc);
    int outerLoopIter = 0;
    while (outerLoopIter < maxOuterLoopIter) {
        // The data is written by the CPU each outer loop iteration
        init_data(dataPtr, dataSize);
        // The data is made available to all GPUs by prefetching.
        // Prefetching here causes read duplication of data instead
        // of data migration
        cudaMemLocation location;
        location.type = cudaMemLocationTypeDevice;
        for (int device = 0; device < maxDevices; device++) {
            location.id = device;
            const unsigned int flags = 0;
            cudaMemPrefetchAsync(dataPtr, dataSize, location, flags, stream);
        }
        // The kernel only reads this data in the inner loop
        int innerLoopIter = 0;
        while (innerLoopIter < maxInnerLoopIter) {
            mykernel<<<32, threadsPerBlock, 0, stream>>>((const char *)dataPtr, dataSize);
            innerLoopIter++;
        }
        outerLoopIter++;
    }
    cudaFree(dataPtr);
}
4.1.4.3 Querying Data Usage Attributes on Managed Memory

A program can query memory range attributes assigned through cudaMemAdvise or cudaMemPrefetchAsync on CUDA managed memory by using the following API:
cudaMemRangeGetAttribute(void *data,
                         size_t dataSize,
                         enum cudaMemRangeAttribute attribute,
                         const void *devPtr,
                         size_t count);
This function queries an attribute of the memory range starting at devPtr with a size of count bytes. The memory range must refer to managed memory allocated via cudaMallocManaged or declared via __managed__ variables. It is possible to query the following attributes:
▶ cudaMemRangeAttributeReadMostly: returns 1 if the entire memory range has the cudaMemAdviseSetReadMostly attribute set, or 0 otherwise.
▶ cudaMemRangeAttributePreferredLocation: the result returned will be a GPU device id or cudaCpuDeviceId if the entire memory range has the corresponding processor as preferred location, otherwise cudaInvalidDeviceId will be returned. An application can use this query API to make decisions about staging data through the CPU or GPU depending on the preferred location attribute of the managed pointer. Note that the actual location of the memory range at the time of the query may be different from the preferred location.
▶ cudaMemRangeAttributeAccessedBy: will return the list of devices that have that advice set for that memory range.
▶ cudaMemRangeAttributeLastPrefetchLocation: will return the last location to which the memory range was prefetched explicitly using cudaMemPrefetchAsync. Note that this simply returns the last location that the application requested to prefetch the memory range to. It gives no indication as to whether the prefetch operation to that location has completed or even begun.
▶ cudaMemRangeAttributePreferredLocationType: it returns the location type of the preferred location with the following values:
  ▶ cudaMemLocationTypeDevice: if all pages in the memory range have the same GPU as their preferred location,
  ▶ cudaMemLocationTypeHost: if all pages in the memory range have the CPU as their preferred location,
  ▶ cudaMemLocationTypeHostNuma: if all the pages in the memory range have the same host NUMA node ID as their preferred location,
  ▶ cudaMemLocationTypeInvalid: if either all the pages don't have the same preferred location or some of the pages don't have a preferred location at all.
▶ cudaMemRangeAttributePreferredLocationId: returns the device ordinal if the cudaMemRangeAttributePreferredLocationType query for the same address range returns cudaMemLocationTypeDevice. If the preferred location type is a host NUMA node, it returns the host NUMA node ID. Otherwise, the id should be ignored.
▶ cudaMemRangeAttributeLastPrefetchLocationType: returns the last location type to which all pages in the memory range were prefetched explicitly via cudaMemPrefetchAsync. The following values are returned:
  ▶ cudaMemLocationTypeDevice: if all pages in the memory range were prefetched to the same GPU,
  ▶ cudaMemLocationTypeHost: if all pages in the memory range were prefetched to the CPU,
  ▶ cudaMemLocationTypeHostNuma: if all the pages in the memory range were prefetched to the same host NUMA node ID,
  ▶ cudaMemLocationTypeInvalid: if either all the pages were not prefetched to the same location or some of the pages were never prefetched at all.
▶ cudaMemRangeAttributeLastPrefetchLocationId: if the cudaMemRangeAttributeLastPrefetchLocationType query for the same address range returns cudaMemLocationTypeDevice, it will be a valid device ordinal, or if it returns cudaMemLocationTypeHostNuma, it will be a valid host NUMA node ID. Otherwise, the id should be ignored.
Additionally, multiple attributes can be queried at once by using the corresponding cudaMemRangeGetAttributes function.
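As an illustrative sketch (error checking omitted; `managedPtr` and `bytes` are assumed to stand for an existing cudaMallocManaged allocation and its size), the read-mostly and preferred-location attributes of a range can be queried like this:

```cuda
// Hedged sketch: query advice/prefetch attributes of a managed range.
// Assumes managedPtr/bytes describe an existing cudaMallocManaged allocation.
int readMostly = 0;
cudaMemRangeGetAttribute(&readMostly, sizeof(readMostly),
                         cudaMemRangeAttributeReadMostly,
                         managedPtr, bytes);

int preferredLocType = 0;
cudaMemRangeGetAttribute(&preferredLocType, sizeof(preferredLocType),
                         cudaMemRangeAttributePreferredLocationType,
                         managedPtr, bytes);

if (readMostly) {
    // The whole range carries cudaMemAdviseSetReadMostly, so prefetches
    // will duplicate read-only copies rather than migrate the pages.
}
if (preferredLocType == cudaMemLocationTypeDevice) {
    // Only meaningful when the type query returned a device location.
    int preferredDevice = 0;
    cudaMemRangeGetAttribute(&preferredDevice, sizeof(preferredDevice),
                             cudaMemRangeAttributePreferredLocationId,
                             managedPtr, bytes);
}
```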
4.1.4.4 GPU Memory Oversubscription

Unified memory enables applications to oversubscribe the memory of any individual processor: in other words, they can allocate and share arrays larger than the memory capacity of any individual processor in the system, enabling, among other things, out-of-core processing of data sets that do not fit within a single GPU, without adding significant complexity to the programming model.
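A minimal oversubscription sketch (error checking omitted; assumes a system with Unified Memory support) allocates a managed buffer larger than the GPU's physical memory and lets the driver page it on demand:

```cuda
// Hedged sketch: allocate a managed buffer larger than GPU memory.
size_t freeBytes, totalBytes;
cudaMemGetInfo(&freeBytes, &totalBytes);

// Oversubscribe: request more than the device physically has.
size_t oversubscribedSize = totalBytes + (totalBytes / 2);
char *bigBuffer;
cudaMallocManaged(&bigBuffer, oversubscribedSize);

// Kernels touching bigBuffer cause pages to be migrated and evicted on
// demand; the working set, not the allocation, must fit in GPU memory.
```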
4.2. CUDA Graphs
CUDA Graphs present another model for work submission in CUDA. A graph is a series of operations, such as kernel launches, data movement, etc., connected by dependencies, which is defined separately from its execution. This allows a graph to be defined once and then launched repeatedly. Separating out the definition of a graph from its execution enables a number of optimizations: first, CPU launch costs are reduced compared to streams, because much of the setup is done in advance; second, presenting the whole workflow to CUDA enables optimizations which might not be possible with the piecewise work submission mechanism of streams.

To see the optimizations possible with graphs, consider what happens in a stream: when you place a kernel into a stream, the host driver performs a sequence of operations in preparation for the execution of the kernel on the GPU. These operations, necessary for setting up and launching the kernel, are an overhead cost which must be paid for each kernel that is issued. For a GPU kernel with a short execution time, this overhead cost can be a significant fraction of the overall end-to-end execution time. By creating a CUDA graph that encompasses a workflow that will be launched many times, these overhead costs can be paid once for the entire graph during instantiation, and the graph itself can then be launched repeatedly with very little overhead.
4.2.1. Graph Structure
An operation forms a node in a graph. The dependencies between the operations are the edges. These dependencies constrain the execution sequence of the operations.

An operation may be scheduled at any time once the nodes on which it depends are complete. Scheduling is left up to the CUDA system.
4.2.1.1 Node Types

A graph node can be one of:
▶ kernel
▶ CPU function call
▶ memory copy
▶ memset
▶ empty node
▶ waiting on a CUDA event
▶ recording a CUDA event
▶ signalling an external semaphore
▶ waiting on an external semaphore
▶ conditional node
▶ memory node
▶ child graph: to execute a separate nested graph, as shown in the following figure.
Figure 21: Child Graph Example
4.2.1.2 Edge Data

CUDA 12.3 introduced edge data on CUDA Graphs. At this time, the only use for non-default edge data is enabling Programmatic Dependent Launch.

Generally speaking, edge data modifies a dependency specified by an edge and consists of three parts: an outgoing port, an incoming port, and a type. An outgoing port specifies when an associated edge is triggered. An incoming port specifies what portion of a node is dependent on an associated edge. A type modifies the relation between the endpoints.

Port values are specific to node type and direction, and edge types may be restricted to specific node types. In all cases, zero-initialized edge data represents default behavior. Outgoing port 0 waits on an entire task, incoming port 0 blocks an entire task, and edge type 0 is associated with a full dependency with memory synchronizing behavior.

Edge data is optionally specified in various graph APIs via a parallel array to the associated nodes. If it is omitted as an input parameter, zero-initialized data is used. If it is omitted as an output (query) parameter, the API accepts this if the edge data being ignored is all zero-initialized, and returns cudaErrorLossyQuery if the call would discard information.

Edge data is also available in some stream capture APIs: cudaStreamBeginCaptureToGraph(), cudaStreamGetCaptureInfo(), and cudaStreamUpdateCaptureDependencies(). In these cases, there is not yet a downstream node. The data is associated with a dangling edge (half edge) which will either be connected to a future captured node or discarded at termination of stream capture. Note that some edge types do not wait on full completion of the upstream node. These edges are ignored when considering if a stream capture has been fully rejoined to the origin stream, and cannot be discarded at the end of capture. See Stream Capture.

No node types define additional incoming ports, and only kernel nodes define additional outgoing ports. There is one non-default dependency type, cudaGraphDependencyTypeProgrammatic, which is used to enable Programmatic Dependent Launch between two kernel nodes.
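As an illustrative sketch (the node handles `producer` and `consumer` are assumed to be existing kernel nodes in `graph`; field and enum names follow the cudaGraphEdgeData structure as of CUDA 12.3), a programmatic dependency between two kernel nodes could be expressed as:

```cuda
// Hedged sketch: connect two kernel nodes with a programmatic
// (non-default) dependency instead of a full completion dependency.
cudaGraphEdgeData edge = {};  // zero-initialized = default full dependency
edge.from_port = cudaGraphKernelNodePortProgrammatic;
edge.type = cudaGraphDependencyTypeProgrammatic;

// Assumed: producer and consumer are existing kernel nodes in graph.
cudaGraphAddDependencies_v2(graph, &producer, &consumer, &edge, 1);
```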
4.2.2. Building and Running Graphs

Work submission using graphs is separated into three distinct stages: definition, instantiation, and execution.
▶ During the definition or creation phase, a program creates a description of the operations in the graph along with the dependencies between them.
▶ Instantiation takes a snapshot of the graph template, validates it, and performs much of the setup and initialization of work with the aim of minimizing what needs to be done at launch. The resulting instance is known as an executable graph.
▶ An executable graph may be launched into a stream, similar to any other CUDA work. It may be launched any number of times without repeating the instantiation.
4.2.2.1 Graph Creation

Graphs can be created via two mechanisms: using the explicit graph API and via stream capture.

4.2.2.1.1 Graph APIs

The following is an example (omitting declarations and other boilerplate code) of creating the below graph. Note the use of cudaGraphCreate() to create the graph and cudaGraphAddNode() to add the kernel nodes and their dependencies. The CUDA Runtime API documentation lists all the functions available for adding nodes and dependencies.
Figure 22: Creating a Graph Using Graph APIs Example

// Create the graph - it starts out empty
cudaGraphCreate(&graph, 0);

// Create the nodes and their dependencies
cudaGraphNode_t nodes[4];
cudaGraphNodeParams kParams = { cudaGraphNodeTypeKernel };
kParams.kernel.func = (void *)kernelName;
kParams.kernel.gridDim.x = kParams.kernel.gridDim.y = kParams.kernel.gridDim.z = 1;
kParams.kernel.blockDim.x = kParams.kernel.blockDim.y = kParams.kernel.blockDim.z = 1;
cudaGraphAddNode(&nodes[0], graph, NULL, NULL, 0, &kParams);
cudaGraphAddNode(&nodes[1], graph, &nodes[0], NULL, 1, &kParams);
cudaGraphAddNode(&nodes[2], graph, &nodes[0], NULL, 1, &kParams);
cudaGraphAddNode(&nodes[3], graph, &nodes[1], NULL, 2, &kParams);
The example above shows four kernel nodes with dependencies between them to illustrate the creation of a very simple graph. In a typical user application there would also need to be nodes added for memory operations, such as cudaGraphAddMemcpyNode() and the like. For a full reference of all graph API functions to add nodes, see the CUDA Runtime API documentation.
4.2.2.1.2 Stream Capture

Stream capture provides a mechanism to create a graph from existing stream-based APIs. A section of code which launches work into streams, including existing code, can be bracketed with calls to cudaStreamBeginCapture() and cudaStreamEndCapture(). See below.

cudaGraph_t graph;

cudaStreamBeginCapture(stream);

kernel_A<<< ..., stream >>>(...);
kernel_B<<< ..., stream >>>(...);
libraryCall(stream);
kernel_C<<< ..., stream >>>(...);

cudaStreamEndCapture(stream, &graph);
A call to cudaStreamBeginCapture() places a stream in capture mode. When a stream is being captured, work launched into the stream is not enqueued for execution. It is instead appended to an internal graph that is progressively being built up. This graph is then returned by calling cudaStreamEndCapture(), which also ends capture mode for the stream. A graph which is actively being constructed by stream capture is referred to as a capture graph.

Stream capture can be used on any CUDA stream except cudaStreamLegacy (the "NULL stream"). Note that it can be used on cudaStreamPerThread. If a program is using the legacy stream, it may be possible to redefine stream 0 to be the per-thread stream with no functional change. See Blocking and non-blocking streams and the default stream.

Whether a stream is being captured can be queried with cudaStreamIsCapturing().

Work can be captured to an existing graph using cudaStreamBeginCaptureToGraph(). Instead of capturing to an internal graph, work is captured to a graph provided by the user.
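A minimal sketch of capturing into a user-provided graph (error checking omitted; the argument order shown follows the cudaStreamBeginCaptureToGraph() signature introduced in CUDA 12.3):

```cuda
// Hedged sketch: capture stream work into a caller-owned graph.
cudaGraph_t userGraph;
cudaGraphCreate(&userGraph, 0);

// No initial dependencies or edge data; global capture mode.
cudaStreamBeginCaptureToGraph(stream, userGraph, NULL, NULL, 0,
                              cudaStreamCaptureModeGlobal);
kernel_A<<< ..., stream >>>(...);
// EndCapture returns the same caller-owned graph.
cudaStreamEndCapture(stream, &userGraph);
```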
4.2.2.1.2.1 Cross-stream Dependencies and Events

Stream capture can handle cross-stream dependencies expressed with cudaEventRecord() and cudaStreamWaitEvent(), provided the event being waited upon was recorded into the same capture graph.

When an event is recorded in a stream that is in capture mode, it results in a captured event. A captured event represents a set of nodes in a capture graph.

When a captured event is waited on by a stream, it places the stream in capture mode if it is not already, and the next item in the stream will have additional dependencies on the nodes in the captured event. The two streams are then being captured to the same capture graph.

When cross-stream dependencies are present in stream capture, cudaStreamEndCapture() must still be called in the same stream where cudaStreamBeginCapture() was called; this is the origin stream. Any other streams which are being captured to the same capture graph, due to event-based dependencies, must also be joined back to the origin stream. This is illustrated below. All streams being captured to the same capture graph are taken out of capture mode upon cudaStreamEndCapture(). Failure to rejoin to the origin stream will result in failure of the overall capture operation.
// stream1 is the origin stream
cudaStreamBeginCapture(stream1);

kernel_A<<< ..., stream1 >>>(...);

// Fork into stream2
cudaEventRecord(event1, stream1);
cudaStreamWaitEvent(stream2, event1);

kernel_B<<< ..., stream1 >>>(...);
kernel_C<<< ..., stream2 >>>(...);

// Join stream2 back to origin stream (stream1)
cudaEventRecord(event2, stream2);
cudaStreamWaitEvent(stream1, event2);

kernel_D<<< ..., stream1 >>>(...);

// End capture in the origin stream
cudaStreamEndCapture(stream1, &graph);

// stream1 and stream2 no longer in capture mode
The graph returned by the above code is shown in Figure 22.

Note

When a stream is taken out of capture mode, the next non-captured item in the stream (if any) will still have a dependency on the most recent prior non-captured item, despite intermediate items having been removed.
4.2.2.1.2.2 Prohibited and Unhandled Operations

It is invalid to synchronize or query the execution status of a stream which is being captured or a captured event, because they do not represent items scheduled for execution. It is also invalid to query the execution status of or synchronize a broader handle which encompasses an active stream capture, such as a device or context handle when any associated stream is in capture mode.

When any stream in the same context is being captured, and it was not created with cudaStreamNonBlocking, any attempted use of the legacy stream is invalid. This is because the legacy stream handle at all times encompasses these other streams; enqueueing to the legacy stream would create a dependency on the streams being captured, and querying it or synchronizing it would query or synchronize the streams being captured.

It is therefore also invalid to call synchronous APIs in this case. One example of a synchronous API is cudaMemcpy(), which enqueues work to the legacy stream and synchronizes on it before returning.

Note

As a general rule, when a dependency relation would connect something that is captured with something that was not captured and instead enqueued for execution, CUDA prefers to return an error rather than ignore the dependency. An exception is made for placing a stream into or out of capture mode; this severs a dependency relation between items added to the stream immediately before and after the mode transition.

It is invalid to merge two separate capture graphs by waiting on a captured event from a stream which is being captured and is associated with a different capture graph than the event. It is invalid to wait on a non-captured event from a stream which is being captured without specifying the cudaEventWaitExternal flag.

A small number of APIs that enqueue asynchronous operations into streams are not currently supported in graphs and will return an error if called with a stream which is being captured, such as cudaStreamAttachMemAsync().
4.2.2.1.2.3 Invalidation

When an invalid operation is attempted during stream capture, any associated capture graphs are invalidated. When a capture graph is invalidated, further use of any streams which are being captured or captured events associated with the graph is invalid and will return an error, until stream capture is ended with cudaStreamEndCapture(). This call will take the associated streams out of capture mode, but will also return an error value and a NULL graph.
4.2.2.1.2.4 Capture Introspection

Active stream capture operations can be inspected using cudaStreamGetCaptureInfo(). This allows the user to obtain the status of the capture, a unique (per-process) ID for the capture, the underlying graph object, and dependencies/edge data for the next node to be captured in the stream. This dependency information can be used to obtain a handle to the node(s) which were last captured in the stream.
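A minimal introspection sketch (error checking omitted; the six-argument form shown here is the one available in CUDA 12 and later, promoted from cudaStreamGetCaptureInfo_v2):

```cuda
// Hedged sketch: inspect an active capture on a stream.
cudaStreamCaptureStatus status;
unsigned long long captureId;
cudaGraph_t captureGraph;
const cudaGraphNode_t *deps;
size_t numDeps;

cudaStreamGetCaptureInfo(stream, &status, &captureId,
                         &captureGraph, &deps, &numDeps);

if (status == cudaStreamCaptureStatusActive) {
    // deps[0..numDeps) are the nodes the next captured item will depend
    // on, that is, the most recently captured node(s) in this stream.
}
```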
4.2.2.1.3 Putting It All Together

The example in Figure 22 is a simplistic example intended to show a small graph conceptually. In an application that utilizes CUDA graphs, there is more complexity to using either the graph API or stream capture. The following code snippet shows a side by side comparison of the graph API and stream capture to create a CUDA graph to execute a simple two stage reduction algorithm.

Figure 23 is an illustration of this CUDA graph and was generated using the cudaGraphDebugDotPrint function applied to the code below, with small adjustments for readability, and then rendered using Graphviz.
Figure 23: CUDA graph example using two stage reduction kernel

Graph API
void cudaGraphsManual(float *inputVec_h,
                      float *inputVec_d,
                      double *outputVec_d,
                      double *result_d,
                      size_t inputSize,
                      size_t numOfBlocks)
{
    cudaStream_t streamForGraph;
    cudaGraph_t graph;
    std::vector<cudaGraphNode_t> nodeDependencies;
    cudaGraphNode_t memcpyNode, kernelNode, memsetNode;
    double result_h = 0.0;

    cudaStreamCreate(&streamForGraph);

    cudaKernelNodeParams kernelNodeParams = {0};
    cudaMemcpy3DParms memcpyParams = {0};
    cudaMemsetParams memsetParams = {0};

    memcpyParams.srcArray = NULL;
    memcpyParams.srcPos = make_cudaPos(0, 0, 0);
    memcpyParams.srcPtr = make_cudaPitchedPtr(inputVec_h, sizeof(float) * inputSize, inputSize, 1);
    memcpyParams.dstArray = NULL;
    memcpyParams.dstPos = make_cudaPos(0, 0, 0);
    memcpyParams.dstPtr = make_cudaPitchedPtr(inputVec_d, sizeof(float) * inputSize, inputSize, 1);
    memcpyParams.extent = make_cudaExtent(sizeof(float) * inputSize, 1, 1);
    memcpyParams.kind = cudaMemcpyHostToDevice;

    memsetParams.dst = (void *)outputVec_d;
    memsetParams.value = 0;
    memsetParams.pitch = 0;
    memsetParams.elementSize = sizeof(float);  // elementSize can be max 4 bytes
    memsetParams.width = numOfBlocks * 2;
    memsetParams.height = 1;

    cudaGraphCreate(&graph, 0);
    cudaGraphAddMemcpyNode(&memcpyNode, graph, NULL, 0, &memcpyParams);
    cudaGraphAddMemsetNode(&memsetNode, graph, NULL, 0, &memsetParams);

    nodeDependencies.push_back(memsetNode);
    nodeDependencies.push_back(memcpyNode);

    void *kernelArgs[4] = {(void *)&inputVec_d, (void *)&outputVec_d, &inputSize, &numOfBlocks};
    kernelNodeParams.func = (void *)reduce;
    kernelNodeParams.gridDim = dim3(numOfBlocks, 1, 1);
    kernelNodeParams.blockDim = dim3(THREADS_PER_BLOCK, 1, 1);
    kernelNodeParams.sharedMemBytes = 0;
    kernelNodeParams.kernelParams = (void **)kernelArgs;
    kernelNodeParams.extra = NULL;

    cudaGraphAddKernelNode(
        &kernelNode, graph, nodeDependencies.data(), nodeDependencies.size(), &kernelNodeParams);

    nodeDependencies.clear();
    nodeDependencies.push_back(kernelNode);

    memset(&memsetParams, 0, sizeof(memsetParams));
    memsetParams.dst = result_d;
    memsetParams.value = 0;
    memsetParams.elementSize = sizeof(float);
    memsetParams.width = 2;
    memsetParams.height = 1;
    cudaGraphAddMemsetNode(&memsetNode, graph, NULL, 0, &memsetParams);

    nodeDependencies.push_back(memsetNode);

    memset(&kernelNodeParams, 0, sizeof(kernelNodeParams));
    kernelNodeParams.func = (void *)reduceFinal;
    kernelNodeParams.gridDim = dim3(1, 1, 1);
    kernelNodeParams.blockDim = dim3(THREADS_PER_BLOCK, 1, 1);
    kernelNodeParams.sharedMemBytes = 0;
    void *kernelArgs2[3] = {(void *)&outputVec_d, (void *)&result_d, &numOfBlocks};
    kernelNodeParams.kernelParams = kernelArgs2;
    kernelNodeParams.extra = NULL;

    cudaGraphAddKernelNode(
        &kernelNode, graph, nodeDependencies.data(), nodeDependencies.size(), &kernelNodeParams);

    nodeDependencies.clear();
    nodeDependencies.push_back(kernelNode);

    memset(&memcpyParams, 0, sizeof(memcpyParams));
    memcpyParams.srcArray = NULL;
    memcpyParams.srcPos = make_cudaPos(0, 0, 0);
    memcpyParams.srcPtr = make_cudaPitchedPtr(result_d, sizeof(double), 1, 1);
    memcpyParams.dstArray = NULL;
    memcpyParams.dstPos = make_cudaPos(0, 0, 0);
    memcpyParams.dstPtr = make_cudaPitchedPtr(&result_h, sizeof(double), 1, 1);
    memcpyParams.extent = make_cudaExtent(sizeof(double), 1, 1);
    memcpyParams.kind = cudaMemcpyDeviceToHost;
    cudaGraphAddMemcpyNode(&memcpyNode, graph, nodeDependencies.data(), nodeDependencies.size(), &memcpyParams);

    nodeDependencies.clear();
    nodeDependencies.push_back(memcpyNode);

    cudaGraphNode_t hostNode;
    cudaHostNodeParams hostParams = {0};
    hostParams.fn = myHostNodeCallback;
    callBackData_t hostFnData;
    hostFnData.data = &result_h;
    hostFnData.fn_name = "cudaGraphsManual";
    hostParams.userData = &hostFnData;
    cudaGraphAddHostNode(&hostNode, graph, nodeDependencies.data(), nodeDependencies.size(), &hostParams);
}
Stream Capture
void cudaGraphsUsingStreamCapture(float *inputVec_h,
                                  float *inputVec_d,
                                  double *outputVec_d,
                                  double *result_d,
                                  size_t inputSize,
                                  size_t numOfBlocks)
{
    cudaStream_t stream1, stream2, stream3, streamForGraph;
    cudaEvent_t forkStreamEvent, memsetEvent1, memsetEvent2;
    cudaGraph_t graph;
    double result_h = 0.0;

    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    cudaStreamCreate(&stream3);
    cudaStreamCreate(&streamForGraph);

    cudaEventCreate(&forkStreamEvent);
    cudaEventCreate(&memsetEvent1);
    cudaEventCreate(&memsetEvent2);

    cudaStreamBeginCapture(stream1, cudaStreamCaptureModeGlobal);

    cudaEventRecord(forkStreamEvent, stream1);
    cudaStreamWaitEvent(stream2, forkStreamEvent, 0);
    cudaStreamWaitEvent(stream3, forkStreamEvent, 0);

    cudaMemcpyAsync(inputVec_d, inputVec_h, sizeof(float) * inputSize, cudaMemcpyDefault, stream1);

    cudaMemsetAsync(outputVec_d, 0, sizeof(double) * numOfBlocks, stream2);
    cudaEventRecord(memsetEvent1, stream2);

    cudaMemsetAsync(result_d, 0, sizeof(double), stream3);
    cudaEventRecord(memsetEvent2, stream3);

    cudaStreamWaitEvent(stream1, memsetEvent1, 0);
    reduce<<<numOfBlocks, THREADS_PER_BLOCK, 0, stream1>>>(inputVec_d, outputVec_d, inputSize, numOfBlocks);

    cudaStreamWaitEvent(stream1, memsetEvent2, 0);
    reduceFinal<<<1, THREADS_PER_BLOCK, 0, stream1>>>(outputVec_d, result_d, numOfBlocks);
    cudaMemcpyAsync(&result_h, result_d, sizeof(double), cudaMemcpyDefault, stream1);

    callBackData_t hostFnData = {0};
    hostFnData.data = &result_h;
    hostFnData.fn_name = "cudaGraphsUsingStreamCapture";
    cudaHostFn_t fn = myHostNodeCallback;
    cudaLaunchHostFunc(stream1, fn, &hostFnData);

    cudaStreamEndCapture(stream1, &graph);
}
4.2.2.2 Graph Instantiation

Once a graph has been created, either by the use of the graph API or stream capture, the graph must be instantiated to create an executable graph, which can then be launched. Assuming the cudaGraph_t graph has been created successfully, the following code will instantiate the graph and create the executable graph cudaGraphExec_t graphExec:

cudaGraphExec_t graphExec;
cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
4.2.2.3 Graph Execution

After a graph has been created and instantiated to create an executable graph, it can be launched. Assuming the cudaGraphExec_t graphExec has been created successfully, the following code snippet will launch the graph into the specified stream:

cudaGraphLaunch(graphExec, stream);

Pulling it all together and using the stream capture example from Section 4.2.2.1.2, the following code snippet will create a graph, instantiate it, and launch it:

cudaGraph_t graph;

cudaStreamBeginCapture(stream);

kernel_A<<< ..., stream >>>(...);
kernel_B<<< ..., stream >>>(...);
libraryCall(stream);
kernel_C<<< ..., stream >>>(...);

cudaStreamEndCapture(stream, &graph);

cudaGraphExec_t graphExec;
cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);

cudaGraphLaunch(graphExec, stream);
| 4.2.3. | Updating | | Instantiated | Graphs | | | |
Whenaworkflowchanges,thegraphbecomesoutofdateandmustbemodified. Majorchangesto
graph structure (such as topology or node types) require re-instantiation because topology-related
optimizations must be re-applied. However, it is common for only node parameters (such as kernel
parametersandmemoryaddresses)tochangewhilethegraphtopologyremainsthesame. Forthis
case, CUDA provides a lightweight “Graph Update” mechanism that allows certain node parameters
to be modified in-place without rebuilding the entire graph, which is much more efficient than re-
instantiation.
172    Chapter 4. CUDA Features

CUDA Programming Guide, Release 13.1
Updates take effect the next time the graph is launched, so they do not impact previous graph launches, even if they are running at the time of the update. A graph may be updated and relaunched repeatedly, so multiple updates/launches can be queued on a stream.

CUDA provides two mechanisms for updating instantiated graph parameters, whole graph update and individual node update. Whole graph update allows the user to supply a topologically identical cudaGraph_t object whose nodes contain updated parameters. Individual node update allows the user to explicitly update the parameters of individual nodes. Using an updated cudaGraph_t is more convenient when a large number of nodes are being updated, or when the graph topology is unknown to the caller (i.e., the graph resulted from stream capture of a library call). Using individual node update is preferred when the number of changes is small and the user has the handles to the nodes requiring updates. Individual node update skips the topology checks and comparisons for unchanged nodes, so it can be more efficient in many cases.

CUDA also provides a mechanism for enabling and disabling individual nodes without affecting their current parameters.

The following sections explain each approach in more detail.
4.2.3.1 Whole Graph Update

cudaGraphExecUpdate() allows an instantiated graph (the “original graph”) to be updated with the parameters from a topologically identical graph (the “updating” graph). The topology of the updating graph must be identical to the original graph used to instantiate the cudaGraphExec_t. In addition, the order in which the dependencies are specified must match. Finally, CUDA needs to consistently order the sink nodes (nodes with no dependent nodes). CUDA relies on the order of specific API calls to achieve consistent sink node ordering.

More explicitly, following the rules below will cause cudaGraphExecUpdate() to pair the nodes in the original graph and the updating graph deterministically:

1. For any capturing stream, the API calls operating on that stream must be made in the same order, including event wait and other API calls not directly corresponding to node creation.
2. The API calls which directly manipulate a given graph node’s incoming edges (including captured stream APIs, node add APIs, and edge addition/removal APIs) must be made in the same order. Moreover, when dependencies are specified in arrays to these APIs, the order in which the dependencies are specified inside those arrays must match.
3. Sink nodes must be consistently ordered. Sink nodes are nodes without dependent nodes/outgoing edges in the final graph at the time of the cudaGraphExecUpdate() invocation. The following operations affect sink node ordering (if present) and must (as a combined set) be made in the same order:

▶ Node add APIs resulting in a sink node.
▶ Edge removal resulting in a node becoming a sink node.
▶ cudaStreamUpdateCaptureDependencies(), if it removes a sink node from a capturing stream’s dependency set.
▶ cudaStreamEndCapture().

The following example shows how the API could be used to update an instantiated graph:
cudaGraphExec_t graphExec = NULL;

for (int i = 0; i < 10; i++) {
    cudaGraph_t graph;
    cudaGraphExecUpdateResult updateResult;
    cudaGraphNode_t errorNode;

    // In this example we use stream capture to create the graph.
    // You can also use the Graph API to produce a graph.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

    // Call a user-defined, stream based workload, for example
    do_cuda_work(stream);

    cudaStreamEndCapture(stream, &graph);

    // If we've already instantiated the graph, try to update it directly
    // and avoid the instantiation overhead
    if (graphExec != NULL) {
        // If the graph fails to update, errorNode will be set to the
        // node causing the failure and updateResult will be set to a
        // reason code.
        cudaGraphExecUpdate(graphExec, graph, &errorNode, &updateResult);
    }

    // Instantiate during the first iteration or whenever the update
    // fails for any reason
    if (graphExec == NULL || updateResult != cudaGraphExecUpdateSuccess) {
        // If a previous update failed, destroy the cudaGraphExec_t
        // before re-instantiating it
        if (graphExec != NULL) {
            cudaGraphExecDestroy(graphExec);
        }

        // Instantiate graphExec from graph. The error node and
        // error message parameters are unused here.
        cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
    }

    cudaGraphDestroy(graph);
    cudaGraphLaunch(graphExec, stream);
    cudaStreamSynchronize(stream);
}
A typical workflow is to create the initial cudaGraph_t using either the stream capture or graph API. The cudaGraph_t is then instantiated and launched as normal. After the initial launch, a new cudaGraph_t is created using the same method as the initial graph and cudaGraphExecUpdate() is called. If the graph update is successful, indicated by the updateResult parameter in the above example, the updated cudaGraphExec_t is launched. If the update fails for any reason, cudaGraphExecDestroy() and cudaGraphInstantiate() are called to destroy the original cudaGraphExec_t and instantiate a new one.

It is also possible to update the cudaGraph_t nodes directly (i.e., using cudaGraphKernelNodeSetParams()) and subsequently update the cudaGraphExec_t; however, it is more efficient to use the explicit node update APIs covered in the next section.

Conditional handle flags and default values are updated as part of the graph update.

Please see the Graph API for more information on usage and current limitations.
4.2.3.2 IndividualNodeUpdate
Instantiatedgraphnodeparameterscanbeupdateddirectly. Thiseliminatestheoverheadofinstanti-
ationaswellastheoverheadofcreatinganewcudaGraph_t. Ifthenumberofnodesrequiringupdate
issmallrelativetothetotalnumberofnodesinthegraph,itisbettertoupdatethenodesindividually.
ThefollowingmethodsareavailableforupdatingcudaGraphExec_tnodes:
Table 8: Individual Node Update APIs

API                                                     Node Type
cudaGraphExecKernelNodeSetParams()                      Kernel node
cudaGraphExecMemcpyNodeSetParams()                      Memory copy node
cudaGraphExecMemsetNodeSetParams()                      Memory set node
cudaGraphExecHostNodeSetParams()                        Host node
cudaGraphExecChildGraphNodeSetParams()                  Child graph node
cudaGraphExecEventRecordNodeSetEvent()                  Event record node
cudaGraphExecEventWaitNodeSetEvent()                    Event wait node
cudaGraphExecExternalSemaphoresSignalNodeSetParams()    External semaphore signal node
cudaGraphExecExternalSemaphoresWaitNodeSetParams()      External semaphore wait node
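As a sketch of the individual-update path, the snippet below swaps the arguments of one kernel node in an instantiated graph using cudaGraphExecKernelNodeSetParams(). The kernel, node handle, buffer, and launch dimensions here are illustrative assumptions, not part of the guide's example; the function pointer must remain the one the node was created with.

```cuda
__global__ void myKernel(float *data, int n);  // hypothetical kernel

// Hypothetical sketch: update one kernel node's arguments in place.
// Assumes graphExec was instantiated from a graph in which kernelNode
// was added as a kernel node launching myKernel.
void updateKernelArg(cudaGraphExec_t graphExec, cudaGraphNode_t kernelNode,
                     float *newBuffer, int n)
{
    void *args[2] = { &newBuffer, &n };

    cudaKernelNodeParams params = {};
    params.func = (void *)myKernel;   // same function as before (required)
    params.gridDim = dim3(32, 1, 1);  // illustrative launch configuration
    params.blockDim = dim3(256, 1, 1);
    params.sharedMemBytes = 0;
    params.kernelParams = args;

    // The new parameters take effect on the next launch of graphExec
    cudaGraphExecKernelNodeSetParams(graphExec, kernelNode, &params);
}
```

The instantiated graph can then be relaunched with cudaGraphLaunch() without any re-instantiation.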
Please see the Graph API for more information on usage and current limitations.
4.2.3.3 Individual Node Enable

Kernel, memset and memcpy nodes in an instantiated graph can be enabled or disabled using the cudaGraphNodeSetEnabled() API. This allows the creation of a graph which contains a superset of the desired functionality which can be customized for each launch. The enable state of a node can be queried using the cudaGraphNodeGetEnabled() API.

A disabled node is functionally equivalent to an empty node until it is re-enabled. Node parameters are not affected by enabling/disabling a node. Enable state is unaffected by individual node update or whole graph update with cudaGraphExecUpdate(). Parameter updates while the node is disabled will take effect when the node is re-enabled.
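For example, a graph can carry an optional instrumentation kernel that is toggled per launch. This is a sketch; the node name and debug flag are illustrative assumptions, and graphExec is presumed to have been instantiated from a graph containing debugNode as a kernel node.

```cuda
// Hypothetical sketch: enable or disable an optional kernel node per launch.
void launchMaybeDebug(cudaGraphExec_t graphExec, cudaGraphNode_t debugNode,
                      bool debug, cudaStream_t stream)
{
    // 0 disables the node (it behaves like an empty node); 1 enables it
    cudaGraphNodeSetEnabled(graphExec, debugNode, debug ? 1 : 0);

    // The enable state can be queried back if needed
    unsigned int isEnabled;
    cudaGraphNodeGetEnabled(graphExec, debugNode, &isEnabled);

    cudaGraphLaunch(graphExec, stream);
}
```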
Refer to the Graph API for more information on usage and current limitations.
4.2.3.4 Graph Update Limitations

Kernel nodes:
▶ The owning context of the function cannot change.
▶ A node whose function originally did not use CUDA dynamic parallelism cannot be updated to a function which uses CUDA dynamic parallelism.

cudaMemset and cudaMemcpy nodes:
▶ The CUDA device(s) to which the operand(s) was allocated/mapped cannot change.
▶ The source/destination memory must be allocated from the same context as the original source/destination memory.
▶ Only 1D cudaMemset/cudaMemcpy nodes can be changed.

Additional memcpy node restrictions:
▶ Changing either the source or destination memory type (i.e., cudaPitchedPtr, cudaArray_t, etc.), or the type of transfer (i.e., cudaMemcpyKind) is not supported.

External semaphore wait nodes and record nodes:
▶ Changing the number of semaphores is not supported.

Conditional nodes:
▶ The order of handle creation and assignment must match between the graphs.
▶ Changing node parameters is not supported (i.e., number of graphs in the conditional, node context, etc.).
▶ Changing parameters of nodes within the conditional body graph is subject to the rules above.

Memory nodes:
▶ It is not possible to update a cudaGraphExec_t with a cudaGraph_t if the cudaGraph_t is currently instantiated as a different cudaGraphExec_t.

There are no restrictions on updates to host nodes, event record nodes, or event wait nodes.
4.2.4. Conditional Graph Nodes

Conditional nodes allow conditional execution and looping of a graph contained within the conditional node. This allows dynamic and iterative workflows to be represented completely within a graph and frees up the host CPU to perform other work in parallel.

Evaluation of the condition value is performed on the device when the dependencies of the conditional node have been met. Conditional nodes can be one of the following types:

▶ Conditional IF nodes execute their body graph once if the condition value is non-zero when the node is executed. An optional second body graph can be provided, and this will be executed once if the condition value is zero when the node is executed.
▶ Conditional WHILE nodes execute their body graph if the condition value is non-zero when the node is executed and will continue to execute their body graph until the condition value is zero.
▶ Conditional SWITCH nodes execute the zero-indexed nth body graph once if the condition value is equal to n. If the condition value does not correspond to a body graph, no body graph is launched.

A condition value is accessed by a conditional handle, which must be created before the node. The condition value can be set by device code using cudaGraphSetConditional(). A default value, applied on each graph launch, can also be specified when the handle is created.

When the conditional node is created, an empty graph is created and the handle is returned to the user so that the graph can be populated. This conditional body graph can be populated using either the graph APIs or cudaStreamBeginCaptureToGraph().

Conditional nodes can be nested.
4.2.4.1 Conditional Handles

A condition value is represented by cudaGraphConditionalHandle and is created by cudaGraphConditionalHandleCreate().

The handle must be associated with a single conditional node. Handles cannot be destroyed and as such there is no need to keep track of them.

If cudaGraphCondAssignDefault is specified when the handle is created, the condition value will be initialized to the specified default at the beginning of each graph execution. If this flag is not provided, the condition value is undefined at the start of each graph execution and code should not assume that the condition value persists across executions.
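The two creation modes can be sketched as follows, assuming graph is an existing cudaGraph_t (the call with a default value mirrors the WHILE example later in this section):

```cuda
cudaGraphConditionalHandle h1, h2;

// Undefined value at the start of each execution; device code must set it
// via cudaGraphSetConditional() before the conditional node runs
cudaGraphConditionalHandleCreate(&h1, graph);

// Initialized to 1 at the start of each graph execution
cudaGraphConditionalHandleCreate(&h2, graph, 1, cudaGraphCondAssignDefault);
```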
The default value and flags associated with a handle will be updated during whole graph update.
4.2.4.2 ConditionalNodeBodyGraphRequirements
Generalrequirements:
▶ Thegraph’snodesmustallresideonasingledevice.
▶ The graph can only contain kernel nodes, empty nodes, memcpy nodes, memset nodes, child
graphnodes,andconditionalnodes.
Kernelnodes:
▶ UseofCUDADynamicParallelismorDeviceGraphLaunchbykernelsinthegraphisnotpermitted.
▶ CooperativelaunchesarepermittedsolongasMPSisnotinuse.
Memcpy/Memsetnodes:
▶ Onlycopies/memsetsinvolvingdevicememoryand/orpinneddevice-mappedhostmemoryare
permitted.
▶ Copies/memsetsinvolvingCUDAarraysarenotpermitted.
▶ Both operands must be accessible from the current device at time of instantiation. Note that
the copy operation will be performed from the device on which the graph resides, even if it is
targetingmemoryonanotherdevice.
4.2.4.3 Conditional IF Nodes

The body graph of an IF node will be executed once if the condition is non-zero when the node is executed. The following diagram depicts a 3 node graph where the middle node, B, is a conditional node:

Figure 24: Conditional IF Node

The following code illustrates the creation of a graph containing an IF conditional node. The value of the condition is set using an upstream kernel. The body of the conditional is populated using the graph API.
__global__ void setHandle(cudaGraphConditionalHandle handle, int value)
{
    ...
    // Set the condition value to the value passed to the kernel
    cudaGraphSetConditional(handle, value);
    ...
}

void graphSetup() {
    cudaGraph_t graph;
    cudaGraphExec_t graphExec;
    cudaGraphNode_t node;
    void *kernelArgs[2];
    int value = 1;

    // Create the graph
    cudaGraphCreate(&graph, 0);

    // Create the conditional handle; because no default value is provided,
    // the condition value is undefined at the start of each graph execution
    cudaGraphConditionalHandle handle;
    cudaGraphConditionalHandleCreate(&handle, graph);

    // Use a kernel upstream of the conditional to set the handle value
    cudaGraphNodeParams params = { cudaGraphNodeTypeKernel };
    params.kernel.func = (void *)setHandle;
    params.kernel.gridDim.x = params.kernel.gridDim.y = params.kernel.gridDim.z = 1;
    params.kernel.blockDim.x = params.kernel.blockDim.y = params.kernel.blockDim.z = 1;
    params.kernel.kernelParams = kernelArgs;
    kernelArgs[0] = &handle;
    kernelArgs[1] = &value;
    cudaGraphAddNode(&node, graph, NULL, 0, &params);

    // Create and add the conditional node
    cudaGraphNodeParams cParams = { cudaGraphNodeTypeConditional };
    cParams.conditional.handle = handle;
    cParams.conditional.type   = cudaGraphCondTypeIf;
    cParams.conditional.size   = 1; // There is only an "if" body graph
    cudaGraphAddNode(&node, graph, &node, 1, &cParams);

    // Get the body graph of the conditional node
    cudaGraph_t bodyGraph = cParams.conditional.phGraph_out[0];

    // Populate the body graph of the IF conditional node
    ...
    cudaGraphAddNode(&node, bodyGraph, NULL, 0, &params);

    // Instantiate and launch the graph
    cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
    cudaGraphLaunch(graphExec, 0);
    cudaDeviceSynchronize();

    // Clean up
    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
}
IF nodes can also have an optional second body graph which is executed once when the node is executed if the condition value is zero.
void graphSetup() {
    cudaGraph_t graph;
    cudaGraphExec_t graphExec;
    cudaGraphNode_t node;
    void *kernelArgs[2];
    int value = 1;

    // Create the graph
    cudaGraphCreate(&graph, 0);

    // Create the conditional handle; because no default value is provided,
    // the condition value is undefined at the start of each graph execution
    cudaGraphConditionalHandle handle;
    cudaGraphConditionalHandleCreate(&handle, graph);

    // Use a kernel upstream of the conditional to set the handle value
    cudaGraphNodeParams params = { cudaGraphNodeTypeKernel };
    params.kernel.func = (void *)setHandle;
    params.kernel.gridDim.x = params.kernel.gridDim.y = params.kernel.gridDim.z = 1;
    params.kernel.blockDim.x = params.kernel.blockDim.y = params.kernel.blockDim.z = 1;
    params.kernel.kernelParams = kernelArgs;
    kernelArgs[0] = &handle;
    kernelArgs[1] = &value;
    cudaGraphAddNode(&node, graph, NULL, 0, &params);

    // Create and add the IF conditional node
    cudaGraphNodeParams cParams = { cudaGraphNodeTypeConditional };
    cParams.conditional.handle = handle;
    cParams.conditional.type   = cudaGraphCondTypeIf;
    cParams.conditional.size   = 2; // There is both an "if" and an "else" body graph
    cudaGraphAddNode(&node, graph, &node, 1, &cParams);

    // Get the body graphs of the conditional node
    cudaGraph_t ifBodyGraph   = cParams.conditional.phGraph_out[0];
    cudaGraph_t elseBodyGraph = cParams.conditional.phGraph_out[1];

    // Populate the body graphs of the IF conditional node
    ...
    cudaGraphAddNode(&node, ifBodyGraph, NULL, 0, &params);
    ...
    cudaGraphAddNode(&node, elseBodyGraph, NULL, 0, &params);

    // Instantiate and launch the graph
    cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
    cudaGraphLaunch(graphExec, 0);
    cudaDeviceSynchronize();

    // Clean up
    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
}
4.2.4.4 Conditional WHILE Nodes

The body graph of a WHILE node will be executed as long as the condition is non-zero. The condition will be evaluated when the node is executed and after completion of the body graph. The following diagram depicts a 3 node graph where the middle node, B, is a conditional node:

Figure 25: Conditional WHILE Node

The following code illustrates the creation of a graph containing a WHILE conditional node. The handle is created using cudaGraphCondAssignDefault to avoid the need for an upstream kernel. The body of the conditional is populated using the graph API.
__global__ void loopKernel(cudaGraphConditionalHandle handle, char *dPtr)
{
    // Decrement the value of dPtr and set the condition value to 0 once dPtr is 0
    if (--(*dPtr) == 0) {
        cudaGraphSetConditional(handle, 0);
    }
}

void graphSetup() {
    cudaGraph_t graph;
    cudaGraphExec_t graphExec;
    cudaGraphNode_t node;
    void *kernelArgs[2];

    // Allocate a byte of device memory to use as input
    char *dPtr;
    cudaMalloc((void **)&dPtr, 1);

    // Create the graph
    cudaGraphCreate(&graph, 0);

    // Create the conditional handle with a default value of 1
    cudaGraphConditionalHandle handle;
    cudaGraphConditionalHandleCreate(&handle, graph, 1, cudaGraphCondAssignDefault);

    // Create and add the WHILE conditional node
    cudaGraphNodeParams cParams = { cudaGraphNodeTypeConditional };
    cParams.conditional.handle = handle;
    cParams.conditional.type   = cudaGraphCondTypeWhile;
    cParams.conditional.size   = 1;
    cudaGraphAddNode(&node, graph, NULL, 0, &cParams);

    // Get the body graph of the conditional node
    cudaGraph_t bodyGraph = cParams.conditional.phGraph_out[0];

    // Populate the body graph of the conditional node
    cudaGraphNodeParams params = { cudaGraphNodeTypeKernel };
    params.kernel.func = (void *)loopKernel;
    params.kernel.gridDim.x = params.kernel.gridDim.y = params.kernel.gridDim.z = 1;
    params.kernel.blockDim.x = params.kernel.blockDim.y = params.kernel.blockDim.z = 1;
    params.kernel.kernelParams = kernelArgs;
    kernelArgs[0] = &handle;
    kernelArgs[1] = &dPtr;
    cudaGraphAddNode(&node, bodyGraph, NULL, 0, &params);

    // Initialize device memory, instantiate, and launch the graph
    cudaMemset(dPtr, 10, 1); // Set dPtr to 10; the loop will run until dPtr is 0
    cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
    cudaGraphLaunch(graphExec, 0);
    cudaDeviceSynchronize();

    // Clean up
    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaFree(dPtr);
}
4.2.4.5 Conditional SWITCH Nodes

The zero-indexed nth body graph of a SWITCH node will be executed once if the condition is equal to n when the node is executed. The following diagram depicts a 3 node graph where the middle node, B, is a conditional node:

Figure 26: Conditional SWITCH Node

The following code illustrates the creation of a graph containing a SWITCH conditional node. The value of the condition is set using an upstream kernel. The bodies of the conditional are populated using the graph API.
__global__ void setHandle(cudaGraphConditionalHandle handle, int value)
{
    ...
    // Set the condition value to the value passed to the kernel
    cudaGraphSetConditional(handle, value);
    ...
}

void graphSetup() {
    cudaGraph_t graph;
    cudaGraphExec_t graphExec;
    cudaGraphNode_t node;
    void *kernelArgs[2];
    int value = 1;

    // Create the graph
    cudaGraphCreate(&graph, 0);

    // Create the conditional handle; because no default value is provided,
    // the condition value is undefined at the start of each graph execution
    cudaGraphConditionalHandle handle;
    cudaGraphConditionalHandleCreate(&handle, graph);

    // Use a kernel upstream of the conditional to set the handle value
    cudaGraphNodeParams params = { cudaGraphNodeTypeKernel };
    params.kernel.func = (void *)setHandle;
    params.kernel.gridDim.x = params.kernel.gridDim.y = params.kernel.gridDim.z = 1;
    params.kernel.blockDim.x = params.kernel.blockDim.y = params.kernel.blockDim.z = 1;
    params.kernel.kernelParams = kernelArgs;
    kernelArgs[0] = &handle;
    kernelArgs[1] = &value;
    cudaGraphAddNode(&node, graph, NULL, 0, &params);

    // Create and add the conditional SWITCH node
    cudaGraphNodeParams cParams = { cudaGraphNodeTypeConditional };
    cParams.conditional.handle = handle;
    cParams.conditional.type   = cudaGraphCondTypeSwitch;
    cParams.conditional.size   = 5;
    cudaGraphAddNode(&node, graph, &node, 1, &cParams);

    // Get the body graphs of the conditional node
    cudaGraph_t *bodyGraphs = cParams.conditional.phGraph_out;

    // Populate the body graphs of the SWITCH conditional node
    ...
    cudaGraphAddNode(&node, bodyGraphs[0], NULL, 0, &params);
    ...
    cudaGraphAddNode(&node, bodyGraphs[4], NULL, 0, &params);

    // Instantiate and launch the graph
    cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
    cudaGraphLaunch(graphExec, 0);
    cudaDeviceSynchronize();

    // Clean up
    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
}
4.2.5. Graph Memory Nodes

4.2.5.1 Introduction

Graph memory nodes allow graphs to create and own memory allocations. Graph memory nodes have GPU ordered lifetime semantics, which dictate when memory is allowed to be accessed on the device. These GPU ordered lifetime semantics enable driver-managed memory reuse, and match those of the stream ordered allocation APIs cudaMallocAsync and cudaFreeAsync, which may be captured when creating a graph.

Graph allocations have fixed addresses over the life of a graph, including repeated instantiations and launches. This allows the memory to be directly referenced by other operations within the graph without the need of a graph update, even when CUDA changes the backing physical memory. Within a graph, allocations whose graph ordered lifetimes do not overlap may use the same underlying physical memory.

CUDA may reuse the same physical memory for allocations across multiple graphs, aliasing virtual address mappings according to the GPU ordered lifetime semantics. For example, when different graphs are launched into the same stream, CUDA may virtually alias the same physical memory to satisfy the needs of allocations which have single-graph lifetimes.
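Because stream ordered allocation calls can be captured, one way to obtain graph memory nodes is to capture cudaMallocAsync/cudaFreeAsync alongside the work that uses the allocation. A minimal sketch, in which the kernel name, buffer size, and launch configuration are illustrative assumptions:

```cuda
__global__ void scale(float *data, int n);  // hypothetical kernel

// Capturing cudaMallocAsync/cudaFreeAsync turns them into graph
// allocation and free nodes with GPU ordered lifetimes.
void buildGraphWithMemNodes(cudaStream_t stream, cudaGraph_t *graph)
{
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

    float *buf;
    cudaMallocAsync((void **)&buf, 1024 * sizeof(float), stream); // -> allocation node
    scale<<<4, 256, 0, stream>>>(buf, 1024);                      // ordered after the allocation
    cudaFreeAsync(buf, stream);                                   // -> free node

    cudaStreamEndCapture(stream, graph);
}
```

The stream ordering of the captured calls becomes the dependency ordering of the resulting allocation, kernel, and free nodes.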
4.2.5.2 API Fundamentals

Graph memory nodes are graph nodes representing either memory allocation or free actions. As a shorthand, nodes that allocate memory are called allocation nodes. Likewise, nodes that free memory are called free nodes. Allocations created by allocation nodes are called graph allocations. CUDA assigns virtual addresses for the graph allocation at node creation time. While these virtual addresses are fixed for the lifetime of the allocation node, the allocation contents are not persistent past the freeing operation and may be overwritten by accesses referring to a different allocation.

Graph allocations are considered recreated every time a graph runs. A graph allocation’s lifetime, which differs from the node’s lifetime, begins when GPU execution reaches the allocating graph node and ends when one of the following occurs:

▶ GPU execution reaches the freeing graph node
▶ GPU execution reaches the freeing cudaFreeAsync() stream call
▶ immediately upon the freeing call to cudaFree()

Note

Graph destruction does not automatically free any live graph-allocated memory, even though it ends the lifetime of the allocation node. The allocation must subsequently be freed in another graph, or using cudaFreeAsync()/cudaFree().
Just like other graph structure, graph memory nodes are ordered within a graph by dependency edges. A program must guarantee that operations accessing graph memory:

▶ are ordered after the allocation node
▶ are ordered before the operation freeing the memory

Graph allocation lifetimes begin and usually end according to GPU execution (as opposed to API invocation). GPU ordering is the order that work runs on the GPU, as opposed to the order that the work is enqueued or described. Thus, graph allocations are considered ‘GPU ordered.’
4.2.5.2.1 Graph Node APIs
Graph memory nodes may be explicitly created with the node creation API, cudaGraphAddNode. The address allocated when adding a cudaGraphNodeTypeMemAlloc node is returned to the user in the alloc::dptr field of the passed cudaGraphNodeParams structure. All operations using graph allocations inside the allocating graph must be ordered after the allocating node. Similarly, any free nodes must be ordered after all uses of the allocation within the graph. Free nodes are created using cudaGraphAddNode and a node type of cudaGraphNodeTypeMemFree.
In the following figure, there is an example graph with an alloc and a free node. Kernel nodes a, b, and c are ordered after the allocation node and before the free node such that the kernels can access the allocation. Kernel node e is not ordered after the alloc node and therefore cannot safely access the memory. Kernel node d is not ordered before the free node, therefore it cannot safely access the memory.

Figure 27: Kernel Nodes

The following code snippet establishes the graph in this figure:

// Create the graph - it starts out empty
cudaGraphCreate(&graph, 0);

// parameters for a basic allocation
cudaGraphNodeParams params = { cudaGraphNodeTypeMemAlloc };
params.alloc.poolProps.allocType = cudaMemAllocationTypePinned;
params.alloc.poolProps.location.type = cudaMemLocationTypeDevice;
// specify device 0 as the resident device
params.alloc.poolProps.location.id = 0;
params.alloc.bytesize = size;
cudaGraphAddNode(&allocNode, graph, NULL, NULL, 0, &params);

// create a kernel node that uses the graph allocation
cudaGraphNodeParams nodeParams = { cudaGraphNodeTypeKernel };
nodeParams.kernel.kernelParams[0] = params.alloc.dptr;
// ...set other kernel node parameters...

// add the kernel nodes to the graph
cudaGraphAddNode(&a, graph, &allocNode, NULL, 1, &nodeParams);
cudaGraphAddNode(&b, graph, &a, NULL, 1, &nodeParams);
cudaGraphAddNode(&c, graph, &a, NULL, 1, &nodeParams);

cudaGraphNode_t dependencies[2];
// kernel nodes b and c are using the graph allocation, so the freeing node
// must depend on them. Since the dependency of node b on node a establishes
// an indirect dependency, the free node does not need to explicitly depend on
// node a.
dependencies[0] = b;
dependencies[1] = c;
cudaGraphNodeParams freeNodeParams = { cudaGraphNodeTypeMemFree };
freeNodeParams.free.dptr = params.alloc.dptr;
cudaGraphAddNode(&freeNode, graph, dependencies, NULL, 2, &freeNodeParams);

// free node does not depend on kernel node d, so it must not access the freed
// graph allocation.
cudaGraphAddNode(&d, graph, &c, NULL, 1, &nodeParams);

// node e does not depend on the allocation node, so it must not access the
// allocation. This would be true even if the freeNode depended on kernel node e.
cudaGraphAddNode(&e, graph, NULL, NULL, 0, &nodeParams);
4.2.5.2.2 Stream Capture
Graph memory nodes can be created by capturing the corresponding stream ordered allocation and free calls cudaMallocAsync and cudaFreeAsync. In this case, the virtual addresses returned by the captured allocation API can be used by other operations inside the graph. Since the stream ordered dependencies will be captured into the graph, the ordering requirements of the stream ordered allocation APIs guarantee that the graph memory nodes will be properly ordered with respect to the captured stream operations (for correctly written stream code).
Ignoring kernel nodes d and e for clarity, the following code snippet shows how to use stream capture to create the graph from the previous figure:
cudaStreamBeginCapture(stream1, cudaStreamCaptureModeGlobal);

cudaMallocAsync(&dptr, size, stream1);
kernel_A<<< ..., stream1 >>>(dptr, ...);

// Fork into stream2
cudaEventRecord(event1, stream1);
cudaStreamWaitEvent(stream2, event1);

kernel_B<<< ..., stream1 >>>(dptr, ...);
// event dependencies translated into graph dependencies, so the kernel node
// created by the capture of kernel C will depend on the allocation node
// created by capturing the cudaMallocAsync call.
kernel_C<<< ..., stream2 >>>(dptr, ...);

// Join stream2 back to origin stream (stream1)
cudaEventRecord(event2, stream2);
cudaStreamWaitEvent(stream1, event2);

// Free depends on all work accessing the memory.
cudaFreeAsync(dptr, stream1);

// End capture in the origin stream
cudaStreamEndCapture(stream1, &graph);
4.2.5.2.3 Accessing and Freeing Graph Memory Outside of the Allocating Graph
Graph allocations do not have to be freed by the allocating graph. When a graph does not free an allocation, that allocation persists beyond the execution of the graph and can be accessed by subsequent CUDA operations. These allocations may be accessed in another graph or directly using a stream operation as long as the accessing operation is ordered after the allocation through CUDA events and other stream ordering mechanisms. An allocation may subsequently be freed by regular calls to cudaFree, cudaFreeAsync, or by the launch of another graph with a corresponding free node, or a subsequent launch of the allocating graph (if it was instantiated with the cudaGraphInstantiateFlagAutoFreeOnLaunch flag). It is illegal to access memory after it has been freed - the free operation must be ordered after all operations accessing the memory using graph dependencies, CUDA events, and other stream ordering mechanisms.
Note
Since graph allocations may share underlying physical memory, free operations must be ordered after all device operations complete. Out-of-band synchronization (such as memory-based synchronization within a compute kernel) is insufficient for ordering between memory writes and free operations. For more information, see the Virtual Aliasing Support rules relating to consistency and coherency.
The three following code snippets demonstrate accessing graph allocations outside of the allocating graph with ordering properly established by: using a single stream, using events between streams, and using events baked into the allocating and freeing graph.
First, ordering established by using a single stream:
// Contents of allocating graph
void *dptr;
cudaGraphNodeParams params = { cudaGraphNodeTypeMemAlloc };
params.alloc.poolProps.allocType = cudaMemAllocationTypePinned;
params.alloc.poolProps.location.type = cudaMemLocationTypeDevice;
params.alloc.bytesize = size;
cudaGraphAddNode(&allocNode, allocGraph, NULL, NULL, 0, &params);
dptr = params.alloc.dptr;

cudaGraphInstantiate(&allocGraphExec, allocGraph, NULL, NULL, 0);

cudaGraphLaunch(allocGraphExec, stream);
kernel<<< ..., stream >>>(dptr, ...);
cudaFreeAsync(dptr, stream);
Second, ordering established by recording and waiting on CUDA events:
// Contents of allocating graph
void *dptr;
cudaGraphAddNode(&allocNode, allocGraph, NULL, NULL, 0, &allocNodeParams);
dptr = allocNodeParams.alloc.dptr;

// contents of consuming/freeing graph
kernelNodeParams.kernel.kernelParams[0] = allocNodeParams.alloc.dptr;
cudaGraphAddNode(&kernelNode, freeGraph, NULL, NULL, 0, &kernelNodeParams);
cudaGraphNodeParams freeNodeParams = { cudaGraphNodeTypeMemFree };
freeNodeParams.free.dptr = dptr;
cudaGraphAddNode(&freeNode, freeGraph, &kernelNode, NULL, 1, &freeNodeParams);

cudaGraphInstantiate(&allocGraphExec, allocGraph, NULL, NULL, 0);
cudaGraphInstantiate(&freeGraphExec, freeGraph, NULL, NULL, 0);

cudaGraphLaunch(allocGraphExec, allocStream);

// establish the dependency of stream2 on the allocation node
// note: the dependency could also have been established with a stream
// synchronize operation
cudaEventRecord(allocEvent, allocStream);
cudaStreamWaitEvent(stream2, allocEvent);
kernel<<< ..., stream2 >>>(dptr, ...);

// establish the dependency between stream3 and the allocation use
cudaEventRecord(streamUseDoneEvent, stream2);
cudaStreamWaitEvent(stream3, streamUseDoneEvent);

// it is now safe to launch the freeing graph, which may also access the memory
cudaGraphLaunch(freeGraphExec, stream3);
Third, ordering established by using graph external event nodes:
// Contents of allocating graph
void *dptr;
cudaEvent_t allocEvent;         // event indicating when the allocation will be ready for use.
cudaEvent_t streamUseDoneEvent; // event indicating when the stream operations are done with the allocation.

// Contents of allocating graph with event record node
cudaGraphAddNode(&allocNode, allocGraph, NULL, NULL, 0, &allocNodeParams);
dptr = allocNodeParams.alloc.dptr;
// note: this event record node depends on the alloc node
cudaGraphNodeParams allocEventNodeParams = { cudaGraphNodeTypeEventRecord };
allocEventNodeParams.eventRecord.event = allocEvent;
cudaGraphAddNode(&recordNode, allocGraph, &allocNode, NULL, 1, &allocEventNodeParams);

cudaGraphInstantiate(&allocGraphExec, allocGraph, NULL, NULL, 0);

// contents of consuming/freeing graph with event wait nodes
cudaGraphNodeParams streamWaitEventNodeParams = { cudaGraphNodeTypeEventWait };
streamWaitEventNodeParams.eventWait.event = streamUseDoneEvent;
cudaGraphAddNode(&streamUseDoneEventNode, waitAndFreeGraph, NULL, NULL, 0, &streamWaitEventNodeParams);

cudaGraphNodeParams allocWaitEventNodeParams = { cudaGraphNodeTypeEventWait };
allocWaitEventNodeParams.eventWait.event = allocEvent;
cudaGraphAddNode(&allocReadyEventNode, waitAndFreeGraph, NULL, NULL, 0, &allocWaitEventNodeParams);

kernelNodeParams.kernel.kernelParams[0] = allocNodeParams.alloc.dptr;
// The allocReadyEventNode provides ordering with the alloc node for use in a
// consuming graph.
cudaGraphAddNode(&kernelNode, waitAndFreeGraph, &allocReadyEventNode, NULL, 1, &kernelNodeParams);

// The free node has to be ordered after both external and internal users.
// Thus the node must depend on both the kernelNode and the
// streamUseDoneEventNode.
dependencies[0] = kernelNode;
dependencies[1] = streamUseDoneEventNode;
cudaGraphNodeParams freeNodeParams = { cudaGraphNodeTypeMemFree };
freeNodeParams.free.dptr = dptr;
cudaGraphAddNode(&freeNode, waitAndFreeGraph, dependencies, NULL, 2, &freeNodeParams);

cudaGraphInstantiate(&waitAndFreeGraphExec, waitAndFreeGraph, NULL, NULL, 0);

cudaGraphLaunch(allocGraphExec, allocStream);

// establishing the dependency of stream2 on the recorded event satisfies the
// ordering requirement
cudaStreamWaitEvent(stream2, allocEvent);
kernel<<< ..., stream2 >>>(dptr, ...);
cudaEventRecord(streamUseDoneEvent, stream2);

// the event wait node in the waitAndFreeGraphExec establishes the dependency
// on the streamUseDoneEvent that is needed to prevent the kernel running in
// stream two from accessing the allocation after the free node in execution
// order.
cudaGraphLaunch(waitAndFreeGraphExec, stream3);
4.2.5.2.4 cudaGraphInstantiateFlagAutoFreeOnLaunch
Under normal circumstances, CUDA will prevent a graph from being relaunched if it has unfreed memory allocations because multiple allocations at the same address will leak memory. Instantiating a graph with the cudaGraphInstantiateFlagAutoFreeOnLaunch flag allows the graph to be relaunched while it still has unfreed allocations. In this case, the launch automatically inserts an asynchronous free of the unfreed allocations.
Auto free on launch is useful for single-producer multiple-consumer algorithms. At each iteration, a producer graph creates several allocations, and, depending on runtime conditions, a varying set of consumers accesses those allocations. This type of variable execution sequence means that consumers cannot free the allocations because a subsequent consumer may require access. Auto free on launch means that the launch loop does not need to track the producer's allocations - instead, that information remains isolated to the producer's creation and destruction logic. In general, auto free on launch simplifies an algorithm which would otherwise need to free all the allocations owned by a graph before each relaunch.
Note
The cudaGraphInstantiateFlagAutoFreeOnLaunch flag does not change the behavior of graph destruction. The application must explicitly free the unfreed memory in order to avoid memory leaks, even for graphs instantiated with the flag.
The following code shows the use of cudaGraphInstantiateFlagAutoFreeOnLaunch to simplify a single-producer/multiple-consumer algorithm:
// Create producer graph which allocates memory and populates it with data
cudaStreamBeginCapture(cudaStreamPerThread, cudaStreamCaptureModeGlobal);
cudaMallocAsync(&data1, blocks * threads, cudaStreamPerThread);
cudaMallocAsync(&data2, blocks * threads, cudaStreamPerThread);
produce<<<blocks, threads, 0, cudaStreamPerThread>>>(data1, data2);
...
cudaStreamEndCapture(cudaStreamPerThread, &graph);
cudaGraphInstantiateWithFlags(&producer,
                              graph,
                              cudaGraphInstantiateFlagAutoFreeOnLaunch);
cudaGraphDestroy(graph);

// Create first consumer graph by capturing an asynchronous library call
cudaStreamBeginCapture(cudaStreamPerThread, cudaStreamCaptureModeGlobal);
consumerFromLibrary(data1, cudaStreamPerThread);
cudaStreamEndCapture(cudaStreamPerThread, &graph);
cudaGraphInstantiateWithFlags(&consumer1, graph, 0); // regular instantiation
cudaGraphDestroy(graph);

// Create second consumer graph
cudaStreamBeginCapture(cudaStreamPerThread, cudaStreamCaptureModeGlobal);
consume2<<<blocks, threads, 0, cudaStreamPerThread>>>(data2);
...
cudaStreamEndCapture(cudaStreamPerThread, &graph);
cudaGraphInstantiateWithFlags(&consumer2, graph, 0);
cudaGraphDestroy(graph);

// Launch in a loop
bool launchConsumer2 = false;
do {
    cudaGraphLaunch(producer, myStream);
    cudaGraphLaunch(consumer1, myStream);
    if (launchConsumer2) {
        cudaGraphLaunch(consumer2, myStream);
    }
} while (determineAction(&launchConsumer2));

cudaFreeAsync(data1, myStream);
cudaFreeAsync(data2, myStream);

cudaGraphExecDestroy(producer);
cudaGraphExecDestroy(consumer1);
cudaGraphExecDestroy(consumer2);
4.2.5.2.5 Memory Nodes in Child Graphs
CUDA 12.9 introduces the ability to move child graph ownership to a parent graph. Child graphs which are moved to the parent are allowed to contain memory allocation and free nodes. This allows a child graph containing allocation or free nodes to be independently constructed prior to its addition in a parent graph.
The following restrictions apply to child graphs after they have been moved:
▶ Cannot be independently instantiated or destroyed.
▶ Cannot be added as a child graph of a separate parent graph.
▶ Cannot be used as an argument to cuGraphExecUpdate.
▶ Cannot have additional memory allocation or free nodes added.
// Create the child graph
cudaGraphCreate(&child, 0);

// parameters for a basic allocation
cudaGraphNodeParams allocNodeParams = { cudaGraphNodeTypeMemAlloc };
allocNodeParams.alloc.poolProps.allocType = cudaMemAllocationTypePinned;
allocNodeParams.alloc.poolProps.location.type = cudaMemLocationTypeDevice;
// specify device 0 as the resident device
allocNodeParams.alloc.poolProps.location.id = 0;
allocNodeParams.alloc.bytesize = size;
cudaGraphAddNode(&allocNode, child, NULL, NULL, 0, &allocNodeParams);

// Additional nodes using the allocation could be added here

cudaGraphNodeParams freeNodeParams = { cudaGraphNodeTypeMemFree };
freeNodeParams.free.dptr = allocNodeParams.alloc.dptr;
cudaGraphAddNode(&freeNode, child, &allocNode, NULL, 1, &freeNodeParams);

// Create the parent graph
cudaGraphCreate(&parent, 0);

// Move the child graph to the parent graph
cudaGraphNodeParams childNodeParams = { cudaGraphNodeTypeGraph };
childNodeParams.graph.graph = child;
childNodeParams.graph.ownership = cudaGraphChildGraphOwnershipMove;
cudaGraphAddNode(&parentNode, parent, NULL, NULL, 0, &childNodeParams);
4.2.5.3 Optimized Memory Reuse
CUDA reuses memory in two ways:
▶ Virtual and physical memory reuse within a graph is based on virtual address assignment, like in the stream ordered allocator.
▶ Physical memory reuse between graphs is done with virtual aliasing: different graphs can map the same physical memory to their unique virtual addresses.
4.2.5.3.1 Address Reuse within a Graph
CUDA may reuse memory within a graph by assigning the same virtual address ranges to different allocations whose lifetimes do not overlap. Since virtual addresses may be reused, pointers to different allocations with disjoint lifetimes are not guaranteed to be unique.
The following figure shows adding a new allocation node (2) that can reuse the address freed by a dependent free node (1).

Figure 28: Adding New Alloc Node 2

The following figure shows adding a new alloc node (3). The new alloc node is not dependent on the free node (2) so cannot reuse the address from the associated alloc node (2). If the alloc node (2) used the address freed by free node (1), the new alloc node (3) would need a new address.

Figure 29: Adding New Alloc Node 3
4.2.5.3.2 Physical Memory Management and Sharing
CUDA is responsible for mapping physical memory to the virtual address before the allocating node is reached in GPU order. As an optimization for memory footprint and mapping overhead, multiple graphs may use the same physical memory for distinct allocations if they will not run simultaneously; however, physical pages cannot be reused if they are bound to more than one executing graph at the same time, or to a graph allocation which remains unfreed.
CUDA may update physical memory mappings at any time during graph instantiation, launch, or execution. CUDA may also introduce synchronization between future graph launches in order to prevent live graph allocations from referring to the same physical memory. As for any allocate-free-allocate pattern, if a program accesses a pointer outside of an allocation's lifetime, the erroneous access may silently read or write live data owned by another allocation (even if the virtual address of the allocation is unique). Use of compute sanitizer tools can catch this error.
The following figure shows graphs sequentially launched in the same stream. In this example, each graph frees all the memory it allocates. Since the graphs in the same stream never run concurrently, CUDA can and should use the same physical memory to satisfy all the allocations.
4.2.5.4 Performance Considerations
When multiple graphs are launched into the same stream, CUDA attempts to allocate the same physical memory to them because the execution of these graphs cannot overlap. Physical mappings for a graph are retained between launches as an optimization to avoid the cost of remapping. If, at a later time, one of the graphs is launched such that its execution may overlap with the others (for example, if it is launched into a different stream) then CUDA must perform some remapping because concurrent graphs require distinct memory to avoid data corruption.
In general, remapping of graph memory in CUDA is likely caused by these operations:
▶ Changing the stream into which a graph is launched
▶ A trim operation on the graph memory pool, which explicitly frees unused memory (discussed in Physical Memory Footprint)
▶ Relaunching a graph while an unfreed allocation from another graph is mapped to the same memory will cause a remap of memory before relaunch
Remapping must happen in execution order, but after any previous execution of that graph is complete (otherwise memory that is still in use could be unmapped). Due to this ordering dependency, as well as because mapping operations are OS calls, mapping operations can be relatively expensive. Applications can avoid this cost by launching graphs containing allocation memory nodes consistently into the same stream.
4.2.5.4.1 First Launch / cudaGraphUpload
Physical memory cannot be allocated or mapped during graph instantiation because the stream in which the graph will execute is unknown. Mapping is done instead during graph launch. Calling cudaGraphUpload can separate out the cost of allocation from the launch by performing all mappings for that graph immediately and associating the graph with the upload stream. If the graph is then launched into the same stream, it will launch without any additional remapping.
Using different streams for graph upload and graph launch behaves similarly to switching streams, likely resulting in remap operations. In addition, unrelated memory pool management is permitted to pull memory from an idle stream, which could negate the impact of the uploads.
Figure 30: Sequentially Launched Graphs
4.2.5.5 Physical Memory Footprint
The pool-management behavior of asynchronous allocation means that destroying a graph which contains memory nodes (even if their allocations are free) will not immediately return physical memory to the OS for use by other processes. To explicitly release memory back to the OS, an application should use the cudaDeviceGraphMemTrim API.
cudaDeviceGraphMemTrim will unmap and release any physical memory reserved by graph memory nodes that is not actively in use. Allocations that have not been freed and graphs that are scheduled or running are considered to be actively using the physical memory and will not be impacted. Use of the trim API will make physical memory available to other allocation APIs and other applications or processes, but will cause CUDA to reallocate and remap memory when the trimmed graphs are next launched. Note that cudaDeviceGraphMemTrim operates on a different pool from cudaMemPoolTrimTo(). The graph memory pool is not exposed to the stream ordered memory allocator.
CUDA allows applications to query their graph memory footprint through the cudaDeviceGetGraphMemAttribute API. Querying the attribute cudaGraphMemAttrReservedMemCurrent returns the amount of physical memory reserved by the driver for graph allocations in the current process. Querying cudaGraphMemAttrUsedMemCurrent returns the amount of physical memory currently mapped by at least one graph. Either of these attributes can be used to track when new physical memory is acquired by CUDA for the sake of an allocating graph. Both of these attributes are useful for examining how much memory is saved by the sharing mechanism.
4.2.5.6 Peer Access
Graph allocations can be configured for access from multiple GPUs, in which case CUDA will map the allocations onto the peer GPUs as required. CUDA allows graph allocations requiring different mappings to reuse the same virtual address. When this occurs, the address range is mapped onto all GPUs required by the different allocations. This means an allocation may sometimes allow more peer access than was requested during its creation; however, relying on these extra mappings is still an error.
4.2.5.6.1 Peer Access with Graph Node APIs
The cudaGraphAddNode API accepts mapping requests in the accessDescs array field of the alloc node parameters structure. The poolProps.location embedded structure specifies the resident device for the allocation. Access from the allocating GPU is assumed to be needed, thus the application does not need to specify an entry for the resident device in the accessDescs array.
cudaGraphNodeParams allocNodeParams = { cudaGraphNodeTypeMemAlloc };
allocNodeParams.alloc.poolProps.allocType = cudaMemAllocationTypePinned;
allocNodeParams.alloc.poolProps.location.type = cudaMemLocationTypeDevice;
// specify device 1 as the resident device
allocNodeParams.alloc.poolProps.location.id = 1;
allocNodeParams.alloc.bytesize = size;

// allocate an allocation resident on device 1 accessible from device 1
cudaGraphAddNode(&allocNode, graph, NULL, NULL, 0, &allocNodeParams);

cudaMemAccessDesc accessDescs[2];
// boilerplate for the access descs (only ReadWrite and Device access supported
// by the add node api)
accessDescs[0].flags = cudaMemAccessFlagsProtReadWrite;
accessDescs[0].location.type = cudaMemLocationTypeDevice;
accessDescs[1].flags = cudaMemAccessFlagsProtReadWrite;
accessDescs[1].location.type = cudaMemLocationTypeDevice;

// access being requested for device 0 & 2. Device 1 access requirement left
// implicit.
accessDescs[0].location.id = 0;
accessDescs[1].location.id = 2;

// access request array has 2 entries.
allocNodeParams.alloc.accessDescCount = 2;
allocNodeParams.alloc.accessDescs = accessDescs;

// allocate an allocation resident on device 1 accessible from devices 0, 1 and
// 2. (0 & 2 from the descriptors, 1 from it being the resident device).
cudaGraphAddNode(&allocNode, graph, NULL, NULL, 0, &allocNodeParams);
4.2.5.6.2 Peer Access with Stream Capture
For stream capture, the allocation node records the peer accessibility of the allocating pool at the time of the capture. Altering the peer accessibility of the allocating pool after a cudaMallocFromPoolAsync call is captured does not affect the mappings that the graph will make for the allocation.
// boilerplate for the access desc (only ReadWrite and Device access supported
// by the add node api)
accessDesc.flags = cudaMemAccessFlagsProtReadWrite;
accessDesc.location.type = cudaMemLocationTypeDevice;
accessDesc.location.id = 1;

// let memPool be resident and accessible on device 0

cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
cudaMallocFromPoolAsync(&dptr1, size, memPool, stream);
cudaStreamEndCapture(stream, &graph1);

cudaMemPoolSetAccess(memPool, &accessDesc, 1);

cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
cudaMallocFromPoolAsync(&dptr2, size, memPool, stream);
cudaStreamEndCapture(stream, &graph2);

// The graph node allocating dptr1 would only have the device 0 accessibility
// even though memPool now has device 1 accessibility.
// The graph node allocating dptr2 will have device 0 and device 1 accessibility,
// since that was the pool accessibility at the time of the
// cudaMallocFromPoolAsync call.
4.2.6. Device Graph Launch
There are many workflows which need to make data-dependent decisions during runtime and execute different operations depending on those decisions. Rather than offloading this decision-making process to the host, which may require a round-trip from the device, users may prefer to perform it on the device. To that end, CUDA provides a mechanism to launch graphs from the device.
Device graph launch provides a convenient way to perform dynamic control flow from the device, be it something as simple as a loop or as complex as a device-side work scheduler.

Graphs which can be launched from the device will henceforth be referred to as device graphs, and graphs which cannot be launched from the device will be referred to as host graphs.

Device graphs can be launched from both the host and device, whereas host graphs can only be launched from the host. Unlike host launches, launching a device graph from the device while a previous launch of the graph is running will result in an error, returning cudaErrorInvalidValue; therefore, a device graph cannot be launched twice from the device at the same time. Launching a device graph from the host and device simultaneously will result in undefined behavior.
4.2.6.1 Device Graph Creation

In order for a graph to be launched from the device, it must be instantiated explicitly for device launch. This is achieved by passing the cudaGraphInstantiateFlagDeviceLaunch flag to the cudaGraphInstantiate() call. As is the case for host graphs, device graph structure is fixed at time of instantiation and cannot be updated without re-instantiation, and instantiation can only be performed on the host. In order for a graph to be able to be instantiated for device launch, it must adhere to various requirements.
4.2.6.1.1 Device Graph Requirements

General requirements:

▶ The graph's nodes must all reside on a single device.
▶ The graph can only contain kernel nodes, memcpy nodes, memset nodes, and child graph nodes.

Kernel nodes:

▶ Use of CUDA Dynamic Parallelism by kernels in the graph is not permitted.
▶ Cooperative launches are permitted so long as MPS is not in use.

Memcpy nodes:

▶ Only copies involving device memory and/or pinned device-mapped host memory are permitted.
▶ Copies involving CUDA arrays are not permitted.
▶ Both operands must be accessible from the current device at time of instantiation. Note that the copy operation will be performed from the device on which the graph resides, even if it is targeting memory on another device.
4.2.6.1.2 Device Graph Upload

In order to launch a graph on the device, it must first be uploaded to the device to populate the necessary device resources. This can be achieved in one of two ways.

Firstly, the graph can be uploaded explicitly, either via cudaGraphUpload() or by requesting an upload as part of instantiation via cudaGraphInstantiateWithParams().

Alternatively, the graph can first be launched from the host, which will perform this upload step implicitly as part of the launch.

Examples of all three methods can be seen below:
// Explicit upload after instantiation
cudaGraphInstantiate(&deviceGraphExec1, deviceGraph1, cudaGraphInstantiateFlagDeviceLaunch);
cudaGraphUpload(deviceGraphExec1, stream);

// Explicit upload as part of instantiation
cudaGraphInstantiateParams instantiateParams = {0};
instantiateParams.flags = cudaGraphInstantiateFlagDeviceLaunch | cudaGraphInstantiateFlagUpload;
instantiateParams.uploadStream = stream;
cudaGraphInstantiateWithParams(&deviceGraphExec2, deviceGraph2, &instantiateParams);

// Implicit upload via host launch
cudaGraphInstantiate(&deviceGraphExec3, deviceGraph3, cudaGraphInstantiateFlagDeviceLaunch);
cudaGraphLaunch(deviceGraphExec3, stream);
4.2.6.1.3 Device Graph Update

Device graphs can only be updated from the host, and must be re-uploaded to the device upon executable graph update in order for the changes to take effect. This can be achieved using the same methods outlined in Device Graph Upload above. Unlike host graphs, launching a device graph from the device while an update is being applied will result in undefined behavior.
4.2.6.2 Device Launch

Device graphs can be launched from both the host and the device via cudaGraphLaunch(), which has the same signature on the device as on the host. Device graphs are launched via the same handle on the host and the device. Device graphs must be launched from another graph when launched from the device.

Device-side graph launch is per-thread and multiple launches may occur from different threads at the same time, so the user will need to select a single thread from which to launch a given graph.

Unlike host launch, device graphs cannot be launched into regular CUDA streams, and can only be launched into distinct named streams, which each denote a specific launch mode. The following table lists the available launch modes.
Table 9: Device-only Graph Launch Streams

| Stream                                | Launch Mode            |
| ------------------------------------- | ---------------------- |
| cudaStreamGraphFireAndForget          | Fire and forget launch |
| cudaStreamGraphTailLaunch             | Tail launch            |
| cudaStreamGraphFireAndForgetAsSibling | Sibling launch         |
4.2.6.2.1 Fire and Forget Launch

As the name suggests, a fire and forget launch is submitted to the GPU immediately, and it runs independently of the launching graph. In a fire-and-forget scenario, the launching graph is the parent, and the launched graph is the child.

Figure 31: Fire and forget launch

The above diagram can be generated by the sample code below:
__global__ void launchFireAndForgetGraph(cudaGraphExec_t graph) {
    cudaGraphLaunch(graph, cudaStreamGraphFireAndForget);
}

void graphSetup() {
    cudaGraphExec_t gExec1, gExec2;
    cudaGraph_t g1, g2;

    // Create, instantiate, and upload the device graph.
    create_graph(&g2);
    cudaGraphInstantiate(&gExec2, g2, cudaGraphInstantiateFlagDeviceLaunch);
    cudaGraphUpload(gExec2, stream);

    // Create and instantiate the launching graph.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    launchFireAndForgetGraph<<<1, 1, 0, stream>>>(gExec2);
    cudaStreamEndCapture(stream, &g1);
    cudaGraphInstantiate(&gExec1, g1);

    // Launch the host graph, which will in turn launch the device graph.
    cudaGraphLaunch(gExec1, stream);
}
A graph can have up to 120 total fire-and-forget graphs during the course of its execution. This total resets between launches of the same parent graph.
4.2.6.2.1.1 Graph Execution Environments

In order to fully understand the device-side synchronization model, it is first necessary to understand the concept of an execution environment.

When a graph is launched from the device, it is launched into its own execution environment. The execution environment of a given graph encapsulates all work in the graph as well as all generated fire and forget work. The graph can be considered complete when it has completed execution and when all generated child work is complete.

The below diagram shows the environment encapsulation that would be generated by the fire-and-forget sample code in the previous section.

Figure 32: Fire and forget launch, with execution environments

These environments are also hierarchical, so a graph environment can include multiple levels of child environments from fire and forget launches.

When a graph is launched from the host, there exists a stream environment that parents the execution environment of the launched graph. The stream environment encapsulates all work generated as part of the overall launch. The stream launch is complete (i.e. downstream dependent work may now run) when the overall stream environment is marked as complete.
4.2.6.2.2 Tail Launch

Unlike on the host, it is not possible to synchronize with device graphs from the GPU via traditional methods such as cudaDeviceSynchronize() or cudaStreamSynchronize(). Rather, in order to enable serial work dependencies, a different launch mode - tail launch - is offered, to provide similar functionality.
Figure 33: Nested fire and forget environments

Figure 34: The stream environment, visualized
A tail launch executes when a graph's environment is considered complete - i.e., when the graph and all its children are complete. When a graph completes, the environment of the next graph in the tail launch list will replace the completed environment as a child of the parent environment. Like fire-and-forget launches, a graph can have multiple graphs enqueued for tail launch.

Figure 35: A simple tail launch

The above execution flow can be generated by the code below:
__global__ void launchTailGraph(cudaGraphExec_t graph) {
    cudaGraphLaunch(graph, cudaStreamGraphTailLaunch);
}

void graphSetup() {
    cudaGraphExec_t gExec1, gExec2;
    cudaGraph_t g1, g2;

    // Create, instantiate, and upload the device graph.
    create_graph(&g2);
    cudaGraphInstantiate(&gExec2, g2, cudaGraphInstantiateFlagDeviceLaunch);
    cudaGraphUpload(gExec2, stream);

    // Create and instantiate the launching graph.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    launchTailGraph<<<1, 1, 0, stream>>>(gExec2);
    cudaStreamEndCapture(stream, &g1);
    cudaGraphInstantiate(&gExec1, g1);

    // Launch the host graph, which will in turn launch the device graph.
    cudaGraphLaunch(gExec1, stream);
}
Tail launches enqueued by a given graph will execute one at a time, in order of when they were enqueued. So the first enqueued graph will run first, and then the second, and so on.

Tail launches enqueued by a tail graph will execute before tail launches enqueued by previous graphs in the tail launch list. These new tail launches will execute in the order they are enqueued.
Figure 36: Tail launch ordering

Figure 37: Tail launch ordering when enqueued from multiple graphs

A graph can have up to 255 pending tail launches.
4.2.6.2.2.1 Tail Self-launch

It is possible for a device graph to enqueue itself for a tail launch, although a given graph can only have one self-launch enqueued at a time. In order to query the currently running device graph so that it can be relaunched, a new device-side function is added:

cudaGraphExec_t cudaGetCurrentGraphExec();

This function returns the handle of the currently running graph if it is a device graph. If the currently executing kernel is not a node within a device graph, this function will return NULL.

Below is sample code showing usage of this function for a relaunch loop:
__device__ int relaunchCount = 0;

__global__ void relaunchSelf() {
    int relaunchMax = 100;

    if (threadIdx.x == 0) {
        if (relaunchCount < relaunchMax) {
            cudaGraphLaunch(cudaGetCurrentGraphExec(), cudaStreamGraphTailLaunch);
        }

        relaunchCount++;
    }
}
4.2.6.2.3 Sibling Launch

Sibling launch is a variation of fire-and-forget launch in which the graph is launched not as a child of the launching graph's execution environment, but rather as a child of the launching graph's parent environment. Sibling launch is equivalent to a fire-and-forget launch from the launching graph's parent environment.

Figure 38: A simple sibling launch

The above diagram can be generated by the sample code below:
__global__ void launchSiblingGraph(cudaGraphExec_t graph) {
    cudaGraphLaunch(graph, cudaStreamGraphFireAndForgetAsSibling);
}

void graphSetup() {
    cudaGraphExec_t gExec1, gExec2;
    cudaGraph_t g1, g2;

    // Create, instantiate, and upload the device graph.
    create_graph(&g2);
    cudaGraphInstantiate(&gExec2, g2, cudaGraphInstantiateFlagDeviceLaunch);
    cudaGraphUpload(gExec2, stream);

    // Create and instantiate the launching graph.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    launchSiblingGraph<<<1, 1, 0, stream>>>(gExec2);
    cudaStreamEndCapture(stream, &g1);
    cudaGraphInstantiate(&gExec1, g1);

    // Launch the host graph, which will in turn launch the device graph.
    cudaGraphLaunch(gExec1, stream);
}
Since sibling launches are not launched into the launching graph's execution environment, they will not gate tail launches enqueued by the launching graph.
4.2.7. Using Graph APIs

cudaGraph_t objects are not thread-safe. It is the responsibility of the user to ensure that multiple threads do not concurrently access the same cudaGraph_t.

A cudaGraphExec_t cannot run concurrently with itself. A launch of a cudaGraphExec_t will be ordered after previous launches of the same executable graph.

Graph execution is done in streams for ordering with other asynchronous work. However, the stream is for ordering only; it does not constrain the internal parallelism of the graph, nor does it affect where graph nodes execute.

See Graph API.
4.2.8. CUDA User Objects

CUDA User Objects can be used to help manage the lifetime of resources used by asynchronous work in CUDA. In particular, this feature is useful for CUDA graphs and stream capture.

Various resource management schemes are not compatible with CUDA graphs. Consider for example an event-based pool or a synchronous-create, asynchronous-destroy scheme.
// Library API with pool allocation
void libraryWork(cudaStream_t stream) {
    auto &resource = pool.claimTemporaryResource();
    resource.waitOnReadyEventInStream(stream);
    launchWork(stream, resource);
    resource.recordReadyEvent(stream);
}
// Library API with asynchronous resource deletion
void libraryWork(cudaStream_t stream) {
    Resource *resource = new Resource(...);
    launchWork(stream, resource);
    cudaLaunchHostFunc(
        stream,
        [](void *resource) {
            delete static_cast<Resource *>(resource);
        },
        resource);
    // Error handling considerations not shown
}
These schemes are difficult with CUDA graphs because of the non-fixed pointer or handle for the resource which requires indirection or graph update, and the synchronous CPU code needed each time the work is submitted. They also do not work with stream capture if these considerations are hidden from the caller of the library, and because of use of disallowed APIs during capture. Various solutions exist such as exposing the resource to the caller. CUDA user objects present another approach.

A CUDA user object associates a user-specified destructor callback with an internal refcount, similar to C++ shared_ptr. References may be owned by user code on the CPU and by CUDA graphs. Note that for user-owned references, unlike C++ smart pointers, there is no object representing the reference; users must track user-owned references manually. A typical use case would be to immediately move the sole user-owned reference to a CUDA graph after the user object is created.

When a reference is associated to a CUDA graph, CUDA will manage the graph operations automatically. A cloned cudaGraph_t retains a copy of every reference owned by the source cudaGraph_t, with the same multiplicity. An instantiated cudaGraphExec_t retains a copy of every reference in the source cudaGraph_t. When a cudaGraphExec_t is destroyed without being synchronized, the references are retained until the execution is completed.
Here is an example use.
cudaGraph_t graph;  // Preexisting graph

Object *object = new Object;  // C++ object with possibly nontrivial destructor
cudaUserObject_t cuObject;
cudaUserObjectCreate(
    &cuObject,
    object,  // Here we use a CUDA-provided template wrapper for this API,
             // which supplies a callback to delete the C++ object pointer
    1,  // Initial refcount
    cudaUserObjectNoDestructorSync  // Acknowledge that the callback cannot be
                                    // waited on via CUDA
);
cudaGraphRetainUserObject(
    graph,
    cuObject,
    1,  // Number of references
    cudaGraphUserObjectMove  // Transfer a reference owned by the caller (do
                             // not modify the total reference count)
);
// No more references owned by this thread; no need to call release API
cudaGraphExec_t graphExec;
cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);  // Will retain a
                                                               // new reference
cudaGraphDestroy(graph);          // graphExec still owns a reference
cudaGraphLaunch(graphExec, 0);    // Async launch has access to the user objects
cudaGraphExecDestroy(graphExec);  // Launch is not synchronized; the release
                                  // will be deferred if needed
cudaStreamSynchronize(0);  // After the launch is synchronized, the remaining
                           // reference is released and the destructor will
                           // execute. Note this happens asynchronously.
// If the destructor callback had signaled a synchronization object, it would
// be safe to wait on it at this point.
References owned by graphs in child graph nodes are associated to the child graphs, not the parents. If a child graph is updated or deleted, the references change accordingly. If an executable graph or child graph is updated with cudaGraphExecUpdate or cudaGraphExecChildGraphNodeSetParams, the references in the new source graph are cloned and replace the references in the target graph. In either case, if previous launches are not synchronized, any references which would be released are held until the launches have finished executing.

There is not currently a mechanism to wait on user object destructors via a CUDA API. Users may signal a synchronization object manually from the destructor code. In addition, it is not legal to call CUDA APIs from the destructor, similar to the restriction on cudaLaunchHostFunc. This is to avoid blocking a CUDA internal shared thread and preventing forward progress. It is legal to signal another thread to perform an API call, if the dependency is one way and the thread doing the call cannot block forward progress of CUDA work.

User objects are created with cudaUserObjectCreate, which is a good starting point to browse related APIs.
4.3. Stream-Ordered Memory Allocator

4.3.1. Introduction

Managing memory allocations using cudaMalloc and cudaFree causes the GPU to synchronize across all executing CUDA streams. The stream-ordered memory allocator enables applications to order memory allocation and deallocation with other work launched into a CUDA stream, such as kernel launches and asynchronous copies. This improves application memory use by taking advantage of stream-ordering semantics to reuse memory allocations. The allocator also allows applications to control the allocator's memory caching behavior. When set up with an appropriate release threshold, the caching behavior allows the allocator to avoid expensive calls into the OS when the application indicates it is willing to accept a bigger memory footprint. The allocator also supports easy and secure allocation sharing between processes.
The stream-ordered memory allocator:

▶ Reduces the need for custom memory management abstractions, and makes it easier to create high-performance custom memory management for applications that need it.
▶ Enables multiple libraries to share a common memory pool managed by the driver. This can reduce excess memory consumption.
▶ Allows the driver to perform optimizations based on its awareness of the allocator and other stream management APIs.
Note

Nsight Compute and the Next-Gen CUDA debugger are aware of the allocator since CUDA 11.3.
4.3.2. Memory Management

cudaMallocAsync and cudaFreeAsync are the APIs which enable stream-ordered memory management. cudaMallocAsync returns an allocation and cudaFreeAsync frees an allocation. Both APIs accept stream arguments to define when the allocation will become and stop being available for use. These functions allow memory operations to be tied to specific CUDA streams, enabling them to occur without blocking the host or other streams. Application performance can be improved by avoiding potentially costly synchronization of cudaMalloc and cudaFree.

These APIs can be used for further performance optimization through memory pools, which manage and reuse large blocks of memory for more efficient allocation and deallocation. Memory pools help reduce overhead and prevent fragmentation, improving performance in scenarios with frequent memory allocation operations.
4.3.2.1 Allocating Memory

The cudaMallocAsync function triggers asynchronous memory allocation on the GPU, linked to a specific CUDA stream. cudaMallocAsync allows memory allocation to occur without hindering the host or other streams, eliminating the need for expensive synchronization.

Note

cudaMallocAsync ignores the current device/context when determining where the allocation will reside. Instead, cudaMallocAsync determines the appropriate device based on the specified memory pool or the supplied stream.

The listing below illustrates a fundamental use pattern: the memory is allocated, used, and then freed back into the same stream.
void *ptr;
size_t size = 512;
cudaMallocAsync(&ptr, size, cudaStreamPerThread);
// do work using the allocation
kernel<<<..., cudaStreamPerThread>>>(ptr, ...);
// An asynchronous free can be specified without synchronizing the CPU and GPU
cudaFreeAsync(ptr, cudaStreamPerThread);
Note

When accessing an allocation from a stream other than the stream that made the allocation, the user must guarantee that the access occurs after the allocation operation; otherwise, the behavior is undefined.
4.3.2.2 Freeing Memory

cudaFreeAsync() asynchronously frees device memory in a stream-ordered fashion, meaning the memory deallocation is assigned to a specific CUDA stream and does not block the host or other streams.

The user must guarantee that the free operation happens after the allocation operation and any uses of the allocation. Any use of the allocation after the free operation starts results in undefined behavior. Events and/or stream synchronizing operations should be used to guarantee any access to the allocation from other streams is complete before the free operation begins, as illustrated in the following example.
cudaMallocAsync(&ptr, size, stream1);
cudaEventRecord(event1, stream1);

// stream2 must wait for the allocation to be ready before accessing
cudaStreamWaitEvent(stream2, event1);
kernel<<<..., stream2>>>(ptr, ...);
cudaEventRecord(event2, stream2);

// stream3 must wait for stream2 to finish accessing the allocation before
// freeing the allocation
cudaStreamWaitEvent(stream3, event2);
cudaFreeAsync(ptr, stream3);
Memory allocated with cudaMalloc() can be freed with cudaFreeAsync(). As above, all accesses to the memory must be complete before the free operation begins.

cudaMalloc(&ptr, size);
kernel<<<..., stream>>>(ptr, ...);
cudaFreeAsync(ptr, stream);
Likewise, memory allocated with cudaMallocAsync can be freed with cudaFree(). When freeing such allocations through the cudaFree() API, the driver assumes that all accesses to the allocation are complete and performs no further synchronization. The user can use cudaStreamQuery / cudaStreamSynchronize / cudaEventQuery / cudaEventSynchronize / cudaDeviceSynchronize to guarantee that the appropriate asynchronous work is complete and that the GPU will not try to access the allocation.
cudaMallocAsync(&ptr, size, stream);
kernel<<<..., stream>>>(ptr, ...);
// synchronize is needed to avoid prematurely freeing the memory
cudaStreamSynchronize(stream);
cudaFree(ptr);
4.3.3. Memory Pools

Memory pools encapsulate virtual address and physical memory resources that are allocated and managed according to the pool's attributes and properties. The primary aspect of a memory pool is the kind and location of memory it manages.

All calls to cudaMallocAsync use resources from a memory pool. If a memory pool is not specified, cudaMallocAsync uses the current memory pool of the supplied stream's device. The current memory pool for a device may be set with cudaDeviceSetMempool and queried with cudaDeviceGetMempool. Each device has a default memory pool, which is active if cudaDeviceSetMempool has not been called.
The API cudaMallocFromPoolAsync and C++ overloads of cudaMallocAsync allow a user to specify the pool to be used for an allocation without setting it as the current pool. The APIs cudaDeviceGetDefaultMempool and cudaMemPoolCreate return handles to memory pools. cudaMemPoolSetAttribute and cudaMemPoolGetAttribute control the attributes of memory pools.

Note

The mempool current to a device will be local to that device. So allocating without specifying a memory pool will always yield an allocation local to the stream's device.
4.3.3.1 Default/Implicit Pools

The default memory pool of a device can be retrieved by calling cudaDeviceGetDefaultMempool. Allocations from the default memory pool of a device are non-migratable device allocations located on that device. These allocations will always be accessible from that device. The accessibility of the default memory pool can be modified with cudaMemPoolSetAccess and queried with cudaMemPoolGetAccess. Since the default pools do not need to be explicitly created, they are sometimes referred to as implicit pools. The default memory pool of a device does not support IPC.
4.3.3.2 Explicit Pools

cudaMemPoolCreate creates an explicit pool. This allows applications to request properties for their allocations beyond what is provided by the default/implicit pools. These include properties such as IPC capability, maximum pool size, allocations resident on a specific CPU NUMA node on supported platforms, etc.
// create a pool similar to the implicit pool on device 0
int device = 0;
cudaMemPoolProps poolProps = { };
poolProps.allocType     = cudaMemAllocationTypePinned;
poolProps.location.id   = device;
poolProps.location.type = cudaMemLocationTypeDevice;
cudaMemPoolCreate(&memPool, &poolProps);
The following code snippet illustrates an example of creating an IPC-capable memory pool on a valid CPU NUMA node.

// create a pool resident on a CPU NUMA node that is capable of IPC sharing
// (via a file descriptor).
int cpu_numa_id = 0;
cudaMemPoolProps poolProps = { };
poolProps.allocType     = cudaMemAllocationTypePinned;
poolProps.location.id   = cpu_numa_id;
poolProps.location.type = cudaMemLocationTypeHostNuma;
poolProps.handleType    = cudaMemHandleTypePosixFileDescriptor;
cudaMemPoolCreate(&ipcMemPool, &poolProps);
4.3.3.3 Device Accessibility for Multi-GPU Support

Like allocation accessibility controlled through the virtual memory management APIs, memory pool allocation accessibility does not follow cudaDeviceEnablePeerAccess or cuCtxEnablePeerAccess. For memory pools, the cudaMemPoolSetAccess API modifies which devices can access allocations from a pool. By default, allocations are accessible only from the device where the allocations are located. This access cannot be revoked. To enable access from other devices, the accessing device must be peer capable with the memory pool's device. This can be verified with cudaDeviceCanAccessPeer. If the peer capability is not checked, the set access may fail with cudaErrorInvalidDevice. However, if no allocations have been made from the pool, the cudaMemPoolSetAccess call may succeed even when the devices are not peer capable. In this case, the next allocation from the pool will fail.

It is worth noting that cudaMemPoolSetAccess affects all allocations from the memory pool, not just future ones. Likewise, the accessibility reported by cudaMemPoolGetAccess applies to all allocations from the pool, not just future ones. Changing the accessibility settings of a pool for a given GPU frequently is not recommended. That is, once a pool is made accessible from a given GPU, it should remain accessible from that GPU for the lifetime of the pool.
// snippet showing usage of cudaMemPoolSetAccess:
cudaError_t setAccessOnDevice(cudaMemPool_t memPool, int residentDevice,
                              int accessingDevice) {
    cudaMemAccessDesc accessDesc = {};
    accessDesc.location.type = cudaMemLocationTypeDevice;
    accessDesc.location.id   = accessingDevice;
    accessDesc.flags         = cudaMemAccessFlagsProtReadWrite;

    int canAccess = 0;
    cudaError_t error = cudaDeviceCanAccessPeer(&canAccess, accessingDevice,
                                                residentDevice);
    if (error != cudaSuccess) {
        return error;
    } else if (canAccess == 0) {
        return cudaErrorPeerAccessUnsupported;
    }

    // Make the address accessible
    return cudaMemPoolSetAccess(memPool, &accessDesc, 1);
}
4.3.3.4 Enabling Memory Pools for IPC
Memory pools can be enabled for interprocess communication (IPC) to allow easy, efficient and secure sharing of GPU memory between processes. CUDA's IPC memory pools provide the same security benefits as CUDA's virtual memory management APIs.

There are two steps to sharing memory between processes with memory pools: the processes first need to share access to the pool, then share specific allocations from that pool. The first step establishes and enforces security. The second step coordinates what virtual addresses are used in each process and when mappings need to be valid in the importing process.
4.3.3.4.1 Creating and Sharing IPC Memory Pools
Sharing access to a pool involves retrieving an OS-native handle to the pool with cudaMemPoolExportToShareableHandle(), transferring the handle to the importing process using OS-native IPC mechanisms, and then creating an imported memory pool with the cudaMemPoolImportFromShareableHandle() API. For cudaMemPoolExportToShareableHandle to succeed, the memory pool must have been created with the requested handle type specified in the pool properties structure.

Please reference the samples for the appropriate IPC mechanisms to transfer the OS-native handle between processes. The rest of the procedure can be found in the following code snippets.
// in exporting process

// create an exportable IPC capable pool on device 0
cudaMemPoolProps poolProps = { };
poolProps.allocType     = cudaMemAllocationTypePinned;
poolProps.location.id   = 0;
poolProps.location.type = cudaMemLocationTypeDevice;

// Setting handleTypes to a non zero value will make the pool exportable (IPC capable)
poolProps.handleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;
cudaMemPoolCreate(&memPool, &poolProps);

// FD based handles are integer types
int fdHandle = 0;

// Retrieve an OS native handle to the pool.
// Note that a pointer to the handle memory is passed in here.
cudaMemPoolExportToShareableHandle(&fdHandle,
                                   memPool,
                                   CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR,
                                   0);

// The handle must be sent to the importing process with the appropriate
// OS-specific APIs.

// in importing process
int fdHandle;

// The handle needs to be retrieved from the exporting process with the
// appropriate OS-specific APIs.

// Create an imported pool from the shareable handle.
// Note that the handle is passed by value here.
cudaMemPoolImportFromShareableHandle(&importedMemPool,
                                     (void*)fdHandle,
                                     CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR,
                                     0);
4.3.3.4.2 Set Access in the Importing Process
Imported memory pools are initially only accessible from their resident device. The imported memory pool does not inherit any accessibility set by the exporting process. The importing process needs to enable access with cudaMemPoolSetAccess from any GPU it plans to access the memory from.

If the imported memory pool belongs to a device that is not visible to the importing process, the user must use the cudaMemPoolSetAccess API to enable access from the GPUs the allocations will be used on. (See Device Accessibility for Multi-GPU Support.)
4.3.3.4.3 Creating and Sharing Allocations from an Exported Pool
Once the pool has been shared, allocations made with cudaMallocAsync() from the pool in the exporting process can be shared with processes that have imported the pool. Since the pool's security policy is established and verified at the pool level, the OS does not need extra bookkeeping to provide security for specific pool allocations. In other words, the opaque cudaMemPoolPtrExportData required to import a pool allocation may be sent to the importing process using any mechanism.

While allocations may be exported and imported without synchronizing with the allocating stream in any way, the importing process must follow the same rules as the exporting process when accessing the allocation. Specifically, access to the allocation must happen after the allocation operation in the allocating stream executes. The two following code snippets show cudaMemPoolExportPointer() and cudaMemPoolImportPointer() sharing the allocation, with an IPC event used to guarantee that the allocation isn't accessed in the importing process before the allocation is ready.
// preparing an allocation in the exporting process
cudaMemPoolPtrExportData exportData;
cudaEvent_t readyIpcEvent;
cudaIpcEventHandle_t readyIpcEventHandle;

// ipc event for coordinating between processes
// cudaEventInterprocess flag makes the event an ipc event
// cudaEventDisableTiming is set for performance reasons
cudaEventCreate(&readyIpcEvent,
                cudaEventDisableTiming | cudaEventInterprocess);

// allocate from the exporting mem pool
cudaMallocAsync(&ptr, size, exportMemPool, stream);

// event for sharing when the allocation is ready.
cudaEventRecord(readyIpcEvent, stream);
cudaMemPoolExportPointer(&exportData, ptr);
cudaIpcGetEventHandle(&readyIpcEventHandle, readyIpcEvent);

// Share IPC event and pointer export data with the importing process using
// any mechanism. Here we copy the data into shared memory.
shmem->ptrData = exportData;
shmem->readyIpcEventHandle = readyIpcEventHandle;
// signal consumers data is ready

// Importing an allocation
cudaMemPoolPtrExportData *importData = &shmem->ptrData;
cudaEvent_t readyIpcEvent;
cudaIpcEventHandle_t *readyIpcEventHandle = &shmem->readyIpcEventHandle;

// Need to retrieve the ipc event handle and the export data from the
// exporting process using any mechanism. Here we are using shmem and just
// need synchronization to make sure the shared memory is filled in.
cudaIpcOpenEventHandle(&readyIpcEvent, *readyIpcEventHandle);

// import the allocation. The operation does not block on the allocation being ready.
cudaMemPoolImportPointer(&ptr, importedMemPool, importData);

// Wait for the prior stream operations in the allocating stream to complete
// before using the allocation in the importing process.
cudaStreamWaitEvent(stream, readyIpcEvent);
kernel<<<..., stream>>>(ptr, ...);
When freeing the allocation, the allocation must be freed in the importing process before it is freed in the exporting process. The following code snippet demonstrates the use of CUDA IPC events to provide the required synchronization between the cudaFreeAsync operations in both processes. Access to the allocation from the importing process is obviously restricted by the free operation on the importing process side. It is worth noting that cudaFree can be used to free the allocation in both processes and that other stream synchronization APIs may be used instead of CUDA IPC events.
// The free must happen in importing process before the exporting process
kernel<<<..., stream>>>(ptr, ...);

// Last access in importing process
cudaFreeAsync(ptr, stream);

// Access not allowed in the importing process after the free
cudaEventRecord(finishedIpcEvent, stream);

// Exporting process
// The exporting process needs to coordinate its free with the stream order
// of the importing process's free.
cudaStreamWaitEvent(stream, finishedIpcEvent);
kernel<<<..., stream>>>(ptrInExportingProcess, ...);

// The free in the importing process doesn't stop the exporting process
// from using the allocation.
cudaFreeAsync(ptrInExportingProcess, stream);
4.3.3.4.4 IPC Export Pool Limitations
IPC pools currently do not support releasing physical blocks back to the OS. As a result, the cudaMemPoolTrimTo API has no effect and the cudaMemPoolAttrReleaseThreshold is effectively ignored. This behavior is controlled by the driver, not the runtime, and may change in a future driver update.
4.3.3.4.5 IPC Import Pool Limitations
Allocating from an import pool is not allowed; specifically, import pools cannot be set current and cannot be used in the cudaMallocFromPoolAsync API. As such, the allocation reuse policy attributes have no meaning for these pools.

IPC import pools, like IPC export pools, currently do not support releasing physical blocks back to the OS.

The resource usage stat attribute queries only reflect the allocations imported into the process and the associated physical memory.
4.3.4. Best Practices and Tuning

4.3.4.1 Query for Support
An application can determine whether or not a device supports the stream-ordered memory allocator by calling cudaDeviceGetAttribute() (see the developer blog) with the device attribute cudaDevAttrMemoryPoolsSupported.

IPC memory pool support can be queried with the cudaDevAttrMemoryPoolSupportedHandleTypes device attribute. This attribute was added in CUDA 11.3, and older drivers will return cudaErrorInvalidValue when this attribute is queried.
int driverVersion = 0;
int deviceSupportsMemoryPools = 0;
int poolSupportedHandleTypes = 0;
cudaDriverGetVersion(&driverVersion);
if (driverVersion >= 11020) {
    cudaDeviceGetAttribute(&deviceSupportsMemoryPools,
                           cudaDevAttrMemoryPoolsSupported, device);
}
if (deviceSupportsMemoryPools != 0) {
    // `device` supports the Stream-Ordered Memory Allocator
}

if (driverVersion >= 11030) {
    cudaDeviceGetAttribute(&poolSupportedHandleTypes,
                           cudaDevAttrMemoryPoolSupportedHandleTypes, device);
}
if (poolSupportedHandleTypes & cudaMemHandleTypePosixFileDescriptor) {
    // Pools on the specified device can be created with posix file
    // descriptor-based IPC
}
Performing the driver version check before the query avoids hitting a cudaErrorInvalidValue error on drivers where the attribute was not yet defined. One can use cudaGetLastError to clear the error instead of avoiding it.
4.3.4.2 Physical Page Caching Behavior
By default, the allocator tries to minimize the physical memory owned by a pool. To minimize the OS calls to allocate and free physical memory, applications must configure a memory footprint for each pool. Applications can do this with the release threshold attribute (cudaMemPoolAttrReleaseThreshold).

The release threshold is the amount of memory in bytes a pool should hold onto before trying to release memory back to the OS. When more than the release threshold bytes of memory are held by the memory pool, the allocator will try to release memory back to the OS on the next call to stream, event or device synchronize. Setting the release threshold to UINT64_MAX will prevent the driver from attempting to shrink the pool after every synchronization.
cuuint64_t setVal = UINT64_MAX;
cudaMemPoolSetAttribute(memPool, cudaMemPoolAttrReleaseThreshold, &setVal);
Applications that set cudaMemPoolAttrReleaseThreshold high enough to effectively disable memory pool shrinking may wish to explicitly shrink a memory pool's memory footprint. cudaMemPoolTrimTo allows applications to do so. When trimming a memory pool's footprint, the minBytesToKeep parameter allows an application to hold onto a specified amount of memory, for example the amount it expects to need in a subsequent phase of execution.
cuuint64_t setVal = UINT64_MAX;
cudaMemPoolSetAttribute(memPool, cudaMemPoolAttrReleaseThreshold, &setVal);

// application phase needing a lot of memory from the stream-ordered allocator
for (i = 0; i < 10; i++) {
    for (j = 0; j < 10; j++) {
        cudaMallocAsync(&ptrs[j], size[j], stream);
    }
    kernel<<<..., stream>>>(ptrs, ...);
    for (j = 0; j < 10; j++) {
        cudaFreeAsync(ptrs[j], stream);
    }
}

// Process does not need as much memory for the next phase.
// Synchronize so that the trim operation will know that the allocations are no
// longer in use.
cudaStreamSynchronize(stream);
cudaMemPoolTrimTo(mempool, 0);

// Some other process/allocation mechanism can now use the physical memory
// released by the trimming operation.
4.3.4.3 Resource Usage Statistics
Querying the cudaMemPoolAttrReservedMemCurrent attribute of a pool reports the current total physical GPU memory consumed by the pool. Querying the cudaMemPoolAttrUsedMemCurrent of a pool returns the total size of all of the memory allocated from the pool and not available for reuse.

The cudaMemPoolAttr*MemHigh attributes are watermarks recording the maximum value achieved by the respective cudaMemPoolAttr*MemCurrent attribute since the last reset. They can be reset to the current value by using the cudaMemPoolSetAttribute API.
// sample helper functions for getting the usage statistics in bulk
struct usageStatistics {
    cuuint64_t reserved;
    cuuint64_t reservedHigh;
    cuuint64_t used;
    cuuint64_t usedHigh;
};

void getUsageStatistics(cudaMemPool_t memPool, struct usageStatistics *statistics)
{
    cudaMemPoolGetAttribute(memPool, cudaMemPoolAttrReservedMemCurrent,
                            &statistics->reserved);
    cudaMemPoolGetAttribute(memPool, cudaMemPoolAttrReservedMemHigh,
                            &statistics->reservedHigh);
    cudaMemPoolGetAttribute(memPool, cudaMemPoolAttrUsedMemCurrent,
                            &statistics->used);
    cudaMemPoolGetAttribute(memPool, cudaMemPoolAttrUsedMemHigh,
                            &statistics->usedHigh);
}

// resetting the watermarks will make them take on the current value.
void resetStatistics(cudaMemPool_t memPool)
{
    cuuint64_t value = 0;
    cudaMemPoolSetAttribute(memPool, cudaMemPoolAttrReservedMemHigh, &value);
    cudaMemPoolSetAttribute(memPool, cudaMemPoolAttrUsedMemHigh, &value);
}
4.3.4.4 Memory Reuse Policies
In order to service an allocation request, the driver attempts to reuse memory that was previously freed via cudaFreeAsync() before attempting to allocate more memory from the OS. For example, memory freed in a stream can be reused immediately in a subsequent allocation request on the same stream. When a stream is synchronized with the CPU, the memory that was previously freed in that stream becomes available for reuse for an allocation in any stream. Reuse policies can be applied to both default and explicit memory pools.

The stream-ordered allocator has a few controllable allocation policies. The pool attributes cudaMemPoolReuseFollowEventDependencies, cudaMemPoolReuseAllowOpportunistic, and cudaMemPoolReuseAllowInternalDependencies control these policies and are detailed below. These policies can be enabled or disabled through a call to cudaMemPoolSetAttribute. Upgrading to a newer CUDA driver may change, enhance, augment and/or reorder the enumeration of the reuse policies.
4.3.4.4.1 cudaMemPoolReuseFollowEventDependencies
Before allocating more physical GPU memory, the allocator examines dependency information established by CUDA events and tries to allocate from memory freed in another stream.
cudaMallocAsync(&ptr, size, originalStream);
kernel<<<..., originalStream>>>(ptr, ...);
cudaFreeAsync(ptr, originalStream);
cudaEventRecord(event, originalStream);

// waiting on the event that captures the free in another stream
// allows the allocator to reuse the memory to satisfy
// a new allocation request in the other stream when
// cudaMemPoolReuseFollowEventDependencies is enabled.
cudaStreamWaitEvent(otherStream, event);
cudaMallocAsync(&ptr2, size, otherStream);
4.3.4.4.2 cudaMemPoolReuseAllowOpportunistic
When the cudaMemPoolReuseAllowOpportunistic policy is enabled, the allocator examines freed allocations to see if the free operation's stream order semantic has been met, for example the stream has passed the point of execution indicated by the free operation. When this policy is disabled, the allocator will still reuse memory made available when a stream is synchronized with the CPU. Disabling this policy does not stop cudaMemPoolReuseFollowEventDependencies from applying.
cudaMallocAsync(&ptr, size, originalStream);
kernel<<<..., originalStream>>>(ptr, ...);
cudaFreeAsync(ptr, originalStream);

// after some time, the kernel finishes running
wait(10);

// When cudaMemPoolReuseAllowOpportunistic is enabled this allocation request
// can be fulfilled with the prior allocation based on the progress of
// originalStream.
cudaMallocAsync(&ptr2, size, otherStream);
4.3.4.4.3 cudaMemPoolReuseAllowInternalDependencies
Failing to allocate and map more physical memory from the OS, the driver will look for memory whose availability depends on another stream's pending progress. If such memory is found, the driver will insert the required dependency into the allocating stream and reuse the memory.
cudaMallocAsync(&ptr, size, originalStream);
kernel<<<..., originalStream>>>(ptr, ...);
cudaFreeAsync(ptr, originalStream);

// When cudaMemPoolReuseAllowInternalDependencies is enabled
// and the driver fails to allocate more physical memory, the driver may
// effectively perform a cudaStreamWaitEvent in the allocating stream
// to make sure that future work in 'otherStream' happens after the work
// in the original stream that would be allowed to access the original
// allocation.
cudaMallocAsync(&ptr2, size, otherStream);
4.3.4.4.4 Disabling Reuse Policies
While the controllable reuse policies improve memory reuse, users may want to disable them. Allowing opportunistic reuse (such as cudaMemPoolReuseAllowOpportunistic) introduces run-to-run variance in allocation patterns based on the interleaving of CPU and GPU execution. Internal dependency insertion (such as cudaMemPoolReuseAllowInternalDependencies) can serialize work in unexpected and potentially non-deterministic ways when the user would rather explicitly synchronize an event or stream on allocation failure.
4.3.4.5 Synchronization API Actions
One of the optimizations that comes with the allocator being part of the CUDA driver is integration with the synchronize APIs. When the user requests that the CUDA driver synchronize, the driver waits for asynchronous work to complete. Before returning, the driver will determine what frees the synchronization guaranteed to be completed. These allocations are made available for allocation regardless of the specified stream or disabled allocation policies. The driver also checks cudaMemPoolAttrReleaseThreshold here and releases any excess physical memory that it can.
4.3.5. Addendums

4.3.5.1 cudaMemcpyAsync Current Context/Device Sensitivity
In the current CUDA driver, any async memcpy involving memory from cudaMallocAsync should be done using the specified stream's context as the calling thread's current context. This is not necessary for cudaMemcpyPeerAsync, as the device primary contexts specified in the API are referenced instead of the current context.
4.3.5.2 cudaPointerGetAttributes Query
Invoking cudaPointerGetAttributes on an allocation after invoking cudaFreeAsync on it results in undefined behavior. Specifically, it does not matter if an allocation is still accessible from a given stream: the behavior is still undefined.
4.3.5.3 cudaGraphAddMemsetNode
cudaGraphAddMemsetNode does not work with memory allocated via the stream-ordered allocator. However, memsets of the allocations can be stream captured.
4.3.5.4 Pointer Attributes
The cudaPointerGetAttributes query works on stream-ordered allocations. Since stream-ordered allocations are not context associated, querying CU_POINTER_ATTRIBUTE_CONTEXT will succeed but return NULL in *data. The attribute CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL can be used to determine the location of the allocation: this can be useful when selecting a context for making p2h2p copies using cudaMemcpyPeerAsync. The attribute CU_POINTER_ATTRIBUTE_MEMPOOL_HANDLE was added in CUDA 11.3 and can be useful for debugging and for confirming which pool an allocation comes from before doing IPC.
4.3.5.5 CPU Virtual Memory
When using CUDA stream-ordered memory allocator APIs, avoid setting virtual memory limits with "ulimit -v", as this is not supported.
4.4. Cooperative Groups
4.4.1. Introduction
Cooperative Groups are an extension to the CUDA programming model for organizing groups of collaborating threads. Cooperative Groups allow developers to control the granularity at which threads are collaborating, helping them to express richer, more efficient parallel decompositions. Cooperative Groups also provide implementations of common parallel primitives like scan and parallel reduce.

Historically, the CUDA programming model has provided a single, simple construct for synchronizing cooperating threads: a barrier across all threads of a thread block, as implemented with the __syncthreads() intrinsic function. In an effort to express broader patterns of parallel interaction, many performance-oriented programmers have resorted to writing their own ad hoc and unsafe primitives for synchronizing threads within a single warp, or across sets of thread blocks running on a single GPU. Whilst the performance improvements achieved have often been valuable, this has resulted in an ever-growing collection of brittle code that is expensive to write, tune, and maintain over time and across GPU generations. Cooperative Groups provide a safe and future-proof mechanism for writing performant code.

The full Cooperative Groups API is available in the Cooperative Groups API.
4.4.2. Cooperative Group Handle & Member Functions
Cooperative Groups are managed via a Cooperative Group handle. The Cooperative Group handle allows participating threads to learn their position in the group, the group size, and other group information. Select group member functions are shown in the following table.

Table 10: Select Member Functions

Accessor          Returns
thread_rank()     The rank of the calling thread.
num_threads()     The total number of threads in the group.
thread_index()    A 3-dimensional index of the thread within the launched block.
dim_threads()     The 3D dimensions of the launched block in units of threads.

A complete list of member functions is available in the Cooperative Groups API.
4.4.3. Default Behavior / Groupless Execution
Groups representing the grid and thread blocks are implicitly created based on the kernel launch configuration. These "implicit" groups provide a starting point that developers can explicitly decompose into finer grained groups. Implicit groups can be accessed using the following methods:

Table 11: Cooperative Groups Implicitly Created by CUDA Runtime

Accessor               Group Scope
this_thread_block()    Returns the handle to a group containing all threads in the current thread block.
this_grid()            Returns the handle to a group containing all threads in the grid.
coalesced_threads()¹   Returns the handle to a group of currently active threads in a warp.
this_cluster()²        Returns the handle to a group of threads in the current cluster.

More information is available in the Cooperative Groups API.
4.4.3.1 Create Implicit Group Handles As Early As Possible
For best performance it is recommended that you create a handle for the implicit group upfront (as early as possible, before any branching has occurred) and use that handle throughout the kernel.

4.4.3.2 Only Pass Group Handles by Reference
It is recommended that you pass group handles by reference when passing them into functions. Group handles must be initialized at declaration time, as there is no default constructor. Copy-constructing group handles is discouraged.
4.4.4. Creating Cooperative Groups
Groups are created by partitioning a parent group into subgroups. When a group is partitioned, a group handle is created to manage the resulting subgroup. The following partitioning operations are available to developers:

Table 12: Cooperative Group Partitioning Operations

Partition Type      Description
tiled_partition     Divides the parent group into a series of fixed-size subgroups arranged in a one-dimensional, row-major format.
stride_partition    Divides the parent group into equally-sized subgroups where threads are assigned to subgroups in a round-robin manner.
labeled_partition   Divides the parent group into one-dimensional subgroups based on a conditional label, which can be any integral type.
binary_partition    Specialized form of labeled partitioning where the label can only be "0" or "1".

¹ The coalesced_threads() operator returns the set of active threads at that point in time, and makes no guarantee about which threads are returned (as long as they are active) or that they will stay coalesced throughout execution.
² this_cluster() assumes a 1x1x1 cluster when a non-cluster grid is launched. Requires Compute Capability 9.0 or greater.

The following example shows how a tiled partition is created:
namespace cg = cooperative_groups;

// Obtain the current thread's cooperative group
cg::thread_block my_group = cg::this_thread_block();

// Partition the cooperative group into tiles of size 8
cg::thread_block_tile<8> my_subgroup = cg::tiled_partition<8>(my_group);

// do work as my_subgroup
The best partitioning strategy to use depends on the context. More information is available in the Cooperative Groups API.
4.4.4.1 Avoiding Group Creation Hazards
Partitioning a group is a collective operation and all threads in the group must participate. If the group was created in a conditional branch that not all threads reach, this can lead to deadlocks or data corruption.
4.4.5. Synchronization
Prior to the introduction of Cooperative Groups, the CUDA programming model only allowed synchronization between thread blocks at a kernel completion boundary. Cooperative Groups allow developers to synchronize groups of cooperating threads at different granularities.
4.4.5.1 Sync
You can synchronize a group by calling the collective sync() function. Like __syncthreads(), the sync() function makes the following guarantees:
▶ All memory accesses (e.g., reads and writes) made by threads in the group before the synchronization point are visible to all threads in the group after the synchronization point.
▶ All threads in the group reach the synchronization point before any thread is allowed to proceed beyond it.
The following example shows a cooperative_groups::sync() call that is equivalent to __syncthreads().
namespace cg = cooperative_groups;

cg::thread_block my_group = cg::this_thread_block();

// Synchronize threads in the block
cg::sync(my_group);
Cooperative Groups can be used to synchronize the entire grid. As of CUDA 13, Cooperative Groups can no longer be used for multi-device synchronization. For details see the Large Scale Groups section.

More information about synchronization is available in the Cooperative Groups API.
4.4.5.2 Barriers
Cooperative Groups provides a barrier API similar to cuda::barrier that can be used for more advanced synchronization. The Cooperative Groups barrier API differs from cuda::barrier in a few key ways:
▶ Cooperative Groups barriers are automatically initialized.
▶ All threads in the group must arrive and wait at the barrier once per phase.
▶ barrier_arrive returns an arrival_token object that must be passed into the corresponding barrier_wait, where it is consumed and cannot be used again.

Programmers must take care to avoid hazards when using Cooperative Groups barriers:
▶ No collective operations can be used by a group after calling barrier_arrive and before calling barrier_wait.
▶ barrier_wait only guarantees that all threads in the group have called barrier_arrive. barrier_wait does NOT guarantee that all threads have called barrier_wait.
namespace cg = cooperative_groups;

auto block = cg::this_thread_block();
auto cluster = cg::this_cluster();
auto token = cluster.barrier_arrive();

// Optional: Do some local processing to hide the synchronization latency
local_processing(block);

// Make sure all other blocks in the cluster are running and have initialized
// shared data before accessing dsmem
cluster.barrier_wait(std::move(token));
4.4.6. Collective Operations
Cooperative Groups includes a set of collective operations that can be performed by a group of threads. These operations require participation of all threads in the specified group in order to complete the operation.

All threads in the group must pass the same values for corresponding arguments to each collective call, unless different values are explicitly allowed in the Cooperative Groups API. Otherwise the behavior of the call is undefined.
4.4.6.1 Reduce
The reduce function is used to perform a parallel reduction on the data provided by each thread in the specified group. The type of reduction must be specified by providing one of the operators shown in the following table.
Table 13: Cooperative Groups Reduction Operators

| Operator | Returns |
| -------- | ------- |
| plus | Sum of all values in group |
| less | Minimum value |
| greater | Maximum value |
| bit_and | Bitwise AND reduction |
| bit_or | Bitwise OR reduction |
| bit_xor | Bitwise XOR reduction |
Hardware acceleration is used for reductions when available (requires Compute Capability 8.0 or greater). A software fallback is available for older hardware where hardware acceleration is not available. Only 4-byte types are accelerated by hardware.
More information about reductions is available in the Cooperative Groups API.

The following example shows how to use cooperative_groups::reduce() to perform a block-wide sum reduction.
namespace cg = cooperative_groups;

cg::thread_block my_group = cg::this_thread_block();

int val = data[threadIdx.x];
int sum = cg::reduce(my_group, val, cg::plus<int>());

// Store the result from the reduction
if (my_group.thread_rank() == 0) {
    result[blockIdx.x] = sum;
}
4.4.6.2 Scans
Cooperative Groups includes implementations of inclusive_scan and exclusive_scan that can be used on arbitrary group sizes. The functions perform a scan operation on the data provided by each thread in the specified group.

Programmers can optionally specify a reduction operator, as listed in the Reduction Operators table above.
namespace cg = cooperative_groups;

cg::thread_block my_group = cg::this_thread_block();

int val = data[my_group.thread_rank()];
int exclusive_sum = cg::exclusive_scan(my_group, val, cg::plus<int>());
result[my_group.thread_rank()] = exclusive_sum;
More information about scans is available in the Cooperative Groups Scan API.
4.4.6.3 Invoke One
Cooperative Groups provides an invoke_one function for use when a single thread must perform a serial portion of work on behalf of a group.
▶ invoke_one selects a single arbitrary thread from the calling group and uses that thread to call the supplied invocable function using the supplied arguments.
▶ invoke_one_broadcast is the same as invoke_one except the result of the call is also broadcast to all threads in the group.

The thread selection mechanism is not guaranteed to be deterministic.

The following example shows basic invoke_one utilization.
namespace cg = cooperative_groups;

cg::thread_block my_group = cg::this_thread_block();

// Ensure only one thread in the thread block prints the message
cg::invoke_one(my_group, []() {
    printf("Hello from one thread in the block!");
});

// Synchronize to make sure all threads wait until the message is printed
cg::sync(my_group);
Communication or synchronization within the calling group is not allowed inside the invocable function. Communication with threads outside of the calling group is allowed.
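The following sketch illustrates invoke_one_broadcast for a common pattern: one thread reserves space in a global buffer on behalf of the whole group, and the base index is shared with every member. The kernel and buffer names are illustrative, not from this guide.

```cpp
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void reserve_slots(int *global_counter, int *slots) {
    cg::thread_block block = cg::this_thread_block();

    // One arbitrary thread performs the atomic reservation for the whole
    // block; the returned base index is broadcast to every thread in the group.
    int base = cg::invoke_one_broadcast(block, [&]() {
        return atomicAdd(global_counter, block.num_threads());
    });

    // Every thread now writes into its own reserved slot.
    slots[base + block.thread_rank()] = block.thread_rank();
}
```

Note that the atomicAdd happens inside the invocable, which is permitted because it communicates with memory outside the calling group, not between threads of the group.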
4.4.7. Asynchronous Data Movement
Cooperative Groups functionality in CUDA provides memcpy_async, a way to perform asynchronous memory copies between global memory and shared memory. memcpy_async is particularly useful for optimizing memory transfers and overlapping computation with data transfer to improve performance.

The memcpy_async function is used to start an asynchronous load from global memory to shared memory. memcpy_async is intended to be used like a "prefetch" where data is loaded before it is needed.

The wait function forces all threads in a group to wait until the asynchronous memory transfer is completed. wait must be called by all threads in the group before the data can be accessed in shared memory.
The following example shows how to use memcpy_async and wait to prefetch data.
namespace cg = cooperative_groups;

cg::thread_block my_group = cg::this_thread_block();
extern __shared__ int shared_data[];

// Perform an asynchronous, group-collective copy from global memory to shared memory
cg::memcpy_async(my_group, shared_data, input, my_group.num_threads() * sizeof(int));

// Hide latency by doing work here. Cannot use shared_data yet.

// Wait for the asynchronous copy to complete
cg::wait(my_group);

// Prefetched data is now available
See the Cooperative Groups API for more information.
4.4.7.1 Memcpy Async Alignment Requirements

memcpy_async is only asynchronous if the source is global memory, the destination is shared memory, and both are at least 4-byte aligned. For best performance, an alignment of 16 bytes for both shared memory and global memory is recommended.
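One way to assert the recommended 16-byte alignment is the cuda::aligned_size_t shape type from libcu++. A minimal sketch follows, assuming a 256-thread block and that the input pointer really is 16-byte aligned (kernel and buffer names are illustrative):

```cpp
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
#include <cuda/barrier>  // provides cuda::aligned_size_t
namespace cg = cooperative_groups;

__global__ void aligned_prefetch(const int *input, int *output) {
    // Shared memory declared so that it is 16-byte aligned
    __shared__ alignas(16) int shared_data[256];

    cg::thread_block block = cg::this_thread_block();

    // Passing cuda::aligned_size_t<16> asserts that source, destination, and
    // size are all multiples of 16 bytes, enabling the fastest copy path.
    cg::memcpy_async(block, shared_data, input,
                     cuda::aligned_size_t<16>(sizeof(shared_data)));

    cg::wait(block);
    output[block.thread_rank()] = shared_data[block.thread_rank()];
}
```

If the alignment assertion does not actually hold at run time, the behavior is undefined, so only use the shape argument when the alignment is guaranteed.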
4.4.8. Large Scale Groups
Cooperative Groups allows for large groups that span the entire grid. All Cooperative Groups functionality described previously is available to these large groups, with one notable exception: synchronizing the entire grid requires using the cudaLaunchCooperativeKernel runtime launch API.

Multi-device launch APIs and related references for Cooperative Groups have been removed as of CUDA 13.
4.4.8.1 When to use cudaLaunchCooperativeKernel
cudaLaunchCooperativeKernel is a CUDA runtime API function used to launch a single-device kernel that employs cooperative groups, specifically designed for executing kernels that require inter-block synchronization. This function ensures that all threads in the kernel can synchronize and cooperate across the entire grid, which is not possible with traditional CUDA kernels that only allow synchronization within individual thread blocks. cudaLaunchCooperativeKernel ensures that the kernel launch is atomic, i.e., if the API call succeeds, then the provided number of thread blocks will launch on the specified device.

It is good practice to first ensure the device supports cooperative launches by querying the device attribute cudaDevAttrCooperativeLaunch:
int dev = 0;
int supportsCoopLaunch = 0;
cudaDeviceGetAttribute(&supportsCoopLaunch, cudaDevAttrCooperativeLaunch, dev);
which will set supportsCoopLaunch to 1 if the property is supported on device 0. Only devices with compute capability 6.0 and higher are supported. In addition, you need to be running on one of these:
▶ The Linux platform without MPS
▶ The Linux platform with MPS and on a device with compute capability 7.0 or higher
▶ The latest Windows platform
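Putting the pieces together, here is a minimal sketch of a cooperative launch (the kernel body, block size, and helper names are illustrative, not from this guide). Kernel arguments are passed to cudaLaunchCooperativeKernel as an array of pointers, and the grid must be sized so that all blocks can be resident simultaneously:

```cpp
#include <cooperative_groups.h>
#include <cuda_runtime.h>
namespace cg = cooperative_groups;

__global__ void coop_kernel(int *data) {
    cg::grid_group grid = cg::this_grid();
    // data is assumed to hold one element per thread in the grid
    data[grid.thread_rank()] += 1;
    grid.sync();  // grid-wide sync, valid only under a cooperative launch
    // ... work that depends on every block having finished the phase above ...
}

void launch(int *d_data, cudaStream_t stream) {
    // Size the grid so all blocks can be co-resident on the device
    int device = 0, numSms = 0, blocksPerSm = 0;
    cudaDeviceGetAttribute(&numSms, cudaDevAttrMultiProcessorCount, device);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, coop_kernel, 128, 0);

    void *args[] = { &d_data };
    cudaLaunchCooperativeKernel((void *)coop_kernel,
                                dim3(numSms * blocksPerSm), dim3(128),
                                args, 0, stream);
}
```

Exceeding the maximum co-resident block count causes the launch to fail, which is why the occupancy query above is used to bound the grid size.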
4.5. Programmatic Dependent Launch and
Synchronization
The Programmatic Dependent Launch mechanism allows a dependent secondary kernel to launch before the primary kernel it depends on in the same CUDA stream has finished executing. Available starting with devices of compute capability 9.0, this technique can provide performance benefits when the secondary kernel can complete significant work that does not depend on the results of the primary kernel.
4.5.1. Background
A CUDA application utilizes the GPU by launching and executing multiple kernels on it. A typical GPU activity timeline is shown in Figure 39.

Figure 39: GPU activity timeline

Here, secondary_kernel is launched after primary_kernel finishes its execution. Serialized execution is usually necessary because secondary_kernel depends on result data produced by primary_kernel. If secondary_kernel has no dependency on primary_kernel, both of them can be launched concurrently by using CUDA streams. Even if secondary_kernel is dependent on primary_kernel, there is some potential for concurrent execution. For example, almost all kernels have some sort of preamble section during which tasks such as zeroing buffers or loading constant values are performed.
Figure 40: Preamble section of secondary_kernel
Figure 40 demonstrates the portion of secondary_kernel that could be executed concurrently without impacting the application. Note that concurrent launch also allows us to hide the launch latency of secondary_kernel behind the execution of primary_kernel.
Figure 41: Concurrent execution of primary_kernel and secondary_kernel
The concurrent launch and execution of secondary_kernel shown in Figure 41 is achievable using Programmatic Dependent Launch.

Programmatic Dependent Launch introduces changes to the CUDA kernel launch APIs as explained in the following section. These APIs require at least compute capability 9.0 to provide overlapping execution.
4.5.2. API Description
In Programmatic Dependent Launch, a primary and a secondary kernel are launched in the same CUDA stream. The primary kernel should execute cudaTriggerProgrammaticLaunchCompletion with all thread blocks when it's ready for the secondary kernel to launch. The secondary kernel must be launched using the extensible launch API as shown.
__global__ void primary_kernel() {
    // Initial work that should finish before starting secondary kernel

    // Trigger the secondary kernel
    cudaTriggerProgrammaticLaunchCompletion();

    // Work that can coincide with the secondary kernel
}

__global__ void secondary_kernel()
{
    // Independent work

    // Will block until all primary kernels the secondary kernel is dependent on
    // have completed and flushed results to global memory
    cudaGridDependencySynchronize();

    // Dependent work
}

cudaLaunchAttribute attribute[1];
attribute[0].id = cudaLaunchAttributeProgrammaticStreamSerialization;
attribute[0].val.programmaticStreamSerializationAllowed = 1;
configSecondary.attrs = attribute;
configSecondary.numAttrs = 1;

primary_kernel<<<grid_dim, block_dim, 0, stream>>>();
cudaLaunchKernelEx(&configSecondary, secondary_kernel);
When the secondary kernel is launched using the cudaLaunchAttributeProgrammaticStreamSerialization attribute, the CUDA driver may safely launch the secondary kernel early, without waiting for the completion and memory flush of the primary kernel before launching the secondary.

The CUDA driver can launch the secondary kernel when all primary thread blocks have launched and executed cudaTriggerProgrammaticLaunchCompletion. If the primary kernel doesn't execute the trigger, it implicitly occurs after all thread blocks in the primary kernel exit.

In either case, the secondary thread blocks might launch before data written by the primary kernel is visible. As such, when the secondary kernel is configured with Programmatic Dependent Launch, it must always use cudaGridDependencySynchronize or other means to verify that the result data from the primary is available.

Please note that these methods provide the opportunity for the primary and secondary kernels to execute concurrently; however, this behavior is opportunistic and not guaranteed to lead to concurrent kernel execution. Reliance on concurrent execution in this manner is unsafe and can lead to deadlock.
4.5.3. Use in CUDA Graphs
Programmatic Dependent Launch can be used in CUDA Graphs via stream capture or directly via edge data. To program this feature in a CUDA Graph with edge data, use a cudaGraphDependencyType value of cudaGraphDependencyTypeProgrammatic on an edge connecting two kernel nodes. This edge type makes the upstream kernel visible to a cudaGridDependencySynchronize() in the downstream kernel. This type must be used with an outgoing port of either cudaGraphKernelNodePortLaunchCompletion or cudaGraphKernelNodePortProgrammatic.

The resulting graph equivalents for stream capture are as follows:
Stream code (abbreviated):

cudaLaunchAttribute attribute;
attribute.id = cudaLaunchAttributeProgrammaticStreamSerialization;
attribute.val.programmaticStreamSerializationAllowed = 1;

Resulting graph edge:

cudaGraphEdgeData edgeData;
edgeData.type = cudaGraphDependencyTypeProgrammatic;
edgeData.from_port = cudaGraphKernelNodePortProgrammatic;

Stream code (abbreviated):

cudaLaunchAttribute attribute;
attribute.id = cudaLaunchAttributeProgrammaticEvent;
attribute.val.programmaticEvent.triggerAtBlockStart = 0;

Resulting graph edge:

cudaGraphEdgeData edgeData;
edgeData.type = cudaGraphDependencyTypeProgrammatic;
edgeData.from_port = cudaGraphKernelNodePortProgrammatic;

Stream code (abbreviated):

cudaLaunchAttribute attribute;
attribute.id = cudaLaunchAttributeProgrammaticEvent;
attribute.val.programmaticEvent.triggerAtBlockStart = 1;

Resulting graph edge:

cudaGraphEdgeData edgeData;
edgeData.type = cudaGraphDependencyTypeProgrammatic;
edgeData.from_port = cudaGraphKernelNodePortLaunchCompletion;
4.6. Green Contexts
A green context (GC) is a lightweight context associated, from its creation, with a set of specific GPU resources. Users can partition GPU resources, currently streaming multiprocessors (SMs) and work queues (WQs), during green context creation, so that GPU work targeting a green context can only use its provisioned SMs and work queues. Doing so can be beneficial in reducing, or better controlling, interference due to use of common resources. An application can have multiple green contexts.

Using green contexts does not require any GPU code (kernel) changes, just small host-side changes (e.g., green context creation and stream creation for this green context). The green context functionality can be useful in various scenarios. For example, it can help ensure some SMs are always available for a latency-sensitive kernel to start executing, assuming no other constraints, or provide a quick way to test the effect of using fewer SMs without any kernel modifications.
Green context support first became available via the CUDA Driver API. Starting from CUDA 13.1, green contexts are exposed in the CUDA runtime via the execution context (EC) abstraction. Currently, an execution context can correspond to either the primary context (the context runtime API users have always implicitly interacted with) or a green context. This section will use the terms execution context and green context interchangeably when referring to a green context.

With the runtime exposure of green contexts, using the CUDA runtime API directly is strongly recommended. This section will also solely use the CUDA runtime API.
The remainder of this section is organized as follows: Section 4.6.1 provides a motivating example, Section 4.6.2 highlights ease of use, and Section 4.6.3 presents the device resource and resource descriptor structs. Section 4.6.4 explains how to create a green context, Section 4.6.5 how to launch work that targets it, and Section 4.6.6 highlights some additional green context APIs. Finally, Section 4.6.7 wraps up with an example.
4.6.1. Motivation / When to Use
When launching a CUDA kernel, the user has no direct control over the number of SMs that kernel will execute on. One can only indirectly influence this by changing the kernel's launch geometry or anything that can affect the kernel's maximum number of active thread blocks per SM. Additionally, when multiple kernels execute in parallel on the GPU (kernels running on different CUDA streams or as part of a CUDA graph), they may also contend for the same SM resources.

There are, however, use cases where the user needs to ensure there are always GPU resources available for latency-sensitive work to start, and thus complete, as soon as possible. Green contexts provide a way towards that by partitioning SM resources, so a given green context can only use specific SMs (the ones provisioned during its creation).

Figure 42 illustrates such an example. Assume an application where two independent kernels A and B run on two different non-blocking CUDA streams. Kernel A is launched first and starts executing, occupying all available SM resources. When, later in time, latency-sensitive kernel B is launched, no SM resources are available. As a result, kernel B can only start executing once kernel A ramps down, i.e., once thread blocks from kernel A finish executing. The first graph illustrates this scenario where critical work B gets delayed. The y-axis shows the percentage of SMs occupied and the x-axis depicts time.
Using green contexts, one could partition the GPU's SMs, so that green context A, targeted by kernel A, has access to some SMs of the GPU, while green context B, targeted by kernel B, has access to the remaining SMs. In this setting, kernel A can only use the SMs provisioned for green context A, irrespective of its launch configuration. As a result, when critical kernel B gets launched, it is guaranteed that there will be available SMs for it to start executing immediately, barring any other resource constraints. As the second graph in Figure 42 illustrates, even though the duration of kernel A may increase, latency-sensitive work B will no longer be delayed due to unavailable SMs. The figure shows green context A provisioned with an SM count equivalent to 80% of the GPU's SMs for illustration purposes.

This behavior can be achieved without any code modifications to kernels A and B. One simply needs to ensure they are launched on CUDA streams belonging to the appropriate green contexts. The number of SMs each green context will have access to should be decided by the user during green context creation on a per-case basis.
Work Queues:
Figure 42: Motivation: GCs' static resource partitioning enables latency-sensitive work B to start and complete sooner

Streaming multiprocessors are one resource type that can be provisioned for a green context. Another resource type is work queues. Think of a work queue as a black-box resource abstraction, which can also influence GPU work execution concurrency, along with other factors. If independent GPU work tasks (e.g., kernels submitted on different CUDA streams) map to the same work queue, a false dependence between these tasks may be introduced, which can lead to their serialized execution. The user can influence the upper limit of work queues on the GPU via the CUDA_DEVICE_MAX_CONNECTIONS environment variable (see Section 5.2, Section 3.1).
Building on top of the previous example, assume work B maps to the same work queue as work A. In that case, even if SM resources are available (the green contexts case), work B may still need to wait for work A to complete in its entirety. Similar to SMs, the user has no direct control over the specific work queues that may be used under the hood. But green contexts allow the user to express the maximum concurrency they would expect, in terms of the expected number of concurrent stream-ordered workloads. The driver can then use this value as a hint to try to prevent work from different execution contexts from using the same work queue(s), thus preventing unwanted interference across execution contexts.
Attention

Even when different SM resources and work queues are provisioned per green context, concurrent execution of independent GPU work is not guaranteed. It is best to think of all the techniques described under the Green Contexts section as removing factors which can prevent concurrent execution (i.e., reducing potential interference).
Green Contexts versus MIG or MPS

For completeness, this section briefly compares green contexts with two other resource partitioning mechanisms: MIG (Multi-Instance GPU) and MPS (Multi-Process Service).

MIG statically partitions a MIG-supported GPU into multiple MIG instances ("smaller GPUs"). This partitioning has to happen before the launch of an application, and different applications can use different MIG instances. Using MIG can be beneficial for users whose applications consistently underutilize the available GPU resources; an issue more pronounced as GPUs get bigger. With MIG, users can run these different applications on different MIG instances, thus improving GPU utilization. MIG can be attractive for cloud service providers (CSPs) not only for the increased GPU utilization for such applications, but also for the quality of service (QoS) and isolation it can provide across clients running on different MIG instances. Please refer to the MIG documentation linked above for more details.
But using MIG cannot address the problematic scenario described earlier, where critical work B is delayed because all SM resources are occupied by other GPU work from the same application. This issue can still exist for an application running on a single MIG instance. To address it, one can use green contexts alongside MIG. In that case, the SM resources available for partitioning would be the resources of the given MIG instance.
MPS primarily targets different processes (e.g., MPI programs), allowing them to run on the GPU at the same time without time-slicing. It requires an MPS daemon to be running before the application is launched. By default, MPS clients will contend for all available SM resources of the GPU or the MIG instance they are running on. In this multiple-client-processes setting, MPS can support dynamic partitioning of SM resources, using the active thread percentage option, which places an upper limit on the percentage of SMs an MPS client process can use. Unlike green contexts, the active thread percentage partitioning happens with MPS at the process level, and the percentage is typically specified by an environment variable before the application is launched. The MPS active thread percentage signifies that a given client application cannot use more than x% of a GPU's SMs, let that be N SMs. However, these SMs can be any N SMs of the GPU, which can also vary over time. On the other hand, a green context provisioned with N SMs during its creation can only use these specific N SMs.
Starting with CUDA 13.1, MPS also supports static partitioning, if it is explicitly enabled when starting the MPS control daemon. With static partitioning, the user has to specify the static partition an MPS client process can use when the application is launched. Dynamic sharing with active thread percentage is no longer applicable in that case. A key difference between MPS in static partitioning mode and green contexts is that MPS targets different processes, while green contexts are applicable within a single process too. Also, contrary to green contexts, MPS with static partitioning does not allow oversubscription of SM resources.
With MPS, programmatic partitioning of SM resources is also possible for a CUDA context created via the cuCtxCreate driver API, with execution affinity. This programmatic partitioning allows different client CUDA contexts from one or more processes to each use up to a specified number of SMs. As with the active thread percentage partitioning, these SMs can be any SMs of the GPU and can vary over time, unlike the green contexts case. This option is possible even under the presence of static MPS partitioning. Please note that creating a green context is much more lightweight in comparison to an MPS context, as many underlying structures are owned by the primary context and thus shared.
4.6.2. Green Contexts: Ease of use
To highlight how easy it is to use green contexts, assume you have the following code snippet that creates two CUDA streams and then calls a function that launches kernels via <<<>>> on these CUDA streams. As discussed earlier, other than changing the kernels' launch geometries, one cannot influence how many SMs these kernels can use.
int gpu_device_index = 0; // GPU ordinal
CUDA_CHECK(cudaSetDevice(gpu_device_index));

cudaStream_t strm1, strm2;
CUDA_CHECK(cudaStreamCreateWithFlags(&strm1, cudaStreamNonBlocking));
CUDA_CHECK(cudaStreamCreateWithFlags(&strm2, cudaStreamNonBlocking));

// No control over how many SMs kernel(s) running on each stream can use
code_that_launches_kernels_on_streams(strm1, strm2); // what is abstracted in this function + the kernels is the vast majority of your code

// cleanup code not shown
Starting with CUDA 13.1, one can control the number of SMs a given kernel can have access to, using green contexts. The code snippet below shows how easy it is to do that. With a few extra lines and without any kernel modifications, you can control the SM resources kernel(s) launched on these different streams can use.
int gpu_device_index = 0; // GPU ordinal
CUDA_CHECK(cudaSetDevice(gpu_device_index));

/* ------------------ Code required to create green contexts ------------------ */

// Get all available GPU SM resources
cudaDevResource initial_GPU_SM_resources {};
CUDA_CHECK(cudaDeviceGetDevResource(gpu_device_index, &initial_GPU_SM_resources,
                                    cudaDevResourceTypeSm));

// Split SM resources. This example creates one group with 16 SMs and one with 8,
// assuming your GPU has >= 24 SMs
cudaDevSmResource result[2] {{}, {}};
cudaDevSmResourceGroupParams group_params[2] = {
    {.smCount=16, .coscheduledSmCount=0, .preferredCoscheduledSmCount=0, .flags=0},
    {.smCount=8,  .coscheduledSmCount=0, .preferredCoscheduledSmCount=0, .flags=0}};
CUDA_CHECK(cudaDevSmResourceSplit(&result[0], 2, &initial_GPU_SM_resources,
                                  nullptr, 0, &group_params[0]));

// Generate resource descriptors for each resource
cudaDevResourceDesc_t resource_desc1 {};
cudaDevResourceDesc_t resource_desc2 {};
CUDA_CHECK(cudaDevResourceGenerateDesc(&resource_desc1, &result[0], 1));
CUDA_CHECK(cudaDevResourceGenerateDesc(&resource_desc2, &result[1], 1));

// Create green contexts
cudaExecutionContext_t my_green_ctx1 {};
cudaExecutionContext_t my_green_ctx2 {};
CUDA_CHECK(cudaGreenCtxCreate(&my_green_ctx1, resource_desc1, gpu_device_index, 0));
CUDA_CHECK(cudaGreenCtxCreate(&my_green_ctx2, resource_desc2, gpu_device_index, 0));

/* ------------------ Modified code ------------------ */

// You just need to use a different CUDA API to create the streams
cudaStream_t strm1, strm2;
CUDA_CHECK(cudaExecutionCtxStreamCreate(&strm1, my_green_ctx1, cudaStreamDefault, 0));
CUDA_CHECK(cudaExecutionCtxStreamCreate(&strm2, my_green_ctx2, cudaStreamDefault, 0));

/* ------------------ Unchanged code ------------------ */

// No need to modify any code in this function or in your kernel(s).
// Reminder: what is abstracted in this function + kernels is the vast majority of your code
// Now kernel(s) running on stream strm1 will use at most 16 SMs and kernel(s) on strm2 at most 8 SMs.
code_that_launches_kernels_on_streams(strm1, strm2);

// cleanup code not shown
Various execution context APIs, some of which were shown in the previous example, take an explicit cudaExecutionContext_t handle and thus ignore the context that is current to the calling thread. Until now, CUDA runtime users who did not use the driver API would by default only interact with the primary context that is implicitly set as current to a thread via cudaSetDevice(). This shift to explicit context-based programming provides easier-to-understand semantics and can have additional benefits compared to the previous implicit context-based programming that relied on thread-local state (TLS).

The following sections will explain all the steps shown in the previous code snippet in detail.
4.6.3. Green Contexts: Device Resource and Resource Descriptor
At the heart of a green context is a device resource (cudaDevResource) tied to a specific GPU device. Resources can be combined and encapsulated into a descriptor (cudaDevResourceDesc_t). A green context only has access to the resources encapsulated into the descriptor used for its creation.

Currently the cudaDevResource data structure is defined as:
struct {
    enum cudaDevResourceType type;
    union {
        struct cudaDevSmResource sm;
        struct cudaDevWorkqueueConfigResource wqConfig;
        struct cudaDevWorkqueueResource wq;
    };
};
The supported valid resource types are cudaDevResourceTypeSm, cudaDevResourceTypeWorkqueueConfig and cudaDevResourceTypeWorkqueue, while cudaDevResourceTypeInvalid identifies an invalid resource type.

A valid device resource can be associated with:

▶ a specific set of streaming multiprocessors (SMs) (resource type cudaDevResourceTypeSm),
▶ a specific workqueue configuration (resource type cudaDevResourceTypeWorkqueueConfig) or
▶ a pre-existing workqueue resource (resource type cudaDevResourceTypeWorkqueue).
One can query if a given execution context or CUDA stream is associated with a cudaDevResource resource of a given type, using the cudaExecutionCtxGetDevResource and cudaStreamGetDevResource APIs respectively. Being associated with different types of device resources (e.g., SMs and workqueues) is also possible for an execution context, while a stream can only be associated with an SM-type resource.
A given GPU device has, by default, all three device resource types: an SM-type resource encompassing all the SMs of the GPU, a workqueue configuration resource encompassing all available workqueues and its corresponding workqueue resource. These resources can be retrieved via the cudaDeviceGetDevResource API.

Overview of relevant device resource structs

The different resource type structs have fields that are set either explicitly by the user or by a relevant CUDA API call. It is recommended to zero-initialize all device resource structs.
▶ An SM-type device resource (cudaDevSmResource) has the following relevant fields:
  ▶ unsigned int smCount: number of SMs available in this resource
  ▶ unsigned int minSmPartitionSize: minimum SM count required to partition this resource
  ▶ unsigned int smCoscheduledAlignment: number of SMs in the resource guaranteed to be co-scheduled on the same GPU processing cluster, which is relevant for thread block clusters. smCount is a multiple of this value when flags is zero.
  ▶ unsigned int flags: supported flags are 0 (default) and cudaDevSmResourceGroupBackfill (see cudaDevSmResourceGroup flags).

  The above fields will be set via either the appropriate split API (cudaDevSmResourceSplitByCount or cudaDevSmResourceSplit) used to create this SM-type resource or will be populated by the cudaDeviceGetDevResource API which retrieves the SM resources of a given GPU device. These fields should never be set directly by the user. See the next section for more details.
▶ A workqueue configuration device resource (cudaDevWorkqueueConfigResource) has the following relevant fields:
  ▶ int device: the device on which the workqueue resources are available
  ▶ unsigned int wqConcurrencyLimit: the number of stream-ordered workloads expected to avoid false dependencies
  ▶ enum cudaDevWorkqueueConfigScope sharingScope: the sharing scope for the workqueue resources. Supported values are: cudaDevWorkqueueConfigScopeDeviceCtx (default) and cudaDevWorkqueueConfigScopeGreenCtxBalanced. With the default option, all workqueue resources are shared across all contexts, while with the balanced option the driver tries to use non-overlapping workqueue resources across green contexts wherever possible, using the user-specified wqConcurrencyLimit as a hint.

  These fields need to be set by the user. There is no CUDA API similar to the split APIs that generates a workqueue configuration resource, with the exception of the workqueue configuration resource populated by the cudaDeviceGetDevResource API. That API can retrieve the workqueue configuration resources of a given GPU device.

▶ Finally, a pre-existing workqueue resource (cudaDevResourceTypeWorkqueue) has no fields that can be set by the user. As with the other resource types, cudaDeviceGetDevResource can retrieve the pre-existing workqueue resource of a given GPU device.
4.6.4. Green Context Creation Example

There are four main steps involved in green context creation:

▶ Step 1: Start with an initial set of resources, e.g., by fetching the available resources of the GPU
▶ Step 2: Partition the SM resources into one or more partitions (using one of the available split APIs).
▶ Step 3: Create a resource descriptor combining, if needed, different resources
▶ Step 4: Create a green context from the descriptor, provisioning its resources

After the green context has been created, you can create CUDA streams belonging to that green context. GPU work subsequently launched on such a stream, such as a kernel launched via <<< >>>, will only have access to this green context's provisioned resources. Libraries can also easily leverage green contexts, as long as the user passes a stream belonging to a green context to them. See Green Contexts - Launching work for more details.
4.6.4.1 Step 1: Get available GPU resources

The first step in green context creation is to get the available device resources and populate the cudaDevResource struct(s). There are currently three possible starting points: a device, an execution context or a CUDA stream.

The relevant CUDA runtime API function signatures are listed below:

▶ For a device: cudaError_t cudaDeviceGetDevResource(int device, cudaDevResource* resource, cudaDevResourceType type)
▶ For an execution context: cudaError_t cudaExecutionCtxGetDevResource(cudaExecutionContext_t ctx, cudaDevResource* resource, cudaDevResourceType type)
▶ For a stream: cudaError_t cudaStreamGetDevResource(cudaStream_t hStream, cudaDevResource* resource, cudaDevResourceType type)

All valid cudaDevResourceType values are permitted for each of these APIs, with the exception of cudaStreamGetDevResource which only supports an SM-type resource.
Usually, the starting point will be a GPU device. The code snippet below shows how to get the available SM resources of a given GPU device. After a successful cudaDeviceGetDevResource call, the user can review the number of SMs available in this resource.

int current_device = 0; // assume device ordinal of 0
CUDA_CHECK(cudaSetDevice(current_device));
cudaDevResource initial_SM_resources = {};
CUDA_CHECK(cudaDeviceGetDevResource(current_device /* GPU device */,
                                    &initial_SM_resources /* device resource to populate */,
                                    cudaDevResourceTypeSm /* resource type */));
std::cout << "Initial SM resources: " << initial_SM_resources.sm.smCount << " SMs" << std::endl; // number of available SMs
// Special fields relevant for partitioning (see Step 3 below)
std::cout << "Min. SM partition size: " << initial_SM_resources.sm.minSmPartitionSize << " SMs" << std::endl;
std::cout << "SM co-scheduled alignment: " << initial_SM_resources.sm.smCoscheduledAlignment << " SMs" << std::endl;
One can also get the available workqueue config. resources, as shown in the code snippet below.

int current_device = 0; // assume device ordinal of 0
CUDA_CHECK(cudaSetDevice(current_device));
cudaDevResource initial_WQ_config_resources = {};
CUDA_CHECK(cudaDeviceGetDevResource(current_device /* GPU device */,
                                    &initial_WQ_config_resources /* device resource to populate */,
                                    cudaDevResourceTypeWorkqueueConfig /* resource type */));
std::cout << "Initial WQ config. resources: " << std::endl;
std::cout << " - WQ concurrency limit: " << initial_WQ_config_resources.wqConfig.wqConcurrencyLimit << std::endl;
std::cout << " - WQ sharing scope: " << initial_WQ_config_resources.wqConfig.sharingScope << std::endl;

After a successful cudaDeviceGetDevResource call, the user can review the wqConcurrencyLimit for this resource. When the starting point is a GPU device, the wqConcurrencyLimit will match the value of the CUDA_DEVICE_MAX_CONNECTIONS environment variable or its default value.
4.6.4.2 Step 2: Partition SM resources

The second step in green context creation is to statically split the available cudaDevResource SM resources into one or more partitions, with potentially some SMs left over in a remaining partition. This partitioning is possible using the cudaDevSmResourceSplitByCount() or the cudaDevSmResourceSplit() API. The cudaDevSmResourceSplitByCount() API can only create one or more homogeneous partitions, plus a potential remaining partition, while the cudaDevSmResourceSplit() API can also create heterogeneous partitions, plus the potential remaining one. The subsequent sections describe the functionality of both APIs in detail. Both APIs are only applicable to SM-type device resources.

cudaDevSmResourceSplitByCount API

The cudaDevSmResourceSplitByCount runtime API signature is:

cudaError_t cudaDevSmResourceSplitByCount(cudaDevResource* result, unsigned int* nbGroups, const cudaDevResource* input, cudaDevResource* remaining, unsigned int useFlags, unsigned int minCount)
As Figure 43 highlights, the user requests to split the input SM-type device resource into *nbGroups homogeneous groups with minCount SMs each. However, the end result will contain a potentially updated *nbGroups number of homogeneous groups with N SMs each. The potentially updated *nbGroups will be less than or equal to the originally requested group number, while N will be equal to or greater than minCount. These adjustments may occur due to some granularity and alignment requirements, which are architecture specific.

Figure 43: SM resource split using the cudaDevSmResourceSplitByCount API
Table 30 lists the minimum SM partition size and the SM co-scheduled alignment for all the currently supported compute capabilities, for the default useFlags=0 case. One can also retrieve these values via the minSmPartitionSize and smCoscheduledAlignment fields of cudaDevSmResource, as shown in Step 1: Get available GPU resources. Some of these requirements can be lowered via a different useFlags value. Table 14 provides some relevant examples highlighting the difference between what is requested and the final result, along with an explanation. The table focuses on compute capability 9.0 (CC 9.0), where the minimum number of SMs per partition is 8 and the SM count has to be a multiple of 8, if useFlags is zero.
Table 14: Split functionality (actual results for GH200 with 132 SMs)

| Requested *nbGroups | minCount | useFlags | Actual *nbGroups with N SMs | Remaining SMs | Reason |
| --- | --- | --- | --- | --- | --- |
| 2 | 72 | 0 | 1 group of 72 SMs | 60 | cannot exceed 132 SMs |
| 6 | 11 | 0 | 6 groups of 16 SMs | 36 | multiple of 8 requirement |
| 6 | 11 | CU_DEV_SM_RESOURCE_SPLIT_IGNORE_SM_COSCHEDULING | 6 groups with 12 SMs each | 60 | lowered to multiple of 2 req. |
| 2 | 1 | 0 | 2 groups with 8 SMs each | 116 | min. 8 SMs requirement |
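The adjustments shown in Table 14 can be reproduced with simple arithmetic. The sketch below is an illustration only (the function and struct are hypothetical, not CUDA APIs), assuming a GH200-style device with 132 SMs and an SM count granularity of 8 (or 2 when co-scheduling alignment is ignored):

```cpp
#include <algorithm>

// Hypothetical illustration (not a CUDA API) of the rounding behavior that
// cudaDevSmResourceSplitByCount applies on a CC 9.0 device: minCount is
// rounded up to the SM count granularity, and the group count is capped so
// the groups fit in the input resource.
struct SplitOutcome {
    unsigned int nbGroups;      // possibly reduced group count
    unsigned int smsPerGroup;   // N: possibly increased per-group SM count
    unsigned int remainingSms;  // SMs left for the remaining partition
};

SplitOutcome simulateSplitByCount(unsigned int totalSms, unsigned int requestedGroups,
                                  unsigned int minCount, unsigned int granularity) {
    // Round minCount up to a multiple of the granularity (8 by default on
    // CC 9.0, 2 when the ignore-coscheduling flag is used), floored at the
    // granularity itself.
    unsigned int n = std::max(minCount, granularity);
    n = ((n + granularity - 1) / granularity) * granularity;
    unsigned int groups = std::min(requestedGroups, totalSms / n);
    return {groups, n, totalSms - groups * n};
}
```

Running this against the rows of Table 14 reproduces the listed outcomes.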
Here is a code snippet requesting to split the available SM resources into five groups of 8 SMs each:

cudaDevResource avail_resources = {};
// Code that has populated avail_resources not shown
unsigned int min_SM_count = 8;
unsigned int actual_split_groups = 5; // may be updated
cudaDevResource actual_split_result[5] = {{}, {}, {}, {}, {}};
cudaDevResource remaining_partition = {};
CUDA_CHECK(cudaDevSmResourceSplitByCount(&actual_split_result[0],
                                         &actual_split_groups,
                                         &avail_resources,
                                         &remaining_partition,
                                         0 /*useFlags*/,
                                         min_SM_count));
std::cout << "Split " << avail_resources.sm.smCount << " SMs into " << actual_split_groups << " groups " \
          << "with " << actual_split_result[0].sm.smCount << " each " \
          << "and a remaining group with " << remaining_partition.sm.smCount << " SMs" << std::endl;
Be aware that:

▶ one could use result=nullptr to query the number of groups that would be created
▶ one could set remaining=nullptr, if one does not care for the SMs of the remaining partition
▶ the remaining (leftover) partition does not have the same functional or performance guarantees as the homogeneous groups in result.
▶ useFlags is expected to be 0 in the default case, but values of cudaDevSmResourceSplitIgnoreSmCoscheduling and cudaDevSmResourceSplitMaxPotentialClusterSize are also supported
▶ any resulting cudaDevResource cannot be repartitioned without first creating a resource descriptor and a green context from it (i.e., steps 3 and 4 below)

Please refer to the cudaDevSmResourceSplitByCount runtime API reference for more details.
cudaDevSmResourceSplit API

As mentioned earlier, a single cudaDevSmResourceSplitByCount API call can only create homogeneous partitions, i.e., partitions with the same number of SMs, plus the remaining partition. This can be limiting for heterogeneous workloads, where work running on different green contexts has different SM count requirements. To achieve heterogeneous partitions with the split-by-count API, one would usually need to re-partition an existing resource by repeating Steps 1-4 (multiple times). Or, in some cases, one may be able to create homogeneous partitions each with SM count equal to the GCD (greatest common divisor) of all the heterogeneous partitions as part of step 2 and then merge the required number of them together as part of step 3. This last approach however is not recommended, as the CUDA driver may be able to create better partitions if larger sizes were requested up front.
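The GCD-based workaround just mentioned can be sketched in a few lines of plain C++. The helper and struct names are hypothetical; this only computes the merge plan and does not call any CUDA API:

```cpp
#include <numeric>
#include <vector>

// Illustration only: plan the GCD-based workaround for emulating a
// heterogeneous split with cudaDevSmResourceSplitByCount. Each desired
// partition would be assembled (in step 3) by merging
// desiredSms[i] / groupSize homogeneous groups of groupSize SMs each.
struct GcdPlan {
    unsigned int groupSize;                   // SMs per homogeneous group
    std::vector<unsigned int> groupsToMerge;  // groups merged per desired partition
};

GcdPlan planGcdWorkaround(const std::vector<unsigned int>& desiredSms) {
    unsigned int g = 0;
    for (unsigned int sms : desiredSms) g = std::gcd(g, sms);  // GCD of all sizes
    GcdPlan plan{g, {}};
    for (unsigned int sms : desiredSms) plan.groupsToMerge.push_back(sms / g);
    return plan;
}
```

For example, desired partitions of 28 and 32 SMs would use 8-SM-granularity-free groups of 4 SMs each, merging 7 and 8 of them respectively; as the text notes, requesting the larger sizes directly via cudaDevSmResourceSplit is preferred.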
The cudaDevSmResourceSplit API aims to address these limitations by allowing the user to create non-overlapping heterogeneous partitions in a single call. The cudaDevSmResourceSplit runtime API signature is:

cudaError_t cudaDevSmResourceSplit(cudaDevResource* result, unsigned int nbGroups, const cudaDevResource* input, cudaDevResource* remainder, unsigned int flags, cudaDevSmResourceGroupParams* groupParams)
This API will attempt to partition the input SM-type resource into nbGroups valid device resources (groups) placed in the result array based on the requirements specified for each one in the groupParams array. An optional remaining partition may also be created. In a successful split, as shown in Figure 44, each resource in the result can have a different number of SMs, but never zero SMs.

Figure 44: SM resource split using the cudaDevSmResourceSplit API

When requesting a heterogeneous split, one needs to specify the SM count (smCount field of the relevant groupParams entry) for each resource in result. This SM count should always be a multiple of two. For the scenario in the previous image, groupParams[0].smCount would be X, groupParams[1].smCount Y, etc. However, just specifying the SM count is not sufficient, if an application uses Thread Block Clusters. Since all the thread blocks of a cluster are guaranteed to be co-scheduled, the user also needs to specify the maximum supported cluster size, if any, a given resource group should support. This is possible via the coscheduledSmCount field of the relevant groupParams entry. For GPUs with compute capability 10.0 and on (CC 10.0+), clusters can also have a preferred dimension, which is a multiple of their default cluster dimension. During a single kernel launch on supported systems, this larger preferred cluster dimension is used as much as possible, if at all, and the smaller default cluster dimension is used otherwise. The user can express this preferred cluster dimension hint via the preferredCoscheduledSmCount field of the relevant groupParams entry. Finally, there may be cases where the user may want to loosen the SM count requirements and pull in more available SMs in a given group; the user can express this backfill option by setting the flags field of the relevant groupParams entry to its non-default flag value.
To provide more flexibility, the cudaDevSmResourceSplit API also has a discovery mode, to be used when the exact SM count, for one or more groups, is not known ahead of time. For example, a user may want to create a device resource that has as many SMs as possible, while meeting some co-scheduling requirements (e.g., allowing clusters of size four). To exercise this discovery mode, the user can set the smCount field of the relevant groupParams entry (or entries) to zero. After a successful cudaDevSmResourceSplit call, the smCount field of the groupParams will have been populated with a valid non-zero value; we refer to this as the actual smCount value. If result was not null (so this was not a dry run), then the relevant group of result will also have its smCount set to the same value. The order the nbGroups groupParams entries are specified matters, as they are evaluated from left (index 0) to right (index nbGroups-1).

Table 15 provides a high level view of the supported arguments for the cudaDevSmResourceSplit API.
Table 15: Overview of the cudaDevSmResourceSplit split API

| Argument | Supported values |
| --- | --- |
| result | nullptr for explorative dry run; non-null pointer otherwise |
| nbGroups | number of groups to split the input into |
| input | resource to split into nbGroups groups |
| remainder | nullptr if you do not want a remainder group |
| flags | 0 or other valid flag |
| groupParams[i].smCount, i in [0, nbGroups) | 0 for discovery mode or valid smCount |
| groupParams[i].coscheduledSmCount | 0 (default) or valid coscheduled SM count |
| groupParams[i].preferredCoscheduledSmCount | 0 (default) or valid preferred coscheduled SM count (hint) |
| groupParams[i].flags | 0 (default) or cudaDevSmResourceGroupBackfill |
Notes:

1) The cudaDevSmResourceSplit API's return value depends on result:

▶ result != nullptr: the API will return cudaSuccess only when the split is successful and nbGroups valid cudaDevResource groups, meeting the specified requirements, were created; otherwise, it will return an error. As different types of errors may return the same error code (e.g., CUDA_ERROR_INVALID_RESOURCE_CONFIGURATION), it is recommended to use the CUDA_LOG_FILE environment variable to get more informative error descriptions during development.
▶ result == nullptr: the API may return cudaSuccess even if the resulting smCount of a group is zero, a case which would have returned an error with a non-nullptr result. Think of this mode as a dry-run test you can use while exploring what is supported, especially in discovery mode.

2) On a successful call with result != nullptr, the resulting result[i] device resource with i in [0, nbGroups) will be of type cudaDevResourceTypeSm and have a result[i].sm.smCount that will either be the non-zero user-specified groupParams[i].smCount value or the discovered one. In both cases, the result[i].sm.smCount will meet all the following constraints:
▶ be a multiple of 2 and
▶ be in the [2, input.sm.smCount] range and
▶ (flags == 0) ? (multiple of actual group_params[i].coscheduledSmCount) : (>= group_params[i].coscheduledSmCount)
3) Specifying zero for any of the coscheduledSmCount and preferredCoscheduledSmCount fields indicates that the default values for these fields should be used; these can vary per GPU. These default values are both equal to the smCoscheduledAlignment of the SM resource retrieved via the cudaDeviceGetDevResource API for the given device (and not any SM resource). To review these default values, one can examine their updated values in the relevant groupParams entry after a successful cudaDevSmResourceSplit call with them initially set to 0; see below.
int gpu_device_index = 0;
cudaDevResource initial_GPU_SM_resources {};
CUDA_CHECK(cudaDeviceGetDevResource(gpu_device_index, &initial_GPU_SM_resources, cudaDevResourceTypeSm));
std::cout << "Default value will be equal to " << initial_GPU_SM_resources.sm.smCoscheduledAlignment << std::endl;

int default_split_flags = 0;
cudaDevSmResourceGroupParams group_params_tmp = {.smCount=0, .coscheduledSmCount=0, .preferredCoscheduledSmCount=0, .flags=0};
CUDA_CHECK(cudaDevSmResourceSplit(nullptr, 1, &initial_GPU_SM_resources, nullptr /*remainder*/, default_split_flags, &group_params_tmp));
std::cout << "coscheduledSmCount default value: " << group_params_tmp.coscheduledSmCount << std::endl;
std::cout << "preferredCoscheduledSmCount default value: " << group_params_tmp.preferredCoscheduledSmCount << std::endl;
4) The remainder group, if present, will not have any constraints on its SM count or co-scheduling requirements. It will be up to the user to explore that.
Before providing more detailed information for the various cudaDevSmResourceGroupParams fields, Table 16 shows what these values could be for some example use cases. Assume an initial_GPU_SM_resources device resource has already been populated, as in the previous code snippet, and is the resource that will be split. Every row in the table will have that same starting point. For simplicity the table will only show the nbGroups value and the groupParams fields per use case that can be used in a code snippet like the one below.
int nbGroups = 2; // update as needed
unsigned int default_split_flags = 0;
cudaDevResource remainder {}; // update as needed
cudaDevResource result_use_case[2] = {{}, {}}; // Update depending on number of groups planned. Increase size if you plan to also use a workqueue resource
cudaDevSmResourceGroupParams group_params_use_case[2] = {{.smCount = X, .coscheduledSmCount = 0, .preferredCoscheduledSmCount = 0, .flags = 0},
                                                         {.smCount = Y, .coscheduledSmCount = 0, .preferredCoscheduledSmCount = 0, .flags = 0}};
CUDA_CHECK(cudaDevSmResourceSplit(&result_use_case[0], nbGroups, &initial_GPU_SM_resources, &remainder, default_split_flags, &group_params_use_case[0]));
Table 16: split API use cases (groupParams[i] fields shown per entry i, in ascending order)

| # | Goal/Use Case | nbGroups | remainder | i | smCount | coscheduledSmCount | preferredCoscheduledSmCount | flags |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Resource with 16 SMs. Do not care for remaining SMs. May use clusters. | 1 | nullptr | 0 | 16 | 0 | 0 | 0 |
| 2a | One resource with 16 SMs and one with everything else. Will not use clusters. (Note: showing two options; in option (2a), the 2nd resource is the remainder.) | 1 | not nullptr | 0 | 16 | 2 | 2 | 0 |
| 2b | (In option (2b), the 2nd resource is result_use_case[1].) | 2 | nullptr | 0 | 16 | 2 | 2 | 0 |
|  |  |  |  | 1 | 0 | 2 | 2 | cudaDevSmResourceGroupBackfill |
| 3 | Two resources with 28 and 32 SMs respectively. Will use clusters of size 4. | 2 | nullptr | 0 | 28 | 4 | 4 | 0 |
|  |  |  |  | 1 | 32 | 4 | 4 | 0 |
| 4 | One resource with as many SMs as possible, which can run clusters of size 8, and one remainder. | 1 | not nullptr | 0 | 0 | 8 | 8 | 0 |
| 5 | One resource with as many SMs as possible, which can run clusters of size 4, and one with 8 SMs. (Note: Order matters! Changing the order of entries in the groupParams array could mean no SMs left for the 8-SM group.) | 2 | nullptr | 0 | 8 | 2 | 2 | 0 |
|  |  |  |  | 1 | 0 | 4 | 4 | 0 |
Detailed information about the various cudaDevSmResourceGroupParams struct fields

smCount:

▶ Controls SM count for the corresponding group in result.
▶ Values: 0 (discovery mode) or valid non-zero value (non-discovery mode)
▶ Valid non-zero smCount value requirements: (multiple of 2) and in [2, input->sm.smCount] and ((flags == 0) ? multiple of actual coscheduledSmCount : greater than or equal to coscheduledSmCount)
▶ Use cases: use discovery mode to explore what's possible when the SM count is not known/fixed; use non-discovery mode to request a specific number of SMs.
▶ Note: in discovery mode, the actual SM count, after a successful split call with a non-nullptr result, will meet the valid non-zero value requirements
coscheduledSmCount:

▶ Controls the number of SMs grouped together ("co-scheduled") to enable launch of different clusters on compute capability 9.0+. It can thus impact the number of SMs in a resulting group and the cluster sizes they can support.
▶ Values: 0 (default for current architecture) or valid non-zero value
▶ Valid non-zero value requirements: (multiple of 2) up to max limit
▶ Use cases: Use the default or a manually chosen value for clusters, keeping in mind the max. portable cluster size on a given architecture. If your code does not use clusters, you can use the minimum supported value of 2 or the default value.
▶ Note: when the default value is used, the actual coscheduledSmCount, after a successful split call, will also meet the valid non-zero value requirements. If flags is not zero, the resulting smCount will be >= coscheduledSmCount. Think of coscheduledSmCount as providing some guaranteed underlying "structure" to valid resulting groups (i.e., that group can run at least a single cluster of coscheduledSmCount size in the worst case). This type of structure guarantee does not apply to the remaining group; there it is up to the user to explore what cluster sizes can be launched.
preferredCoscheduledSmCount:

▶ Acts as a hint to the driver to try to merge groups of actual coscheduledSmCount SMs into larger groups of preferredCoscheduledSmCount if possible. Doing so can allow code to make use of the preferred cluster dimensions feature available on devices with compute capability (CC) 10.0 and on. See cudaLaunchAttributeValue::preferredClusterDim.
▶ Values: 0 (default for current architecture) or valid non-zero value
▶ Valid non-zero value requirements: (multiple of actual coscheduledSmCount)
▶ Use cases: use a manually chosen value greater than 2 if you use preferred clusters and are on a device of compute capability 10.0 (Blackwell) or later. If you don't use clusters, choose the same value as coscheduledSmCount: either select the minimum supported value of 2 or use 0 for both.
▶ Note: when the default value is used, the actual preferredCoscheduledSmCount, after a successful split call, will also meet the valid non-zero value requirement.
flags:
▶ Controls whether the resulting SM count of a group will be a multiple of the actual coscheduled SM
count (default) or whether SMs can be backfilled into this group (backfill). In the backfill case, the
resulting SM count (result[i].sm.smCount) will be greater than or equal to the specified
groupParams[i].smCount.
▶ Values: 0 (default) or cudaDevSmResourceGroupBackfill
▶ Use cases: Use zero (the default), so the resulting group has the guaranteed flexibility of
supporting multiple clusters of coscheduledSmCount size. Use the backfill option if you want to
get as many SMs as possible in the group, with some of these SMs (the backfilled ones) not
providing any coscheduling guarantee.
4.6. Green Contexts 245
CUDA Programming Guide, Release 13.1
Note: a group created with the backfill flag can still support clusters (e.g., it is guaranteed to
support at least one cluster of coscheduledSmCount size).
4.6.4.3 Step 2 (continued): Add Work Queue Resources

If you also want to specify a work queue resource, then this needs to be done explicitly. The following
example shows how to create a work queue configuration resource for a specific device with balanced
sharing scope and a concurrency limit of four.
cudaDevResource split_result[2] = {{}, {}};
// code to populate split_result[0] not shown; used split API with nbGroups=1

// The last resource will be a workqueue resource.
split_result[1].type = cudaDevResourceTypeWorkqueueConfig;
split_result[1].wqConfig.device = 0; // assume device ordinal of 0
split_result[1].wqConfig.sharingScope = cudaDevWorkqueueConfigScopeGreenCtxBalanced;
split_result[1].wqConfig.wqConcurrencyLimit = 4;
A work queue concurrency limit of four hints to the driver that the user expects a maximum of four
concurrent stream-ordered workloads. The driver will assign work queues trying to respect this hint, if
possible.
4.6.4.4 Step 3: Create a Resource Descriptor

The next step, after resources have been split, is to generate a resource descriptor, using the
cudaDevResourceGenerateDesc API, for all the resources expected to be available to a green context.
The relevant CUDA runtime API function signature is:

cudaError_t cudaDevResourceGenerateDesc(cudaDevResourceDesc_t *phDesc, cudaDevResource *resources, unsigned int nbResources)
It is possible to combine multiple cudaDevResource resources. For example, the code snippet below
shows how to generate a resource descriptor that encapsulates three groups of resources. You just
need to ensure that these resources are all allocated contiguously in the resources array.
cudaDevResource actual_split_result[5] = {};
// code to populate actual_split_result not shown

// Generate resource desc. to encapsulate 3 resources: actual_split_result[2] to [4]
cudaDevResourceDesc_t resource_desc;
CUDA_CHECK(cudaDevResourceGenerateDesc(&resource_desc, &actual_split_result[2], 3));
Combining different types of resources is also supported. For example, one could generate a descriptor
with both SM and workqueue resources.
For a cudaDevResourceGenerateDesc call to be successful:
▶ All nbResources resources should belong to the same GPU device.
▶ If multiple SM-type resources are combined, they should be generated from the same split API
call and have the same coscheduledSmCount values (if not part of the remainder).
▶ Only a single workqueue config or workqueue type resource may be present.
4.6.4.5 Step 4: Create a Green Context

The final step is to create a green context from a resource descriptor using the cudaGreenCtxCreate
API. That green context will only have access to the resources (e.g., SMs, work queues) encapsulated in
the resource descriptor specified during its creation. These resources will be provisioned during this
step.
The relevant CUDA runtime API function signature is:

cudaError_t cudaGreenCtxCreate(cudaExecutionContext_t *phCtx, cudaDevResourceDesc_t desc, int device, unsigned int flags)
The flags parameter should be set to 0. It is also recommended to explicitly initialize the device's
primary context before creating a green context via either the cudaInitDevice API or the
cudaSetDevice API, which also sets the primary context as current to the calling thread. Doing so
ensures there will be no additional primary-context initialization overhead during green context creation.
See the code snippet below.
int current_device = 0; // assume single GPU
CUDA_CHECK(cudaSetDevice(current_device)); // Or cudaInitDevice

cudaDevResourceDesc_t resource_desc {};
// Code to generate resource_desc not shown

// Create a green_ctx on the GPU with current_device ID with access to resources from resource_desc
cudaExecutionContext_t green_ctx {};
CUDA_CHECK(cudaGreenCtxCreate(&green_ctx, resource_desc, current_device, 0));
After a successful green context creation, the user can verify its resources by calling
cudaExecutionCtxGetDevResource on that execution context for each resource type.
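For example, the SM resource could be queried as in the sketch below. The (context, resource, type) argument order and the cudaDevResourceTypeSm enumerator are assumptions made by analogy with the driver API and with cudaDevResourceTypeWorkqueueConfig shown earlier; consult the CUDA runtime API reference before relying on them.

```cpp
// Sketch: verify the SM resources provisioned for the green context created
// above. Argument order and the SM resource type enumerator are assumptions;
// check the cudaExecutionCtxGetDevResource entry in the API reference.
cudaDevResource sm_resource {};
CUDA_CHECK(cudaExecutionCtxGetDevResource(green_ctx, &sm_resource,
                                          cudaDevResourceTypeSm));
printf("green context provisioned SM count: %u\n", sm_resource.sm.smCount);
```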
Creating Multiple Green Contexts

An application can have more than one green context, in which case some of the steps above should
be repeated. For most use cases, these green contexts will each have a separate, non-overlapping
set of provisioned SMs. For example, for the case of five homogeneous cudaDevResource groups
(the actual_split_result array), one green context's descriptor may encapsulate actual_split_result[2]
to [4], while the descriptor of another green context may encapsulate actual_split_result[0]
to [1]. In this case, a specific SM will be provisioned for only one of the two green contexts of the
application.

But SM oversubscription is also possible and may be used in some cases. For example, it may
be acceptable to have the second green context's descriptor encapsulate actual_split_result[0] to
[2]. In this case, all the SMs of the actual_split_result[2] cudaDevResource will be oversubscribed,
i.e., provisioned for both green contexts, while resources actual_split_result[0] to [1] and
actual_split_result[3] to [4] may only be used by one of the two green contexts. SM oversubscription
should be used judiciously on a per-case basis.
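As an illustration of the non-overlapping case, the two descriptors could be generated over disjoint ranges of the same actual_split_result array. This is a sketch reusing the CUDA_CHECK macro and the APIs shown earlier, not additional required steps:

```cpp
// Two descriptors over disjoint groups of the same split result.
cudaDevResourceDesc_t desc_a {}, desc_b {};
CUDA_CHECK(cudaDevResourceGenerateDesc(&desc_a, &actual_split_result[2], 3)); // groups [2]..[4]
CUDA_CHECK(cudaDevResourceGenerateDesc(&desc_b, &actual_split_result[0], 2)); // groups [0]..[1]

// Each green context is provisioned with its own SMs; no SM is shared.
cudaExecutionContext_t green_ctx_a {}, green_ctx_b {};
CUDA_CHECK(cudaGreenCtxCreate(&green_ctx_a, desc_a, /*device=*/0, 0));
CUDA_CHECK(cudaGreenCtxCreate(&green_ctx_b, desc_b, /*device=*/0, 0));
```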
4.6.5. Green Contexts - Launching Work

To launch a kernel targeting a green context created using the prior steps, you first need to create a
stream for that green context with the cudaExecutionCtxStreamCreate API. Launching a kernel
on that stream using <<< >>> or the cudaLaunchKernel API will ensure that the kernel can only use the
resources (SMs, work queues) available to that stream via its execution context. For example:
// Create green_ctx_stream CUDA stream for previously created green_ctx green context
cudaStream_t green_ctx_stream;
int priority = 0;
CUDA_CHECK(cudaExecutionCtxStreamCreate(&green_ctx_stream,
                                        green_ctx,
                                        cudaStreamDefault,
                                        priority));

// Kernel my_kernel will only use the resources (SMs, work queues, as applicable)
// available to green_ctx_stream's execution context
my_kernel<<<grid_dim, block_dim, 0, green_ctx_stream>>>();
CUDA_CHECK(cudaGetLastError());
The default stream creation flag passed to the stream creation API above is equivalent to
cudaStreamNonBlocking, given green_ctx is a green context.
CUDA Graphs

For kernels launched as part of a CUDA graph (see CUDA Graphs), there are a few more subtleties.
Unlike kernels, the CUDA stream a CUDA graph is launched on does not determine the SM resources
used, as that stream is solely used for dependency tracking.

The execution context a kernel node (and other applicable node types) will execute on is set during
node creation. If the CUDA graph will be created using stream capture, then the execution context(s)
of the stream(s) involved in the capture will determine the execution context(s) of the relevant graph
nodes. If the graph will be created using the graph APIs, then the user should explicitly set the
execution context for each relevant node. For example, to add a kernel node, the user should use the
polymorphic cudaGraphAddNode API with the cudaGraphNodeTypeKernel type and explicitly specify
the .ctx field of the cudaKernelNodeParamsV2 struct under .kernel. The cudaGraphAddKernelNode
API does not allow the user to specify an execution context and should thus be avoided. Please
note that it is possible for different graph nodes in a graph to belong to different execution contexts.

For verification purposes, one could use Nsight Systems in node tracing mode (--cuda-graph-trace
node) to observe the green context(s) specific graph nodes will execute on. Note that in the default
graph tracing mode, the entire graph will appear under the green context of the stream it was launched
on, but, as previously explained, this does not provide any information about the execution context(s)
of the various graph nodes.

To verify programmatically, one could potentially use the CUDA driver API
cuGraphKernelNodeGetParams(graph_node, &node_params) and compare the node_params.ctx
context handle field with the expected context handle for that graph node. Using the driver API
is possible given CUgraphNode and cudaGraphNode_t can be used interchangeably, but the user
would need to include the relevant cuda.h header and link with the driver directly (-lcuda).
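A minimal sketch of that driver-API check is shown below; expected_ctx is a hypothetical CUcontext handle for the context you intended, and error handling is elided:

```cpp
#include <cuda.h> // CUDA driver API; link with -lcuda

// Compare a kernel node's execution context with the expected one.
CUDA_KERNEL_NODE_PARAMS node_params {};
cuGraphKernelNodeGetParams(graph_node, &node_params);
if (node_params.ctx != expected_ctx) {
    // The node is not bound to the execution context you intended.
}
```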
Thread Block Clusters

Kernels with thread block clusters (see Section 1.2.2.1.1) can also be launched on a green context
stream, like any other kernel, and thus use that green context's provisioned resources. Section
4.6.4.2 showed how to specify the number of SMs that need to be coscheduled when a device resource
is split, to facilitate clusters. But as with any kernel using clusters, the user should make use
of the relevant occupancy APIs to determine the maximum potential cluster size for a kernel (via
cudaOccupancyMaxPotentialClusterSize) and, if needed, the maximum number of active clusters
(cudaOccupancyMaxActiveClusters). If the user specifies a green context stream as the stream
field of the relevant cudaLaunchConfig, then these occupancy APIs will take into consideration the
SM resources provisioned for that green context. This use case is especially relevant for libraries that
may get a green context CUDA stream passed to them by the user, as well as in cases where the green
context was created from a remaining device resource.
The code snippet below shows how these APIs can be used.
// Assume cudaStream_t gc_stream has already been created and a __global__ void cluster_kernel exists.

// Uncomment to support non-portable cluster size, if possible
// CUDA_CHECK(cudaFuncSetAttribute(cluster_kernel, cudaFuncAttributeNonPortableClusterSizeAllowed, 1))

cudaLaunchConfig_t config = {0};
config.gridDim = grid_dim; // has to be a multiple of cluster dim.
config.blockDim = block_dim;
config.dynamicSmemBytes = expected_dynamic_shared_mem;

cudaLaunchAttribute attribute[1];
attribute[0].id = cudaLaunchAttributeClusterDimension;
attribute[0].val.clusterDim.x = 1;
attribute[0].val.clusterDim.y = 1;
attribute[0].val.clusterDim.z = 1;
config.attrs = attribute;
config.numAttrs = 1;
config.stream = gc_stream; // Need to pass the CUDA stream that will be used for that kernel

int max_potential_cluster_size = 0;
// the next call will ignore cluster dims in launch config
CUDA_CHECK(cudaOccupancyMaxPotentialClusterSize(&max_potential_cluster_size, cluster_kernel, &config));
std::cout << "max potential cluster size is " << max_potential_cluster_size
          << " for CUDA stream gc_stream" << std::endl;

// Could choose to update launch config's clusterDim with max_potential_cluster_size.
// Doing so would result in a successful cudaLaunchKernelEx call for the same kernel and launch config.

int num_clusters = 0;
CUDA_CHECK(cudaOccupancyMaxActiveClusters(&num_clusters, cluster_kernel, &config));
std::cout << "Potential max. active clusters count is " << num_clusters << std::endl;
Verify Green Contexts Use

Beyond empirical observations of affected kernel execution times due to green context provisioning,
the user can leverage the Nsight Systems or Nsight Compute CUDA developer tools to verify, to some
extent, correct green contexts use.

For example, kernels launched on CUDA streams belonging to different green contexts will appear
under different Green Context rows under the CUDA HW timeline section of an Nsight Systems report.

Nsight Compute provides a Green Context Resources overview in its Session page as well as an updated
# SMs entry under the Launch Statistics of the Details section. The former provides a visual bitmask of
provisioned resources. This is particularly useful if an application uses different green contexts, as the
user can confirm the expected overlap across GCs (no overlap, or expected non-zero overlap if SMs are
oversubscribed).
Figure 45 depicts these resources for an example with two green contexts provisioned with 112 and 16
SMs respectively, with no SM overlap across them. The provided view can help the user verify the
provisioned SM resource count per green context. It also helps confirm that no SMs were oversubscribed,
as no box is marked green (provisioned for that GC) across both green contexts.

Figure 45: Green contexts resources section from Nsight Compute

The Launch Statistics section also explicitly lists the number of SMs provisioned for this green context,
which can thus be used by this kernel. Please note that these are the SMs a given kernel can have
access to during its execution, and not the actual number of SMs that kernel ran on. The same applies
to the resources overview shown earlier. The actual number of SMs used by the kernel can depend
on various factors, including the kernel itself (launch geometry, etc.) and other work running at the same
time on the GPU.
4.6.6. Additional Execution Context APIs

This section touches upon some additional green context APIs. For a complete list, please refer to the
relevant CUDA runtime API section.
For synchronization using CUDA events, one can leverage the cudaError_t cudaExecutionCtxRecordEvent(cudaExecutionContext_t ctx, cudaEvent_t event) and cudaError_t cudaExecutionCtxWaitEvent(cudaExecutionContext_t ctx, cudaEvent_t event) APIs. cudaExecutionCtxRecordEvent records a CUDA event capturing all
work/activities of the specified execution context at the time of this call, while
cudaExecutionCtxWaitEvent makes all future work submitted to the execution context wait for the
work captured in the specified event.
Using cudaExecutionCtxRecordEvent is more convenient than cudaEventRecord if the execution
context has multiple CUDA streams. To achieve equivalent behavior without this execution context
API, one would need to record a separate CUDA event via cudaEventRecord on every execution
context stream and then have dependent work wait separately for all these events. Similarly,
cudaExecutionCtxWaitEvent is more convenient than cudaStreamWaitEvent if one needs all execution
context streams to wait for an event to complete. The alternative would be a separate
cudaStreamWaitEvent for every stream in this execution context.
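For example, making all future work in one execution context wait for everything already submitted to another could be sketched as follows; ctx_a and ctx_b are placeholder execution context handles, and CUDA_CHECK is the error-checking macro used throughout this section:

```cpp
cudaEvent_t ev;
CUDA_CHECK(cudaEventCreateWithFlags(&ev, cudaEventDisableTiming));

// Capture all work submitted to ctx_a so far, across all of its streams.
CUDA_CHECK(cudaExecutionCtxRecordEvent(ctx_a, ev));

// All future work on any stream of ctx_b now waits for the captured work.
CUDA_CHECK(cudaExecutionCtxWaitEvent(ctx_b, ev));
```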
For blocking synchronization on the CPU side, one can use cudaError_t cudaExecutionCtxSynchronize(cudaExecutionContext_t ctx). This call will block until the specified execution context
has completed all its work. If the specified execution context was not created via cudaGreenCtxCreate,
but was rather obtained via cudaDeviceGetExecutionCtx, and is thus the device's primary
context, calling that function will also synchronize all green contexts that have been created on the
same device.

To retrieve the device a given execution context is associated with, one can use
cudaExecutionCtxGetDevice. To retrieve the unique identifier of a given execution context, one can
use cudaExecutionCtxGetId.

Finally, an explicitly created execution context can be destroyed via the cudaError_t cudaExecutionCtxDestroy(cudaExecutionContext_t ctx) API.
4.6.7. Green Contexts Example

This section illustrates how green contexts can enable critical work to start and complete sooner. Similar
to the scenario used in Section 4.6.1, the application has two kernels that will run on two different
non-blocking CUDA streams. The timeline, from the CPU side, is as follows. A long-running kernel
(delay_kernel_us), which takes multiple waves on the full GPU, is launched first on CUDA stream strm1.
Then, after a brief wait time (less than the kernel duration), a shorter but critical kernel (critical_kernel)
is launched on stream strm2. The GPU durations and the time from CPU launch to completion for both
kernels are measured.

As a proxy for a long-running kernel, a delay kernel is used where every thread block runs for a fixed
number of microseconds and the number of thread blocks exceeds the GPU's available SMs.
Initially, no green contexts are used, but the critical kernel is launched on a CUDA stream with a higher
priority than the long-running kernel. Because of its high-priority stream, the critical kernel can start
executing as soon as some of the thread blocks of the long-running kernel complete. However, it
will still need to wait for some potentially long-running thread blocks to complete, which will delay its
execution start.

Figure 46 shows this scenario in an Nsight Systems report. The long-running kernel is launched on
stream 13, while the short but critical kernel is launched on stream 14, which has higher stream priority.
As highlighted on the image, the critical kernel waits for 0.9 ms (in this case) before it can start
executing. If the two streams had identical priorities, the critical kernel would execute much later.

Figure 46: Nsight Systems timeline without green contexts
To leverage the green contexts feature, two green contexts are created, each provisioned with a distinct,
non-overlapping set of SMs. The exact SM split in this case for an H100 with 132 SMs was
chosen, for illustration purposes, as 16 SMs for the critical kernel (Green Context 3) and 112 SMs for
the long-running kernel (Green Context 2). As Figure 47 shows, the critical kernel can now start almost
instantaneously, as there are SMs only Green Context 3 can use.

The duration of the short kernel may increase, compared to its duration when running in isolation, as
there is now a limit on the number of SMs it can use. The same is also the case for the long-running
kernel, which can no longer use all the SMs of the GPU, but is constrained by its green context's
provisioned resources. However, the key result is that the critical kernel work can now start and
complete significantly sooner than before. That is barring any other limitations, as parallel execution, as
mentioned earlier, cannot be guaranteed.

Figure 47: Nsight Systems timeline with green contexts

In all cases, the exact SM split should be decided on a per-case basis after experimentation.
4.7. Lazy Loading

4.7.1. Introduction

Lazy loading reduces program initialization time by waiting to load CUDA modules until they are needed.
Lazy loading is particularly effective for programs that only use a small number of the kernels they
include, as is common when using libraries. Lazy loading is designed to be invisible to the user when the
CUDA programming model is followed. Potential Hazards explains this in detail. As of CUDA 12.3, lazy
loading is enabled by default on all platforms, but can be controlled via the CUDA_MODULE_LOADING
environment variable.
4.7.2. Change History

Table 17: Select Lazy Loading Changes by CUDA Version

| CUDA Version | Change |
| ------------ | ------ |
| 12.3 | Lazy loading performance improved. Now enabled by default for Windows. |
| 12.2 | Lazy loading enabled by default for Linux. |
| 11.7 | Lazy loading first introduced, disabled by default. |
4.7.3. Requirements for Lazy Loading

Lazy loading is a joint feature of both the CUDA runtime and driver. Lazy loading is only available when
the runtime and driver version requirements are satisfied.
4.7.3.1 CUDA Runtime Version Requirement

Lazy loading is available starting in CUDA runtime version 11.7. As the CUDA runtime is usually linked
statically into programs and libraries, only programs and libraries from, or compiled with, the CUDA 11.7+
toolkit will benefit from lazy loading. Libraries compiled using older CUDA runtime versions will load all
modules eagerly.

4.7.3.2 CUDA Driver Version Requirement

Lazy loading requires driver version 515 or newer. Lazy loading is not available for driver versions older
than 515, even when using CUDA toolkit 11.7 or newer.

4.7.3.3 Compiler Requirements

Lazy loading does not require any compiler support. Both SASS and PTX compiled with pre-11.7
compilers can be loaded with lazy loading enabled, and will see the full benefits of the feature. However,
the version 11.7+ CUDA runtime is still required, as described above.

4.7.3.4 Kernel Requirements

Lazy loading does not affect modules containing managed variables, which will still be loaded eagerly.
4.7.4. Usage

4.7.4.1 Enabling & Disabling

Lazy loading is enabled by setting the CUDA_MODULE_LOADING environment variable to LAZY. Lazy
loading can be disabled by setting the CUDA_MODULE_LOADING environment variable to EAGER. As of
CUDA 12.3, lazy loading is enabled by default on all platforms.
4.7.4.2 Checking if Lazy Loading is Enabled at Runtime

The cuModuleGetLoadingMode API in the CUDA driver API can be used to determine if lazy loading
is enabled. Note that CUDA must be initialized before running this function. Sample usage is shown
in the snippet below.
#include <cuda.h>
#include <assert.h>
#include <iostream>

int main() {
    CUmoduleLoadingMode mode;

    assert(CUDA_SUCCESS == cuInit(0));
    assert(CUDA_SUCCESS == cuModuleGetLoadingMode(&mode));

    std::cout << "CUDA Module Loading Mode is "
              << ((mode == CU_MODULE_LAZY_LOADING) ? "lazy" : "eager") << std::endl;

    return 0;
}
4.7.4.3 Forcing a Module to Load Eagerly at Runtime

Loading kernels and variables happens automatically, without any need for explicit loading. Kernels
can be loaded explicitly, even without executing them, by doing the following:
▶ The cuModuleGetFunction() function will cause a module to be loaded into device memory
▶ The cudaFuncGetAttributes() function will cause a kernel to be loaded into device memory

Note

cuModuleLoad() does not guarantee that a module will be loaded immediately.
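For example, a runtime-API kernel can be preloaded without launching it by querying its attributes; my_kernel is a placeholder for any __global__ function, and CUDA_CHECK is an error-checking macro as used elsewhere in this guide:

```cpp
// Querying a kernel's attributes forces its module to be loaded now,
// rather than at the kernel's first launch.
cudaFuncAttributes attr;
CUDA_CHECK(cudaFuncGetAttributes(&attr, my_kernel));
```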
4.7.5. Potential Hazards

Lazy loading is designed so that it should not require any modifications to applications to use it. That
said, there are some caveats, especially when applications are not fully compliant with the CUDA
programming model, as described below.

4.7.5.1 Impact on Concurrent Kernel Execution

Some programs incorrectly assume that concurrent kernel execution is guaranteed. A deadlock can
occur if cross-kernel synchronization is required, but kernel execution has been serialized. To minimize
the impact of lazy loading on concurrent kernel execution, do the following:
▶ preload all kernels that you hope to execute concurrently prior to launching them, or
▶ run the application with CUDA_MODULE_LOADING=EAGER to force loading data eagerly without
forcing each function to load eagerly
4.7.5.2 Large Memory Allocations

Lazy loading delays memory allocation for CUDA modules from program initialization until closer to
execution time. If an application allocates the entire VRAM on startup, CUDA can fail to allocate memory
for modules at runtime. Possible solutions:
▶ use cudaMallocAsync() instead of an allocator that allocates the entire VRAM on startup
▶ add some buffer to compensate for the delayed loading of kernels
▶ preload all kernels that will be used in the program before trying to initialize the allocator
4.7.5.3 Impact on Performance Measurements

Lazy loading may skew performance measurements by moving CUDA module initialization into the
measured execution window. To avoid this:
▶ do at least one warmup iteration prior to measurement
▶ preload the benchmarked kernel prior to launching it
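The warmup pattern itself is independent of CUDA. In the host-side sketch below, run_workload is a stand-in for launching the benchmarked kernel and synchronizing its stream; the first, unmeasured call absorbs any one-time (lazy) module-loading cost:

```cpp
#include <chrono>

// Stand-in for: launch kernel, then cudaStreamSynchronize(stream).
inline void run_workload() {
    volatile long s = 0;
    for (long i = 0; i < 100000; ++i) s += i;
}

// Time one iteration after a warmup pass, so one-time initialization
// cost is excluded from the measurement.
inline long long measure_after_warmup() {
    run_workload();  // warmup iteration: not measured
    auto t0 = std::chrono::steady_clock::now();
    run_workload();  // measured iteration
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}
```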
4.8. Error Log Management

The Error Log Management mechanism allows CUDA API errors to be reported to developers in a
plain-English format that describes the cause of the issue.
4.8.1. Background

Traditionally, the only indication of a failed CUDA API call is the return of a non-zero code. As of CUDA
Toolkit 12.9, the CUDA Runtime defines over 100 different return codes for error conditions, but many
of them are generic and give the developer no assistance with debugging the cause.
4.8.2. Activation

Set the CUDA_LOG_FILE environment variable. Acceptable values are stdout, stderr, or a valid path on
the system to write a file. The log buffer can be dumped via the API even if CUDA_LOG_FILE was not set
before program execution. NOTE: an error-free execution may not print any logs.
4.8.3. Output

Logs are output in the following format:

[Time][TID][Source][Severity][API Entry Point] Message

The following line is an actual error message that is generated if the developer tries to dump the Error
Log Management logs to an unallocated buffer:

[22:21:32.099][25642][CUDA][E][cuLogsDumpToMemory] buffer cannot be NULL

Where before, all the developer would have gotten is CUDA_ERROR_INVALID_VALUE in the return code
and possibly "invalid argument" if cuGetErrorString is called.
4.8.4. API Description

The CUDA Driver provides APIs in two categories for interacting with the Error Log Management feature.

This feature allows developers to register callback functions to be used whenever an error log is
generated, where the callback signature is:

void callbackFunc(void *data, CUlogLevel logLevel, char *message, size_t length)

Callbacks are registered with this API:

CUresult cuLogsRegisterCallback(CUlogsCallback callbackFunc, void *userData, CUlogsCallbackHandle *callback_out)

Where userData is passed to the callback function without modification. callback_out should be
stored by the caller for use in cuLogsUnregisterCallback:

CUresult cuLogsUnregisterCallback(CUlogsCallbackHandle callback)

The other set of API functions is for managing the output of logs. An important concept is the log
iterator, which points to the current end of the buffer:

CUresult cuLogsCurrent(CUlogIterator *iterator_out, unsigned int flags)
The iterator position can be kept by the calling software in situations where a dump of the entire log
buffer is not desired. Currently, the flags parameter must be 0, with additional options reserved for
future CUDA releases.

At any time, the error log buffer can be dumped to either a file or memory with these functions:

CUresult cuLogsDumpToFile(CUlogIterator *iterator, const char *pathToFile, unsigned int flags)
CUresult cuLogsDumpToMemory(CUlogIterator *iterator, char *buffer, size_t *size, unsigned int flags)

If iterator is NULL, the entire buffer will be dumped, up to the maximum of 100 entries. If iterator is
not NULL, logs will be dumped starting from that entry and the value of iterator will be updated to the
current end of the logs, as if cuLogsCurrent had been called. If there have been more than 100 log
entries into the buffer, a note will be added at the start of the dump noting this.

The flags parameter must be 0, with additional options reserved for future CUDA releases.

The cuLogsDumpToMemory function has additional considerations:
1. The buffer itself will be null-terminated, but each individual log entry will only be separated by a
newline (\n) character.
2. The maximum size of the buffer is 25600 bytes.
3. If the value provided in size is not sufficient to store all desired logs, a note will be added as the
first entry and the oldest entries that do not fit will not be dumped.
4. After returning, size will contain the actual number of bytes written to the provided buffer.
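Putting considerations 2 through 4 together, a dump into a caller-owned buffer could look like the following sketch (driver API; error handling elided):

```cpp
// 25600 bytes is the documented maximum buffer size for cuLogsDumpToMemory.
char buffer[25600];
size_t size = sizeof(buffer);

// NULL iterator: dump the whole log buffer (up to 100 entries).
// On return, size holds the number of bytes actually written.
cuLogsDumpToMemory(NULL, buffer, &size, 0);
// buffer is null-terminated; individual entries are separated by '\n'.
```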
4.8.5. Limitations and Known Issues

1. The log buffer is limited to 100 entries. After this limit is reached, the oldest entries will be
replaced and log dumps will contain a line noting the rollover.
2. Not all CUDA APIs are covered yet. This is an ongoing project to provide better usage error
reporting for all APIs.
3. The Error Log Management log location (if given) will not be tested for validity until/unless a log
is generated.
4. The Error Log Management APIs are currently only available via the CUDA Driver. Equivalent APIs
will be added to the CUDA Runtime in a future release.
5. The log messages are not localized to any language; all provided logs are in US English.
4.9. Asynchronous Barriers

Asynchronous barriers, introduced in Advanced Synchronization Primitives, extend CUDA synchronization
beyond __syncthreads() and __syncwarp(), enabling fine-grained, non-blocking coordination
and better overlap of communication and computation.

This section provides details on how to use asynchronous barriers, mainly via the cuda::barrier API
(with pointers to cuda::ptx and primitives where applicable).
4.9.1. Initialization

Initialization must happen before any thread begins participating in a barrier.

CUDA C++ cuda::barrier

#include <cuda/barrier>
#include <cooperative_groups.h>

__global__ void init_barrier()
{
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;
    auto block = cooperative_groups::this_thread_block();

    if (block.thread_rank() == 0)
    {
        // A single thread initializes the total expected arrival count.
        init(&bar, block.size());
    }
    block.sync();
}
CUDA C++ cuda::ptx

#include <cuda/ptx>
#include <cooperative_groups.h>

__global__ void init_barrier()
{
    __shared__ uint64_t bar;
    auto block = cooperative_groups::this_thread_block();

    if (block.thread_rank() == 0)
    {
        // A single thread initializes the total expected arrival count.
        cuda::ptx::mbarrier_init(&bar, block.size());
    }
    block.sync();
}
CUDA C primitives

#include <cuda_awbarrier_primitives.h>
#include <cooperative_groups.h>

__global__ void init_barrier()
{
    __shared__ uint64_t bar;
    auto block = cooperative_groups::this_thread_block();

    if (block.thread_rank() == 0)
    {
        // A single thread initializes the total expected arrival count.
        __mbarrier_init(&bar, block.size());
    }
    block.sync();
}
Before any thread can participate in a barrier, the barrier must be initialized using the
cuda::barrier::init() friend function. This must happen before any thread arrives at the barrier.
This poses a bootstrapping challenge in that threads must synchronize before participating in
the barrier, but threads are creating a barrier in order to synchronize. In this example, threads that will
participate are part of a cooperative group and use block.sync() to bootstrap initialization. Since a
whole thread block is participating in the barrier, __syncthreads() could also be used.

The second parameter of init() is the expected arrival count, i.e., the number of times bar.arrive()
will be called by participating threads before a participating thread is unblocked from its call to
bar.wait(std::move(token)). In this and the previous examples, the barrier is initialized with the
number of threads in the thread block, i.e., cooperative_groups::this_thread_block().size(), so
that all threads within the thread block can participate in the barrier.

Asynchronous barriers are flexible in specifying how threads participate (split arrive/wait) and which
threads participate. In contrast, this_thread_block.sync() or __syncthreads() is applicable
to the whole thread block and __syncwarp(mask) to a specified subset of a warp. Nonetheless, if
the intention of the user is to synchronize a full thread block or a full warp, we recommend using
__syncthreads() and __syncwarp() respectively, for better performance.
4.9.2. A Barrier’s Phase: Arrival, Countdown, Completion, and Reset
An asynchronous barrier counts down from the expected arrival count to zero as participating threads call bar.arrive(). When the countdown reaches zero, the barrier is complete for the current phase. When the last call to bar.arrive() causes the countdown to reach zero, the countdown is automatically and atomically reset. The reset assigns the countdown to the expected arrival count, and moves the barrier to the next phase.
A token object of class cuda::barrier::arrival_token, as returned from token = bar.arrive(), is associated with the current phase of the barrier. A call to bar.wait(std::move(token)) blocks the calling thread while the barrier is in the current phase, i.e., while the phase associated with the token matches the phase of the barrier. If the phase is advanced (because the countdown reaches zero) before the call to bar.wait(std::move(token)), then the thread does not block; if the phase is advanced while the thread is blocked in bar.wait(std::move(token)), the thread is unblocked.
It is essential to know when a reset could or could not occur, especially in non-trivial arrive/wait synchronization patterns.

▶ A thread’s calls to token = bar.arrive() and bar.wait(std::move(token)) must be sequenced such that token = bar.arrive() occurs during the barrier’s current phase, and bar.wait(std::move(token)) occurs during the same or the next phase.
▶ A thread’s call to bar.arrive() must occur when the barrier’s counter is non-zero. After barrier initialization, if a thread’s call to bar.arrive() causes the countdown to reach zero, then a call to bar.wait(std::move(token)) must happen before the barrier can be reused for a subsequent call to bar.arrive().
▶ bar.wait() must only be called using a token object of the current phase or the immediately preceding phase. For any other values of the token object, the behavior is undefined.

For simple arrive/wait synchronization patterns, compliance with these usage rules is straightforward.
4.9.2.1 Warp Entanglement

Warp divergence affects the number of times an arrive-on operation updates the barrier. If the invoking warp is fully converged, then the barrier is updated once. If the invoking warp is fully diverged, then 32 individual updates are applied to the barrier.

Note

It is recommended that arrive-on(bar) invocations are made by converged threads to minimize updates to the barrier object. When code preceding these operations diverges threads, the warp should be re-converged, via __syncwarp, before invoking arrive-on operations.
4.9.3. Explicit Phase Tracking

An asynchronous barrier can have multiple phases depending on how many times it is used to synchronize threads and memory operations. Instead of using tokens to track barrier phase flips, we can directly track a phase using the mbarrier_try_wait_parity() family of functions available through the cuda::ptx and primitives APIs.

In its simplest form, the cuda::ptx::mbarrier_try_wait_parity(uint64_t* bar, const uint32_t& phaseParity) function waits for a phase with a particular parity. The phaseParity operand is the integer parity of either the current phase or the immediately preceding phase of the barrier object. An even phase has integer parity 0 and an odd phase has integer parity 1. When we initialize a barrier, its phase has parity 0, so the valid values of phaseParity are 0 and 1. Explicit phase tracking can be useful when tracking asynchronous memory operations, as it allows only a single thread to arrive on the barrier and set the transaction count, while other threads only wait for a parity-based phase flip. This can be more efficient than having all threads arrive on the barrier and use tokens. This functionality is only available for shared-memory barriers at thread-block and cluster scope.
CUDA C++ cuda::barrier

#include <cuda/ptx>
#include <cuda/barrier>  // for cuda::barrier and cuda::device::barrier_native_handle
#include <cooperative_groups.h>

__device__ void compute(float *data, int iteration);

__global__ void split_arrive_wait(int iteration_count, float *data)
{
    using barrier_t = cuda::barrier<cuda::thread_scope_block>;
    __shared__ barrier_t bar;
    int parity = 0; // Initial phase parity is 0.
    auto block = cooperative_groups::this_thread_block();
    if (block.thread_rank() == 0)
    {
        // Initialize barrier with expected arrival count.
        init(&bar, block.size());
    }
    block.sync();

    for (int i = 0; i < iteration_count; ++i)
    {
        /* code before arrive */
        // This thread arrives. Arrival does not block a thread.
        // Get a handle to the native barrier to use with the cuda::ptx API.
        (void)cuda::ptx::mbarrier_arrive(cuda::device::barrier_native_handle(bar));
        compute(data, i);
        // Wait for all threads participating in the barrier to complete mbarrier_arrive().
        while (!cuda::ptx::mbarrier_try_wait_parity(cuda::device::barrier_native_handle(bar), parity)) {}
        // Flip parity.
        parity ^= 1;
        /* code after wait */
    }
}
CUDA C++ cuda::ptx

#include <cuda/ptx>
#include <cooperative_groups.h>

__device__ void compute(float *data, int iteration);

__global__ void split_arrive_wait(int iteration_count, float *data)
{
    __shared__ uint64_t bar;
    int parity = 0; // Initial phase parity is 0.
    auto block = cooperative_groups::this_thread_block();
    if (block.thread_rank() == 0)
    {
        // Initialize barrier with expected arrival count.
        cuda::ptx::mbarrier_init(&bar, block.size());
    }
    block.sync();

    for (int i = 0; i < iteration_count; ++i)
    {
        /* code before arrive */
        // This thread arrives. Arrival does not block a thread.
        (void)cuda::ptx::mbarrier_arrive(&bar);
        compute(data, i);
        // Wait for all threads participating in the barrier to complete mbarrier_arrive().
        while (!cuda::ptx::mbarrier_try_wait_parity(&bar, parity)) {}
        // Flip parity.
        parity ^= 1;
        /* code after wait */
    }
}
CUDA C primitives

#include <cuda_awbarrier_primitives.h>
#include <cooperative_groups.h>

__device__ void compute(float *data, int iteration);

__global__ void split_arrive_wait(int iteration_count, float *data)
{
    __shared__ __mbarrier_t bar;
    bool parity = false; // Initial phase parity is false.
    auto block = cooperative_groups::this_thread_block();
    if (block.thread_rank() == 0)
    {
        // Initialize barrier with expected arrival count.
        __mbarrier_init(&bar, block.size());
    }
    block.sync();

    for (int i = 0; i < iteration_count; ++i)
    {
        /* code before arrive */
        // This thread arrives. Arrival does not block a thread.
        (void)__mbarrier_arrive(&bar);
        compute(data, i);
        // Wait for all threads participating in the barrier to complete __mbarrier_arrive().
        while (!__mbarrier_try_wait_parity(&bar, parity, 1000)) {}
        parity ^= 1;
        /* code after wait */
    }
}
4.9.4. Early Exit

When a thread that is participating in a sequence of synchronizations must exit early from that sequence, that thread must explicitly drop out of participation before exiting. The remaining participating threads can proceed normally with subsequent arrive and wait operations.
CUDA C++ cuda::barrier

#include <cuda/barrier>
#include <cooperative_groups.h>

__device__ bool condition_check();

__global__ void early_exit_kernel(int N)
{
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;
    auto block = cooperative_groups::this_thread_block();
    if (block.thread_rank() == 0)
    {
        init(&bar, block.size());
    }
    block.sync();

    for (int i = 0; i < N; ++i)
    {
        if (condition_check())
        {
            bar.arrive_and_drop();
            return;
        }
        // Other threads can proceed normally.
        auto token = bar.arrive();
        /* code between arrive and wait */
        // Wait for all threads to arrive.
        bar.wait(std::move(token));
        /* code after wait */
    }
}
CUDA C primitives

#include <cuda_awbarrier_primitives.h>
#include <cooperative_groups.h>

__device__ bool condition_check();

__global__ void early_exit_kernel(int N)
{
    __shared__ __mbarrier_t bar;
    auto block = cooperative_groups::this_thread_block();
    if (block.thread_rank() == 0)
    {
        __mbarrier_init(&bar, block.size());
    }
    block.sync();

    for (int i = 0; i < N; ++i)
    {
        if (condition_check())
        {
            __mbarrier_token_t token = __mbarrier_arrive_and_drop(&bar);
            return;
        }
        // Other threads can proceed normally.
        __mbarrier_token_t token = __mbarrier_arrive(&bar);
        /* code between arrive and wait */
        // Wait for all threads to arrive.
        while (!__mbarrier_try_wait(&bar, token, 1000)) {}
        /* code after wait */
    }
}
The bar.arrive_and_drop() operation arrives on the barrier to fulfill the participating thread’s obligation to arrive in the current phase, and then decrements the expected arrival count for the next phase so that this thread is no longer expected to arrive on the barrier.
4.9.5. Completion Function

The cuda::barrier API supports an optional completion function. A CompletionFunction of cuda::barrier<Scope, CompletionFunction> is executed once per phase, after the last thread arrives and before any thread is unblocked from the wait. Memory operations performed by the threads that arrived at the barrier during the phase are visible to the thread executing the CompletionFunction, and all memory operations performed within the CompletionFunction are visible to all threads waiting at the barrier once they are unblocked from the wait.
CUDA C++ cuda::barrier

#include <cuda/barrier>
#include <cooperative_groups.h>
#include <functional>
#include <type_traits>
#include <cassert>

namespace cg = cooperative_groups;

__device__ int divergent_compute(int *, int);
__device__ int independent_computation(int *, int);

__global__ void psum(int *data, int n, int *acc)
{
    auto block = cg::this_thread_block();
    constexpr int BlockSize = 128;
    __shared__ int smem[BlockSize];
    assert(BlockSize == block.size());
    assert(n % BlockSize == 0);

    auto completion_fn = [&]
    {
        int sum = 0;
        for (int i = 0; i < BlockSize; ++i)
        {
            sum += smem[i];
        }
        *acc += sum;
    };

    /* Barrier storage.
       Note: the barrier is not default-constructible because
       completion_fn is not default-constructible due to the capture. */
    using completion_fn_t = decltype(completion_fn);
    using barrier_t = cuda::barrier<cuda::thread_scope_block, completion_fn_t>;
    __shared__ typename std::aligned_storage<sizeof(barrier_t), alignof(barrier_t)>::type bar_storage;

    // Initialize barrier.
    barrier_t *bar = (barrier_t *)&bar_storage;
    if (block.thread_rank() == 0)
    {
        assert(*acc == 0);
        assert(blockDim.x == blockDim.y == blockDim.z == 1);
        new (bar) barrier_t{block.size(), completion_fn};
        /* equivalent to: init(bar, block.size(), completion_fn); */
    }
    block.sync();

    // Main loop.
    for (int i = 0; i < n; i += block.size())
    {
        smem[block.thread_rank()] = data[i] + *acc;
        auto token = bar->arrive();
        // We can do independent computation here.
        bar->wait(std::move(token));
        // Shared memory is safe to re-use in the next iteration
        // since all threads are done with it, including the one
        // that did the reduction.
    }
}
4.9.6. Tracking Asynchronous Memory Operations

Asynchronous barriers can be used to track asynchronous memory copies. When an asynchronous copy operation is bound to a barrier, the copy operation automatically increments the expected count of the current barrier phase upon initiation and decrements it upon completion. This mechanism ensures that the barrier’s wait() operation will block until all associated asynchronous memory copies have completed, providing a convenient way to synchronize multiple concurrent memory operations.

Starting with compute capability 9.0, asynchronous barriers in shared memory with thread-block or cluster scope can explicitly track asynchronous memory operations. We refer to these barriers as asynchronous transaction barriers. In addition to the expected arrival count, a barrier object can accept a transaction count, which can be used for tracking the completion of asynchronous transactions. The transaction count tracks the number of asynchronous transactions that are outstanding and yet to be completed, in units specified by the asynchronous memory operation (typically bytes). The transaction count to be tracked by the current phase can be set on arrival with cuda::device::barrier_arrive_tx() or directly with cuda::device::barrier_expect_tx(). When a barrier uses a transaction count, it blocks threads at the wait operation until all the producer threads have performed an arrive and the sum of all the transaction counts reaches an expected value.
CUDA C++ cuda::barrier

#include <cuda/barrier>
#include <cooperative_groups.h>

__global__ void track_kernel()
{
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;
    auto block = cooperative_groups::this_thread_block();
    if (block.thread_rank() == 0)
    {
        init(&bar, block.size());
    }
    block.sync();

    auto token = cuda::device::barrier_arrive_tx(bar, 1, 0);
    bar.wait(cuda::std::move(token));
}
CUDA C++ cuda::ptx

#include <cuda/ptx>
#include <cooperative_groups.h>

__global__ void track_kernel()
{
    __shared__ uint64_t bar;
    auto block = cooperative_groups::this_thread_block();
    if (block.thread_rank() == 0)
    {
        cuda::ptx::mbarrier_init(&bar, block.size());
    }
    block.sync();

    uint64_t token = cuda::ptx::mbarrier_arrive_expect_tx(
        cuda::ptx::sem_release, cuda::ptx::scope_cluster,
        cuda::ptx::space_shared, &bar, 1, 0);
    while (!cuda::ptx::mbarrier_try_wait(&bar, token)) {}
}
In this example, the cuda::device::barrier_arrive_tx() operation constructs an arrival token object associated with the phase synchronization point for the current phase. It then decrements the arrival count by 1 and increments the expected transaction count by 0. Since the transaction count update is 0, the barrier is not tracking any transactions. The subsequent section on Using the Tensor Memory Accelerator (TMA) includes examples of tracking asynchronous memory operations.
4.9.7. Producer-Consumer Pattern Using Barriers

A thread block can be spatially partitioned to allow different threads to perform independent operations. This is most commonly done by assigning threads from different warps within the thread block to specific tasks. This technique is referred to as warp specialization.

This section shows an example of spatial partitioning in a producer-consumer pattern, where one subset of threads produces data that is concurrently consumed by the other (disjoint) subset of threads.

A producer-consumer spatial partitioning pattern requires two one-sided synchronizations to manage a data buffer between the producer and consumer.

| Producer | Consumer |
| --- | --- |
| wait for buffer to be ready to be filled | signal buffer is ready to be filled |
| produce data and fill the buffer | |
| signal buffer is filled | wait for buffer to be filled |
| | consume data in filled buffer |
Producer threads wait for consumer threads to signal that the buffer is ready to be filled; however, consumer threads do not wait for this signal. Consumer threads wait for producer threads to signal that the buffer is filled; however, producer threads do not wait for this signal. For full producer/consumer concurrency this pattern has (at least) double buffering, where each buffer requires two barriers.
CUDA C++ cuda::barrier

#include <cuda/barrier>

using barrier_t = cuda::barrier<cuda::thread_scope_block>;

__device__ void produce(barrier_t ready[], barrier_t filled[], float *buffer, int buffer_len, float *in, int N)
{
    for (int i = 0; i < N / buffer_len; ++i)
    {
        ready[i % 2].arrive_and_wait(); /* wait for buffer_(i%2) to be ready to be filled */
        /* produce, i.e., fill in, buffer_(i%2) */
        barrier_t::arrival_token token = filled[i % 2].arrive(); /* buffer_(i%2) is filled */
    }
}

__device__ void consume(barrier_t ready[], barrier_t filled[], float *buffer, int buffer_len, float *out, int N)
{
    barrier_t::arrival_token token1 = ready[0].arrive(); /* buffer_0 is ready for initial fill */
    barrier_t::arrival_token token2 = ready[1].arrive(); /* buffer_1 is ready for initial fill */
    for (int i = 0; i < N / buffer_len; ++i)
    {
        filled[i % 2].arrive_and_wait(); /* wait for buffer_(i%2) to be filled */
        /* consume buffer_(i%2) */
        barrier_t::arrival_token token3 = ready[i % 2].arrive(); /* buffer_(i%2) is ready to be re-filled */
    }
}

__global__ void producer_consumer_pattern(int N, float *in, float *out, int buffer_len)
{
    constexpr int warpSize = 32;
    /* Shared memory buffer declared below is of size 2 * buffer_len
       so that we can alternatively work between two buffers.
       buffer_0 = buffer and buffer_1 = buffer + buffer_len */
    extern __shared__ float buffer[];
    /* bar[0] and bar[1] track if buffers buffer_0 and buffer_1 are ready to be filled,
       while bar[2] and bar[3] track if buffers buffer_0 and buffer_1 are filled-in respectively */
    #pragma nv_diag_suppress static_var_with_dynamic_init
    __shared__ barrier_t bar[4];
    if (threadIdx.x < 4)
    {
        init(bar + threadIdx.x, blockDim.x);
    }
    __syncthreads();

    if (threadIdx.x < warpSize)
    { produce(bar, bar + 2, buffer, buffer_len, in, N); }
    else
    { consume(bar, bar + 2, buffer, buffer_len, out, N); }
}
CUDA C++ cuda::ptx

#include <cuda/ptx>

__device__ void produce(uint64_t ready[], uint64_t filled[], float *buffer, int buffer_len, float *in, int N)
{
    for (int i = 0; i < N / buffer_len; ++i)
    {
        uint64_t token1 = cuda::ptx::mbarrier_arrive(&ready[i % 2]);
        while (!cuda::ptx::mbarrier_try_wait(&ready[i % 2], token1)) {} /* wait for buffer_(i%2) to be ready to be filled */
        /* produce, i.e., fill in, buffer_(i%2) */
        uint64_t token2 = cuda::ptx::mbarrier_arrive(&filled[i % 2]); /* buffer_(i%2) is filled */
    }
}

__device__ void consume(uint64_t ready[], uint64_t filled[], float *buffer, int buffer_len, float *out, int N)
{
    uint64_t token1 = cuda::ptx::mbarrier_arrive(&ready[0]); /* buffer_0 is ready for initial fill */
    uint64_t token2 = cuda::ptx::mbarrier_arrive(&ready[1]); /* buffer_1 is ready for initial fill */
    for (int i = 0; i < N / buffer_len; ++i)
    {
        uint64_t token3 = cuda::ptx::mbarrier_arrive(&filled[i % 2]);
        while (!cuda::ptx::mbarrier_try_wait(&filled[i % 2], token3)) {} /* wait for buffer_(i%2) to be filled */
        /* consume buffer_(i%2) */
        uint64_t token4 = cuda::ptx::mbarrier_arrive(&ready[i % 2]); /* buffer_(i%2) is ready to be re-filled */
    }
}

__global__ void producer_consumer_pattern(int N, float *in, float *out, int buffer_len)
{
    constexpr int warpSize = 32;
    /* Shared memory buffer declared below is of size 2 * buffer_len
       so that we can alternatively work between two buffers.
       buffer_0 = buffer and buffer_1 = buffer + buffer_len */
    extern __shared__ float buffer[];
    /* bar[0] and bar[1] track if buffers buffer_0 and buffer_1 are ready to be filled,
       while bar[2] and bar[3] track if buffers buffer_0 and buffer_1 are filled-in respectively */
    #pragma nv_diag_suppress static_var_with_dynamic_init
    __shared__ uint64_t bar[4];
    if (threadIdx.x < 4)
    {
        cuda::ptx::mbarrier_init(bar + threadIdx.x, blockDim.x);
    }
    __syncthreads();

    if (threadIdx.x < warpSize)
    { produce(bar, bar + 2, buffer, buffer_len, in, N); }
    else
    { consume(bar, bar + 2, buffer, buffer_len, out, N); }
}
CUDAProgrammingGuide,Release13.1
CUDA C primitives
#include <cuda_awbarrier_primitives.h>

__device__ void produce(__mbarrier_t ready[], __mbarrier_t filled[], float *buffer, int buffer_len, float *in, int N)
{
    for (int i = 0; i < N / buffer_len; ++i)
    {
        __mbarrier_token_t token1 = __mbarrier_arrive(&ready[i % 2]); /* wait for buffer_(i%2) to be ready to be filled */
        while (!__mbarrier_try_wait(&ready[i % 2], token1, 1000)) {}
        /* produce, i.e., fill in, buffer_(i%2) */
        __mbarrier_token_t token2 = __mbarrier_arrive(&filled[i % 2]); /* buffer_(i%2) is filled */
    }
}

__device__ void consume(__mbarrier_t ready[], __mbarrier_t filled[], float *buffer, int buffer_len, float *out, int N)
{
    __mbarrier_token_t token1 = __mbarrier_arrive(&ready[0]); /* buffer_0 is ready for initial fill */
    __mbarrier_token_t token2 = __mbarrier_arrive(&ready[1]); /* buffer_1 is ready for initial fill */
    for (int i = 0; i < N / buffer_len; ++i)
    {
        __mbarrier_token_t token3 = __mbarrier_arrive(&filled[i % 2]);
        while (!__mbarrier_try_wait(&filled[i % 2], token3, 1000)) {}
        /* consume buffer_(i%2) */
        __mbarrier_token_t token4 = __mbarrier_arrive(&ready[i % 2]); /* buffer_(i%2) is ready to be re-filled */
    }
}

__global__ void producer_consumer_pattern(int N, float *in, float *out, int buffer_len)
{
    constexpr int warpSize = 32;
    /* Shared memory buffer declared below is of size 2 * buffer_len
       so that we can alternately work between two buffers.
       buffer_0 = buffer and buffer_1 = buffer + buffer_len */
    extern __shared__ float buffer[];
    /* bar[0] and bar[1] track if buffers buffer_0 and buffer_1 are ready to be filled,
       while bar[2] and bar[3] track if buffers buffer_0 and buffer_1 are filled-in respectively */
    #pragma nv_diag_suppress static_var_with_dynamic_init
    __shared__ __mbarrier_t bar[4];
    if (threadIdx.x < 4)
    {
        __mbarrier_init(bar + threadIdx.x, blockDim.x);
    }
    __syncthreads();
    if (threadIdx.x < warpSize)
    { produce(bar, bar + 2, buffer, buffer_len, in, N); }
    else
    { consume(bar, bar + 2, buffer, buffer_len, out, N); }
}
In this example, the first warp is specialized as the producer and the remaining warps are specialized as consumers. All producer and consumer threads participate (call bar.arrive() or bar.arrive_and_wait()) in each of the four barriers, so the expected arrival counts are equal to block.size().

A producer thread waits for the consumer threads to signal that the shared memory buffer can be filled. In order to wait for a barrier, a producer thread must first arrive on it with ready[i%2].arrive() to get a token, and then call ready[i%2].wait(token) with that token. For simplicity, ready[i%2].arrive_and_wait() combines these operations.
bar.arrive_and_wait();
/* is equivalent to */
bar.wait(bar.arrive());
Producer threads compute and fill the ready buffer; they then signal that the buffer is filled by arriving on the filled barrier, filled[i%2].arrive(). A producer thread does not wait at this point; instead it waits until the next iteration's buffer (double buffering) is ready to be filled.

A consumer thread begins by signaling that both buffers are ready to be filled. A consumer thread does not wait at this point; instead it waits for this iteration's buffer to be filled, filled[i%2].arrive_and_wait(). After the consumer threads consume the buffer, they signal that the buffer is ready to be filled again, ready[i%2].arrive(), and then wait for the next iteration's buffer to be filled.
4.10. Pipelines
Pipelines, introduced in Advanced Synchronization Primitives, are a mechanism for staging work and coordinating multi-buffer producer–consumer patterns, commonly used to overlap compute with asynchronous data copies.

This section provides details on how to use pipelines, mainly via the cuda::pipeline API (with pointers to primitives where applicable).
4.10.1. Initialization
A cuda::pipeline can be created at different thread scopes. For a scope other than cuda::thread_scope_thread, a cuda::pipeline_shared_state<scope, count> object is required to coordinate the participating threads. This state encapsulates the finite resources that allow a pipeline to process up to count concurrent stages.
// Create a pipeline at thread scope
constexpr auto scope = cuda::thread_scope_thread;
cuda::pipeline<scope> pipeline = cuda::make_pipeline();

// Create a pipeline at block scope
constexpr auto scope = cuda::thread_scope_block;
constexpr auto stages_count = 2;
__shared__ cuda::pipeline_shared_state<scope, stages_count> shared_state;
auto pipeline = cuda::make_pipeline(group, &shared_state);
Pipelines can be either unified or partitioned. In a unified pipeline, all the participating threads are both producers and consumers. In a partitioned pipeline, each participating thread is either a producer or a consumer, and its role cannot change during the lifetime of the pipeline object. A thread-local pipeline cannot be partitioned. To create a partitioned pipeline, we need to provide either the number of producers or the role of the thread to cuda::make_pipeline().
// Create a partitioned pipeline at block scope where only thread 0 is a producer
constexpr auto scope = cuda::thread_scope_block;
constexpr auto stages_count = 2;
__shared__ cuda::pipeline_shared_state<scope, stages_count> shared_state;
auto thread_role = (group.thread_rank() == 0) ? cuda::pipeline_role::producer : cuda::pipeline_role::consumer;
auto pipeline = cuda::make_pipeline(group, &shared_state, thread_role);
To support partitioning, a shared cuda::pipeline incurs additional overheads, including using a set of shared memory barriers per stage for synchronization. These are used even when the pipeline is unified and could use __syncthreads() instead. Thus, it is preferable to use thread-local pipelines, which avoid these overheads, whenever possible.
4.10.2. Submitting Work
Committing work to a pipeline stage involves:

▶ Collectively acquiring the pipeline head from a set of producer threads using pipeline.producer_acquire().

▶ Submitting asynchronous operations, e.g., memcpy_async, to the pipeline head.

▶ Collectively committing (advancing) the pipeline head using pipeline.producer_commit().

If all resources are in use, pipeline.producer_acquire() blocks producer threads until the resources of the next pipeline stage are released by consumer threads.
4.10.3. Consuming Work
Consuming work from a previously committed stage involves:

▶ Collectively waiting for the stage to complete, e.g., using pipeline.consumer_wait() to wait on the tail (oldest) stage, from a set of consumer threads.

▶ Collectively releasing the stage using pipeline.consumer_release().

With cuda::pipeline<cuda::thread_scope_thread>, one can also use the cuda::pipeline_consumer_wait_prior<N>() friend function to wait for all except the last N stages to complete, similar to __pipeline_wait_prior(N) in the primitives API.
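Taken together with the previous section, a single stage of a thread-scope pipeline follows an acquire/submit/commit/wait/release sequence. The sketch below illustrates this round trip; the device function name and the single-float copy are illustrative only:

    #include <cuda/pipeline>

    __device__ void one_stage_roundtrip(float *smem, const float *gmem)
    {
        // Thread-scope pipeline: this thread acts as both producer and consumer.
        cuda::pipeline<cuda::thread_scope_thread> pipe = cuda::make_pipeline();

        pipe.producer_acquire();                              // acquire the head stage
        cuda::memcpy_async(smem, gmem, sizeof(float), pipe);  // submit an async copy to it
        pipe.producer_commit();                               // advance (commit) the head

        pipe.consumer_wait();                                 // wait on the tail (oldest) stage
        // ... use the data in smem ...
        pipe.consumer_release();                              // release the stage's resources
    }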
4.10.4. Warp Entanglement
The pipeline mechanism is shared among CUDA threads in the same warp. This sharing causes sequences of submitted operations to be entangled within a warp, which can impact performance under certain circumstances.

Commit. The commit operation is coalesced such that the pipeline's sequence is incremented once for all converged threads that invoke the commit operation, and their submitted operations are batched together. If the warp is fully converged, the sequence is incremented by one and all submitted operations will be batched in the same stage of the pipeline; if the warp is fully diverged, the sequence is incremented by 32 and all submitted operations will be spread to different stages.
▶ Let PB be the warp-shared pipeline's actual sequence of operations.

  PB = {BP0, BP1, BP2, …, BPL}

▶ Let TB be a thread's perceived sequence of operations, as if the sequence were only incremented by this thread's invocation of the commit operation.

  TB = {BT0, BT1, BT2, …, BTL}

  The pipeline::producer_commit() return value is from the thread's perceived batch sequence.

  An index in a thread's perceived sequence always aligns to an equal or larger index in the actual warp-shared sequence. The sequences are equal only when all commit operations are invoked from fully converged threads.

  BTn ≡ BPm where n <= m

For example, when a warp is fully diverged:

▶ The warp-shared pipeline's actual sequence would be: PB = {0, 1, 2, 3, ..., 31} (PL=31).

▶ The perceived sequence for each thread of this warp would be:

  ▶ Thread 0: TB = {0} (TL=0)
  ▶ Thread 1: TB = {0} (TL=0)
  ▶ …
  ▶ Thread 31: TB = {0} (TL=0)

Wait. A CUDA thread invokes pipeline::consumer_wait() or pipeline_consumer_wait_prior<N>() to wait for batches in the perceived sequence TB to complete. Note that pipeline::consumer_wait() is equivalent to pipeline_consumer_wait_prior<N>(), where N = PL.
The wait-prior variants wait for batches in the actual sequence at least up to and including PL-N. Since TL <= PL, waiting for batches up to and including PL-N includes waiting for batch TL-N. Thus, when TL < PL, the thread will unintentionally wait for additional, more recent batches. In the extreme fully-diverged warp example above, each thread could wait for all 32 batches.
Note

It is recommended that commit invocations be made by converged threads to avoid over-waiting, by keeping threads' perceived sequence of batches aligned with the actual sequence.

When code preceding these operations diverges threads, the warp should be re-converged via __syncwarp before invoking commit operations.
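As a sketch of this recommendation (the divergent branch and the pipe object are illustrative, not part of the guide's examples):

    // Assume `pipe` is a warp-shared pipeline and async copies were
    // submitted inside a divergent branch.
    if (threadIdx.x % 2 == 0) {
        // ... only even lanes submitted additional copies ...
    }
    __syncwarp();            // re-converge the full warp first
    pipe.producer_commit();  // converged commit: the sequence advances once for the warp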
4.10.5. Early Exit
When a thread that is participating in a pipeline must exit early, that thread must explicitly drop out of participation before exiting using cuda::pipeline::quit(). The remaining participating threads can proceed normally with subsequent operations.
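For instance, a minimal sketch of dropping out early (the exit condition and the pipe object are illustrative):

    if (out_of_bounds) {
        pipe.quit();  // explicitly drop out of pipeline participation
        return;       // now safe to exit; remaining participants proceed normally
    }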
4.10.6. Tracking Asynchronous Memory Operations
The following example demonstrates how to collectively copy data from global to shared memory with asynchronous memory copies, using a pipeline to keep track of the copy operations. Each thread uses its own pipeline to independently submit memory copies and then wait for them to complete and consume the data. For more details on asynchronous data copies, see Section 3.2.5.
CUDA C++ cuda::pipeline
#include <cuda/pipeline>

__global__ void example_kernel(const float *in)
{
    constexpr int block_size = 128;
    __shared__ __align__(sizeof(float)) float buffer[4 * block_size];

    // Create a unified pipeline per thread
    cuda::pipeline<cuda::thread_scope_thread> pipeline = cuda::make_pipeline();

    // First stage of memory copies
    pipeline.producer_acquire();
    // Every thread fetches one element of the first block
    cuda::memcpy_async(buffer, in, sizeof(float), pipeline);
    pipeline.producer_commit();

    // Second stage of memory copies
    pipeline.producer_acquire();
    // Every thread fetches one element of the second and third block
    cuda::memcpy_async(buffer + block_size, in + block_size, sizeof(float), pipeline);
    cuda::memcpy_async(buffer + 2 * block_size, in + 2 * block_size, sizeof(float), pipeline);
    pipeline.producer_commit();

    // Third stage of memory copies
    pipeline.producer_acquire();
    // Every thread fetches one element of the last block
    cuda::memcpy_async(buffer + 3 * block_size, in + 3 * block_size, sizeof(float), pipeline);
    pipeline.producer_commit();

    // Wait for the oldest stage (waits for first stage)
    pipeline.consumer_wait();
    pipeline.consumer_release();
    // __syncthreads();
    // Use data from the first stage

    // Wait for the oldest stage (waits for second stage)
    pipeline.consumer_wait();
    pipeline.consumer_release();
    // __syncthreads();
    // Use data from the second stage

    // Wait for the oldest stage (waits for third stage)
    pipeline.consumer_wait();
    pipeline.consumer_release();
    // __syncthreads();
    // Use data from the third stage
}
CUDA C primitives
#include <cuda_pipeline.h>

__global__ void example_kernel(const float *in)
{
    constexpr int block_size = 128;
    __shared__ __align__(sizeof(float)) float buffer[4 * block_size];

    // First batch of memory copies
    // Every thread fetches one element of the first block
    __pipeline_memcpy_async(buffer, in, sizeof(float));
    __pipeline_commit();

    // Second batch of memory copies
    // Every thread fetches one element of the second and third block
    __pipeline_memcpy_async(buffer + block_size, in + block_size, sizeof(float));
    __pipeline_memcpy_async(buffer + 2 * block_size, in + 2 * block_size, sizeof(float));
    __pipeline_commit();

    // Third batch of memory copies
    // Every thread fetches one element of the last block
    __pipeline_memcpy_async(buffer + 3 * block_size, in + 3 * block_size, sizeof(float));
    __pipeline_commit();

    // Wait for all except the last two batches of memory copies (waits for first batch)
    __pipeline_wait_prior(2);
    // __syncthreads();
    // Use data from the first batch

    // Wait for all except the last batch of memory copies (waits for second batch)
    __pipeline_wait_prior(1);
    // __syncthreads();
    // Use data from the second batch

    // Wait for all batches of memory copies (waits for third batch)
    __pipeline_wait_prior(0);
    // __syncthreads();
    // Use data from the last batch
}
4.10.7. Producer-Consumer Pattern using Pipelines
In Section 4.9.7, we showed how a thread block can be spatially partitioned to implement a producer-consumer pattern using asynchronous barriers. With cuda::pipeline, this can be simplified using a single partitioned pipeline with one stage per data buffer, instead of two asynchronous barriers per buffer.
CUDA C++ cuda::pipeline
#include <cuda/pipeline>
#include <cooperative_groups.h>

#pragma nv_diag_suppress static_var_with_dynamic_init

using pipeline = cuda::pipeline<cuda::thread_scope_block>;

__device__ void produce(pipeline &pipe, int num_stages, int stage, int num_batches, int batch, float *buffer, int buffer_len, float *in, int N)
{
    if (batch < num_batches)
    {
        pipe.producer_acquire();
        /* copy data from in(batch) to buffer(stage) using asynchronous memory copies */
        pipe.producer_commit();
    }
}

__device__ void consume(pipeline &pipe, int num_stages, int stage, int num_batches, int batch, float *buffer, int buffer_len, float *out, int N)
{
    pipe.consumer_wait();
    /* consume buffer(stage) and update out(batch) */
    pipe.consumer_release();
}

__global__ void producer_consumer_pattern(float *in, float *out, int N, int buffer_len)
{
    auto block = cooperative_groups::this_thread_block();
    /* Shared memory buffer declared below is of size 2 * buffer_len
       so that we can alternately work between two buffers.
       buffer_0 = buffer and buffer_1 = buffer + buffer_len */
    extern __shared__ float buffer[];
    const int num_batches = N / buffer_len;

    // Create a partitioned pipeline with 2 stages where half the threads are
    // producers and the other half are consumers.
    constexpr auto scope = cuda::thread_scope_block;
    constexpr int num_stages = 2;
    cuda::std::size_t producer_count = block.size() / 2;
    __shared__ cuda::pipeline_shared_state<scope, num_stages> shared_state;
    pipeline pipe = cuda::make_pipeline(block, &shared_state, producer_count);

    // Fill the pipeline
    if (block.thread_rank() < producer_count)
    {
        for (int s = 0; s < num_stages; ++s)
        {
            produce(pipe, num_stages, s, num_batches, s, buffer, buffer_len, in, N);
        }
    }

    // Process the batches
    int stage = 0;
    for (size_t b = 0; b < num_batches; ++b)
    {
        if (block.thread_rank() < producer_count)
        {
            // Prefetch the next batch
            produce(pipe, num_stages, stage, num_batches, b + num_stages, buffer, buffer_len, in, N);
        }
        else
        {
            // Consume the oldest batch
            consume(pipe, num_stages, stage, num_batches, b, buffer, buffer_len, out, N);
        }
        stage = (stage + 1) % num_stages;
    }
}
In this example, we use half of the threads in the thread block as producers and the other half as consumers. As a first step, we need to create a cuda::pipeline object. Since we want some threads to be producers and some to be consumers, we need to use a partitioned pipeline with cuda::thread_scope_block. Partitioned pipelines require a cuda::pipeline_shared_state to coordinate the participating threads. We initialize the state for a 2-stage pipeline in thread-block scope and then call cuda::make_pipeline(). Next, producer threads fill the pipeline by submitting asynchronous copies from in to buffer. At this point all data copies are in-flight. Finally, in the main loop, we go over all of the batches of data and, depending on whether a thread is a producer or a consumer, we either submit another asynchronous copy for a future batch or consume the current batch.
4.11. Asynchronous Data Copies
Building on Section 3.2.5, this section provides detailed guidance and examples for asynchronous data movement within the GPU memory hierarchy. It covers LDGSTS for element-wise copies, the Tensor Memory Accelerator (TMA) for bulk (one-dimensional and multi-dimensional) transfers, and STAS for register to distributed shared memory copies, and shows how these mechanisms integrate with asynchronous barriers and pipelines.
4.11.1. Using LDGSTS
Many CUDA applications require frequent data movement between global and shared memory. Often, this involves copying smaller data elements or performing irregular memory access patterns. The primary goal of LDGSTS (CC 8.0+, see PTX documentation) is to provide an efficient asynchronous data transfer mechanism from global memory to shared memory for smaller, element-wise data transfers, while enabling better utilization of compute resources through overlapped execution.

Dimensions. LDGSTS supports copying 4, 8, or 16 bytes. Copying 4 or 8 bytes always happens in the so-called L1 ACCESS mode, in which case data is also cached in the L1, while copying 16 bytes enables the L1 BYPASS mode, in which case the L1 is not polluted.

Source and destination. The only direction supported for asynchronous copy operations with LDGSTS is from global to shared memory. The pointers need to be aligned to 4, 8, or 16 bytes depending on the size of the data being copied. Best performance is achieved when the alignment of both shared memory and global memory is 128 bytes.

Asynchronicity. Data transfers using LDGSTS are asynchronous and are modeled as async thread operations (see Async Thread and Async Proxy). This allows the initiating thread to continue computing while the hardware asynchronously copies the data. Whether the data transfer occurs asynchronously in practice is up to the hardware implementation and may change in the future.

LDGSTS must provide a signal when the operation is complete. LDGSTS can use shared memory barriers or pipelines as mechanisms to provide completion signals. By default, each thread only waits for its own LDGSTS copies. Thus, if you use LDGSTS to prefetch some data that will be shared with other threads, a __syncthreads() is necessary after synchronizing with the LDGSTS completion mechanism.
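As a sketch of the last point: if one thread prefetches data that the whole block will read, the block must synchronize after the completion wait. This uses the primitives API; the kernel name and single-element copy are illustrative only:

    #include <cuda_pipeline.h>

    __global__ void prefetch_kernel(const float *gsrc)
    {
        __shared__ float smem;
        if (threadIdx.x == 0) {
            __pipeline_memcpy_async(&smem, gsrc, sizeof(float));
            __pipeline_commit();
            __pipeline_wait_prior(0);  // thread 0 waits only for its own copy
        }
        __syncthreads();               // make the prefetched value visible to the whole block
        // every thread can now read smem
    }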
Table 18: Asynchronous copies with possible source and destination memory spaces and completion mechanisms using LDGSTS (CC 8.0+). An empty cell indicates that a source-destination pair is not supported.

| Source | Destination | Completion Mechanism | API |
| ------ | ----------- | -------------------- | --- |
| global | global | | |
| shared::cta | global | | |
| global | shared::cta | shared memory barrier, pipeline | cuda::memcpy_async, cooperative_groups::memcpy_async, __pipeline_memcpy_async |
| global | shared::cluster | | |
| shared::cluster | shared::cta | | |
| shared::cta | shared::cta | | |
In the following sections, we will demonstrate how to use LDGSTS through examples and explain the differences between the different APIs.
4.11.1.1 Batching Loads in Conditional Code
In this stencil example, the first warp of the thread block is responsible for collectively loading all the required data from the center as well as the left and right halos. With synchronous copies, due to the conditional nature of the code, the compiler may choose to generate a sequence of load-from-global (LDG) store-to-shared (STS) instructions instead of 3 LDGs followed by 3 STSs, which would be the optimal way to load the data to hide the global memory latency.
__global__ void stencil_kernel(const float *left, const float *center, const float *right)
{
    // Left halo (8 elements) - center (32 elements) - right halo (8 elements)
    __shared__ float buffer[8 + 32 + 8];
    const int tid = threadIdx.x;
    if (tid < 8) {
        buffer[tid] = left[tid];        // Left halo
    } else if (tid >= 32 - 8) {
        buffer[tid + 16] = right[tid];  // Right halo
    }
    if (tid < 32) {
        buffer[tid + 8] = center[tid];  // Center
    }
    __syncthreads();
    // Compute stencil
}
To ensure that the data is loaded in the optimal way, we can replace the synchronous memory copies with asynchronous copies that load data directly from global memory to shared memory. This not only reduces register usage by copying the data directly to shared memory, but also ensures all loads from global memory are in-flight.
CUDA C++ cuda::memcpy_async
#include <cooperative_groups.h>
#include <cuda/barrier>

__global__ void stencil_kernel(const float *left, const float *center, const float *right)
{
    auto block = cooperative_groups::this_thread_block();