Created
February 25, 2026 21:22
CUDA Programming Guide
Release 13.1
NVIDIA Corporation
Dec 12, 2025

Contents
1 Introduction to CUDA   3
1.1 Introduction   3
1.1.1 The Graphics Processing Unit   3
1.1.2 The Benefits of Using GPUs   3
1.1.3 Getting Started Quickly   4
1.2 Programming Model   4
1.2.1 Heterogeneous Systems   5
1.2.2 GPU Hardware Model   5
1.2.2.1 Thread Blocks and Grids   5
1.2.2.2 Warps and SIMT   7
1.2.3 GPU Memory   11
1.2.3.1 DRAM Memory in Heterogeneous Systems   11
1.2.3.2 On-Chip Memory in GPUs   11
1.2.3.3 Unified Memory   12
1.3 The CUDA Platform   12
1.3.1 Compute Capability and Streaming Multiprocessor Versions   13
1.3.2 CUDA Toolkit and NVIDIA Driver   13
1.3.2.1 CUDA Runtime API and CUDA Driver API   13
1.3.3 Parallel Thread Execution (PTX)   13
1.3.4 Cubins and Fatbins   14
1.3.4.1 Binary Compatibility   14
1.3.4.2 PTX Compatibility   16
1.3.4.3 Just-in-Time Compilation   16
2 Programming GPUs in CUDA   17
2.1 Intro to CUDA C++   17
2.1.1 Compilation with NVCC   17
2.1.2 Kernels   17
2.1.2.1 Specifying Kernels   17
2.1.2.2 Launching Kernels   18
2.1.2.3 Thread and Grid Index Intrinsics   19
2.1.3 Memory in GPU Computing   21
2.1.3.1 Unified Memory   21
2.1.3.2 Explicit Memory Management   22
2.1.3.3 Memory Management and Application Performance   24
2.1.4 Synchronizing CPU and GPU   24
2.1.5 Putting it All Together   24
2.1.6 Runtime Initialization   30
2.1.7 Error Checking in CUDA   30
2.1.7.1 Error State   31
2.1.7.2 Asynchronous Errors   31
2.1.7.3 CUDA_LOG_FILE   32
2.1.8 Device and Host Functions   33
2.1.9 Variable Specifiers   33
2.1.9.1 Detecting Device Compilation   33
2.1.10 Thread Block Clusters   34
2.1.10.1 Launching with Clusters in Triple Chevron Notation   34
2.2 Writing CUDA SIMT Kernels   35
2.2.1 Basics of SIMT   35
2.2.2 Thread Hierarchy   35
2.2.3 GPU Device Memory Spaces   36
2.2.3.1 Global Memory   36
2.2.3.2 Shared Memory   37
2.2.3.3 Registers   39
2.2.3.4 Local Memory   39
2.2.3.5 Constant Memory   40
2.2.3.6 Caches   40
2.2.3.7 Texture and Surface Memory   41
2.2.3.8 Distributed Shared Memory   41
2.2.4 Memory Performance   43
2.2.4.1 Coalesced Global Memory Access   44
2.2.4.2 Shared Memory Access Patterns   46
2.2.5 Atomics   53
2.2.6 Cooperative Groups   54
2.2.7 Kernel Launch and Occupancy   54
2.3 Asynchronous Execution   56
2.3.1 What is Asynchronous Concurrent Execution?   56
2.3.2 CUDA Streams   57
2.3.2.1 Creating and Destroying CUDA Streams   57
2.3.2.2 Launching Kernels in CUDA Streams   58
2.3.2.3 Launching Memory Transfers in CUDA Streams   58
2.3.2.4 Stream Synchronization   58
2.3.3 CUDA Events   59
2.3.3.1 Creating and Destroying CUDA Events   60
2.3.3.2 Inserting Events into CUDA Streams   60
2.3.3.3 Timing Operations in CUDA Streams   60
2.3.3.4 Checking the Status of CUDA Events   61
2.3.4 Callback Functions from Streams   63
2.3.4.1 Using cudaStreamAddCallback()   64
2.3.4.2 Asynchronous Error Handling   65
2.3.5 CUDA Stream Ordering   66
2.3.6 Blocking and Non-blocking Streams and the Default Stream   66
2.3.6.1 Legacy Default Stream   66
2.3.6.2 Per-thread Default Stream   67
2.3.7 Explicit Synchronization   67
2.3.8 Implicit Synchronization   68
2.3.9 Miscellaneous and Advanced Topics   68
2.3.9.1 Stream Prioritization   68
2.3.9.2 Introduction to CUDA Graphs with Stream Capture   68
2.3.10 Summary of Asynchronous Execution   70
2.4 Unified and System Memory   70
2.4.1 Unified Virtual Address Space   71
2.4.2 Unified Memory   71
2.4.2.1 Unified Memory Paradigms   72
2.4.2.2 Full Unified Memory Feature Support   74
2.4.2.3 Limited Unified Memory Support   75
2.4.2.4 Memory Advise and Prefetch   76
2.4.3 Page-Locked Host Memory   76
2.4.3.1 Mapped Memory   77
2.4.4 Summary   79
2.5 NVCC: The NVIDIA CUDA Compiler   80
2.5.1 CUDA Source Files and Headers   80
2.5.2 NVCC Compilation Workflow   80
2.5.3 NVCC Basic Usage   81
2.5.3.1 NVCC PTX and Cubin Generation   83
2.5.3.2 Host Code Compilation Notes   84
2.5.3.3 Separate Compilation of GPU Code   84
2.5.4 Common Compiler Options   85
2.5.4.1 Language Features   85
2.5.4.2 Debugging Options   85
2.5.4.3 Optimization Options   86
2.5.4.4 Link-Time Optimization (LTO)   86
2.5.4.5 Profiling Options   86
2.5.4.6 Fatbin Compression   87
2.5.4.7 Compiler Performance Controls   87
3 Advanced CUDA   89
3.1 Advanced CUDA APIs and Features   89
3.1.1 cudaLaunchKernelEx   89
3.1.2 Launching Clusters   89
3.1.2.1 Launching with Clusters using cudaLaunchKernelEx   90
3.1.2.2 Blocks as Clusters   91
3.1.3 More on Streams and Events   91
3.1.3.1 Stream Priorities   93
3.1.3.2 Explicit Synchronization   93
3.1.3.3 Implicit Synchronization   93
3.1.4 Programmatic Dependent Kernel Launch   94
3.1.5 Batched Memory Transfers   96
3.1.6 Environment Variables   99
3.2 Advanced Kernel Programming   100
3.2.1 Using PTX   100
3.2.2 Hardware Implementation   100
3.2.2.1 SIMT Execution Model   101
3.2.2.2 Hardware Multithreading   102
3.2.2.3 Asynchronous Execution Features   103
3.2.3 Thread Scopes   104
3.2.4 Advanced Synchronization Primitives   105
3.2.4.1 Scoped Atomics   105
3.2.4.2 Asynchronous Barriers   107
3.2.4.3 Pipelines   113
3.2.5 Asynchronous Data Copies   113
3.2.6 Configuring L1/Shared Memory Balance   117
3.3 The CUDA Driver API   118
3.3.1 Context   121
3.3.2 Module   122
3.3.3 Kernel Execution   123
3.3.4 Interoperability between Runtime and Driver APIs   125
3.4 Programming Systems with Multiple GPUs   125
3.4.1 Multi-Device Context and Execution Management   126
3.4.1.1 Device Enumeration   127
3.4.1.2 Device Selection   127
3.4.1.3 Multi-Device Stream, Event, and Memory Copy Behavior   127
3.4.2 Multi-Device Peer-to-Peer Transfers and Memory Access   128
3.4.2.1 Peer-to-Peer Memory Transfers   128
3.4.2.2 Peer-to-Peer Memory Access   129
3.4.2.3 Peer-to-Peer Memory Consistency   129
3.4.2.4 Multi-Device Managed Memory   130
3.4.2.5 Host IOMMU Hardware, PCI Access Control Services, and VMs   130
3.5 A Tour of CUDA Features   130
3.5.1 Improving Kernel Performance   130
3.5.1.1 Asynchronous Barriers   130
3.5.1.2 Asynchronous Data Copies and the Tensor Memory Accelerator (TMA)   130
3.5.1.3 Pipelines   131
3.5.1.4 Work Stealing with Cluster Launch Control   131
3.5.2 Improving Latencies   131
3.5.2.1 Green Contexts   131
3.5.2.2 Stream-Ordered Memory Allocation   131
3.5.2.3 CUDA Graphs   132
3.5.2.4 Programmatic Dependent Launch   132
3.5.2.5 Lazy Loading   132
3.5.3 Functionality Features   132
3.5.3.1 Extended GPU Memory   132
3.5.3.2 Dynamic Parallelism   132
3.5.4 CUDA Interoperability   133
3.5.4.1 CUDA Interoperability with Other APIs   133
3.5.4.2 Interprocess Communication   133
3.5.5 Fine-Grained Control   133
3.5.5.1 Virtual Memory Management   133
3.5.5.2 Driver Entry Point Access   133
3.5.5.3 Error Log Management   134
4 CUDA Features   135
4.1 Unified Memory   135
4.1.1 Unified Memory on Devices with Full CUDA Unified Memory Support   135
4.1.1.1 Unified Memory: In-Depth Examples   136
4.1.1.2 Performance Tuning   139
4.1.2 Unified Memory on Devices with only CUDA Managed Memory Support   149
4.1.3 Unified Memory on Windows, WSL, and Tegra   150
4.1.3.1 Multi-GPU   150
4.1.3.2 Coherency and Concurrency   151
4.1.3.3 Stream Associated Unified Memory   151
4.1.4 Performance Hints   155
4.1.4.1 Data Prefetching   156
4.1.4.2 Data Usage Hints   157
4.1.4.3 Querying Data Usage Attributes on Managed Memory   160
4.1.4.4 GPU Memory Oversubscription   161
4.2 CUDA Graphs   162
4.2.1 Graph Structure   162
4.2.1.1 Node Types   162
4.2.1.2 Edge Data   163
4.2.2 Building and Running Graphs   164
4.2.2.1 Graph Creation   164
4.2.2.2 Graph Instantiation   172
4.2.2.3 Graph Execution   172
4.2.3 Updating Instantiated Graphs   172
4.2.3.1 Whole Graph Update   173
4.2.3.2 Individual Node Update   175
4.2.3.3 Individual Node Enable   175
4.2.3.4 Graph Update Limitations   175
4.2.4 Conditional Graph Nodes   176
4.2.4.1 Conditional Handles   177
4.2.4.2 Conditional Node Body Graph Requirements   177
4.2.4.3 Conditional IF Nodes   177
4.2.4.4 Conditional WHILE Nodes   180
4.2.4.5 Conditional SWITCH Nodes   182
4.2.5 Graph Memory Nodes   184
4.2.5.1 Introduction   184
4.2.5.2 API Fundamentals   184
4.2.5.3 Optimized Memory Reuse   192
4.2.5.4 Performance Considerations   195
4.2.5.5 Physical Memory Footprint   197
4.2.5.6 Peer Access   197
4.2.6 Device Graph Launch   198
4.2.6.1 Device Graph Creation   199
4.2.6.2 Device Launch   200
4.2.7 Using Graph APIs   207
4.2.8 CUDA User Objects   207
4.3 Stream-Ordered Memory Allocator   209
4.3.1 Introduction   209
4.3.2 Memory Management   210
4.3.2.1 Allocating Memory   210
4.3.2.2 Freeing Memory   211
4.3.3 Memory Pools   211
4.3.3.1 Default/Implicit Pools   212
4.3.3.2 Explicit Pools   212
4.3.3.3 Device Accessibility for Multi-GPU Support   213
4.3.3.4 Enabling Memory Pools for IPC   213
4.3.4 Best Practices and Tuning   217
4.3.4.1 Query for Support   217
4.3.4.2 Physical Page Caching Behavior   217
4.3.4.3 Resource Usage Statistics   218
4.3.4.4 Memory Reuse Policies   219
4.3.4.5 Synchronization API Actions   221
4.3.5 Addendums   221
4.3.5.1 cudaMemcpyAsync Current Context/Device Sensitivity   221
4.3.5.2 cudaPointerGetAttributes Query   221
4.3.5.3 cudaGraphAddMemsetNode   221
4.3.5.4 Pointer Attributes   221
4.3.5.5 CPU Virtual Memory   221
4.4 Cooperative Groups   221
4.4.1 Introduction   222
4.4.2 Cooperative Group Handle & Member Functions   222
4.4.3 Default Behavior/Groupless Execution   222
4.4.3.1 Create Implicit Group Handles As Early As Possible   223
4.4.3.2 Only Pass Group Handles by Reference   223
4.4.4 Creating Cooperative Groups   223
4.4.4.1 Avoiding Group Creation Hazards   224
4.4.5 Synchronization   224
4.4.5.1 Sync   224
4.4.5.2 Barriers   224
4.4.6 Collective Operations   225
4.4.6.1 Reduce   225
4.4.6.2 Scans   226
4.4.6.3 InvokeOne   226
4.4.7 Asynchronous Data Movement   227
4.4.7.1 Memcpy Async Alignment Requirements   227
4.4.8 Large Scale Groups   228
4.4.8.1 When to use cudaLaunchCooperativeKernel   228
4.5 Programmatic Dependent Launch and Synchronization   228
4.5.1 Background   228
4.5.2 API Description   230
4.5.3 Use in CUDA Graphs   231
4.6 Green Contexts   231
4.6.1 Motivation/When to Use   232
4.6.2 Green Contexts: Ease of Use   234
4.6.3 Green Contexts: Device Resource and Resource Descriptor   236
4.6.4 Green Context Creation Example   237
4.6.4.1 Step 1: Get Available GPU Resources   238
4.6.4.2 Step 2: Partition SM Resources   239
4.6.4.3 Step 2 (continued): Add Work Queue Resources   246
4.6.4.4 Step 3: Create a Resource Descriptor   246
4.6.4.5 Step 4: Create a Green Context   247
4.6.5 Green Contexts: Launching Work   247
4.6.6 Additional Execution Contexts APIs   250
4.6.7 Green Contexts Example   251
4.7 Lazy Loading   252
4.7.1 Introduction   252
4.7.2 Change History   252
4.7.3 Requirements for Lazy Loading   252
4.7.3.1 CUDA Runtime Version Requirement   253
4.7.3.2 CUDA Driver Version Requirement   253
4.7.3.3 Compiler Requirements   253
4.7.3.4 Kernel Requirements   253
4.7.4 Usage   253
4.7.4.1 Enabling & Disabling   253
4.7.4.2 Checking if Lazy Loading is Enabled at Runtime   253
4.7.4.3 Forcing a Module to Load Eagerly at Runtime   254
4.7.5 Potential Hazards   254
4.7.5.1 Impact on Concurrent Kernel Execution   254
4.7.5.2 Large Memory Allocations   254
4.7.5.3 Impact on Performance Measurements   254
4.8 Error Log Management   254
4.8.1 Background   255
4.8.2 Activation   255
4.8.3 Output   255
4.8.4 API Description   255
4.8.5 Limitations and Known Issues   256
4.9 Asynchronous Barriers   256
4.9.1 Initialization   257
4.9.2 A Barrier's Phase: Arrival, Countdown, Completion, and Reset   258
4.9.2.1 Warp Entanglement   259
4.9.3 Explicit Phase Tracking   259
4.9.4 Early Exit   262
4.9.5 Completion Function  264
4.9.6 Tracking Asynchronous Memory Operations  267
4.9.7 Producer-Consumer Pattern Using Barriers  268
4.10 Pipelines  275
4.10.1 Initialization  275
4.10.2 Submitting Work  276
4.10.3 Consuming Work  276
4.10.4 Warp Entanglement  276
4.10.5 Early Exit  277
4.10.6 Tracking Asynchronous Memory Operations  278
4.10.7 Producer-Consumer Pattern using Pipelines  281
4.11 Asynchronous Data Copies  283
4.11.1 Using LDGSTS  283
4.11.1.1 Batching Loads in Conditional Code  284
4.11.1.2 Prefetching Data  289
4.11.1.3 Producer-Consumer Pattern Through Warp Specialization  295
4.11.2 Using the Tensor Memory Accelerator (TMA)  299
4.11.2.1 Using TMA to transfer one-dimensional arrays  300
4.11.2.2 Using TMA to transfer multi-dimensional arrays  307
4.11.3 Using STAS  324
4.12 Work Stealing with Cluster Launch Control  327
4.12.1 API Details  329
4.12.1.1 Thread Block Cancellation  329
4.12.1.2 Constraints on Thread Block Cancellation  330
4.12.2 Example: Vector-Scalar Multiplication  331
4.12.2.1 Use-case: Thread Blocks  331
4.12.2.2 Use-case: Thread Block Clusters  333
4.13 L2 Cache Control  335
4.13.1 L2 Cache Set-Aside for Persisting Accesses  335
4.13.2 L2 Policy for Persisting Accesses  336
4.13.3 L2 Access Properties  337
4.13.4 L2 Persistence Example  337
4.13.5 Reset L2 Access to Normal  339
4.13.6 Manage Utilization of L2 Set-Aside Cache  339
4.13.7 Query L2 Cache Properties  339
4.13.8 Control L2 Cache Set-Aside Size for Persisting Memory Access  339
4.14 Memory Synchronization Domains  340
4.14.1 Memory Fence Interference  340
4.14.2 Isolating Traffic with Domains  341
4.14.3 Using Domains in CUDA  341
4.15 Interprocess Communication  342
4.15.1 IPC using the Legacy Interprocess Communication API  343
4.15.2 IPC using the Virtual Memory Management API  344
4.16 Virtual Memory Management  344
4.16.1 Preliminaries  345
4.16.1.1 Definitions  345
4.16.1.2 Query for Support  346
4.16.2 API Overview  347
4.16.3 Unicast Memory Sharing  349
4.16.3.1 Allocate and Export  349
4.16.3.2 Share and Import  351
4.16.3.3 Reserve and Map  355
4.16.3.4 Access Rights  356
4.16.3.5 Releasing the Memory  356
4.16.4 Multicast Memory Sharing  357
4.16.4.1 Allocating Multicast Objects  357
4.16.4.2 Add Devices to Multicast Objects  358
4.16.4.3 Bind Memory to Multicast Objects  358
4.16.4.4 Use Multicast Mappings  358
4.16.5 Advanced Configuration  359
4.16.5.1 Memory Type  359
4.16.5.2 Compressible Memory  359
4.16.5.3 Virtual Aliasing Support  360
4.16.5.4 OS-Specific Handle Details for IPC  361
4.17 Extended GPU Memory  362
4.17.1 Preliminaries  362
4.17.1.1 EGM Platforms: System topology  362
4.17.1.2 Socket Identifiers: What are they? How to access them?  363
4.17.1.3 Allocators and EGM support  363
4.17.1.4 Memory management extensions to current APIs  363
4.17.2 Using the EGM Interface  364
4.17.2.1 Single-Node, Single-GPU  364
4.17.2.2 Single-Node, Multi-GPU  364
4.17.2.3 Multi-Node, Multi-GPU  366
4.18 CUDA Dynamic Parallelism  367
4.18.1 Introduction  367
4.18.1.1 Overview  367
4.18.2 Execution Environment  367
4.18.2.1 Parent and Child Grids  367
4.18.2.2 Scope of CUDA Primitives  368
4.18.2.3 Streams and Events  368
4.18.2.4 Ordering and Concurrency  369
4.18.3 Memory Coherence and Consistency  369
4.18.3.1 Global Memory  369
4.18.3.2 Mapped Memory  370
4.18.3.3 Shared and Local Memory  370
4.18.3.4 Local Memory  371
4.18.4 Programming Interface  371
4.18.4.1 Basics  371
4.18.4.2 C++ Language Interface for CDP  372
4.18.5 Programming Guidelines  374
4.18.5.1 Performance  374
4.18.5.2 Implementation Restrictions and Limitations  374
4.18.5.3 Compatibility and Interoperability  375
4.18.6 Device-side Launch from PTX  375
4.18.6.1 Kernel Launch APIs  375
4.18.6.2 Parameter Buffer Layout  377
4.19 CUDA Interoperability with APIs  377
4.19.1 Graphics Interoperability  377
4.19.1.1 OpenGL Interoperability  378
4.19.1.2 Direct3D Interoperability  381
4.19.1.3 Interoperability in a Scalable Link Interface (SLI) configuration  384
4.19.2 External resource interoperability  384
4.19.2.1 Vulkan interoperability  385
4.19.2.2 Direct3D Interoperability  397
4.19.2.3 NVIDIA Software Communication Interface Interoperability (NVSCI)  404
4.20 Driver Entry Point Access  410
4.20.1 Introduction  410
4.20.2 Driver Function Typedefs  410
4.20.3 Driver Function Retrieval  411
4.20.3.1 Using the Driver API  412
4.20.3.2 Using the Runtime API  413
4.20.3.3 Retrieve Per-thread Default Stream Versions  413
4.20.3.4 Access New CUDA Features  414
4.20.4 Potential Implications with cuGetProcAddress  414
4.20.4.1 Implications with cuGetProcAddress vs Implicit Linking  414
4.20.4.2 Compile Time vs Runtime Version Usage in cuGetProcAddress  415
4.20.4.3 API Version Bumps with Explicit Version Checks  416
4.20.4.4 Issues with Runtime API Usage  417
4.20.4.5 Issues with Runtime API and Dynamic Versioning  418
4.20.4.6 Issues with Runtime API allowing CUDA Version  419
4.20.4.7 Implications to API/ABI  419
4.20.5 Determining cuGetProcAddress Failure Reasons  420
5 Technical Appendices  423
5.1 Compute Capabilities  423
5.1.1 Obtain the GPU Compute Capability  423
5.1.2 Feature Availability  424
5.1.2.1 Architecture-Specific Features  424
5.1.2.2 Family-Specific Features  424
5.1.2.3 Feature Set Compiler Targets  424
5.1.3 Features and Technical Specifications  425
5.2 CUDA Environment Variables  431
5.2.1 Device Enumeration and Properties  431
5.2.1.1 CUDA_VISIBLE_DEVICES  431
5.2.1.2 CUDA_DEVICE_ORDER  432
5.2.1.3 CUDA_MANAGED_FORCE_DEVICE_ALLOC  432
5.2.2 JIT Compilation  432
5.2.2.1 CUDA_CACHE_DISABLE  432
5.2.2.2 CUDA_CACHE_PATH  433
5.2.2.3 CUDA_CACHE_MAXSIZE  433
5.2.2.4 CUDA_FORCE_PTX_JIT and CUDA_FORCE_JIT  433
5.2.2.5 CUDA_DISABLE_PTX_JIT and CUDA_DISABLE_JIT  434
5.2.2.6 CUDA_FORCE_PRELOAD_LIBRARIES  434
5.2.3 Execution  434
5.2.3.1 CUDA_LAUNCH_BLOCKING  434
5.2.3.2 CUDA_DEVICE_MAX_CONNECTIONS  435
5.2.3.3 CUDA_DEVICE_MAX_COPY_CONNECTIONS  435
5.2.3.4 CUDA_SCALE_LAUNCH_QUEUES  435
5.2.3.5 CUDA_GRAPHS_USE_NODE_PRIORITY  436
5.2.3.6 CUDA_DEVICE_WAITS_ON_EXCEPTION  436
5.2.3.7 CUDA_DEVICE_DEFAULT_PERSISTING_L2_CACHE_PERCENTAGE_LIMIT  436
5.2.3.8 CUDA_AUTO_BOOST [[deprecated]]  437
5.2.4 Module Loading  437
5.2.4.1 CUDA_MODULE_LOADING  437
5.2.4.2 CUDA_MODULE_DATA_LOADING  438
5.2.4.3 CUDA_BINARY_LOADER_THREAD_COUNT  438
5.2.5 CUDA Error Log Management  438
5.2.5.1 CUDA_LOG_FILE  438
5.3 C++ Language Support  439
5.3.1 C++11 Language Features  439
5.3.2 C++14 Language Features  442
5.3.3 C++17 Language Features  442
5.3.4 C++20 Language Features  444
5.3.5 CUDA C++ Standard Library  446
5.3.6 C Standard Library Functions  447
5.3.6.1 clock() and clock64()  447
5.3.6.2 printf()  447
5.3.6.3 memcpy() and memset()  450
5.3.6.4 malloc() and free()  450
5.3.6.5 alloca()  454
5.3.7 Lambda Expressions  454
5.3.7.1 Lambda Expressions and __global__ Function Parameters  455
5.3.7.2 Extended Lambdas  455
5.3.7.3 Extended Lambda Type Traits  456
5.3.7.4 Extended Lambda Restrictions  458
5.3.7.5 Host-Device Lambda Optimization Notes  468
5.3.7.6 *this Capture By-Value  468
5.3.7.7 Argument Dependent Lookup (ADL)  470
5.3.8 Polymorphic Function Wrappers  471
5.3.9 C/C++ Language Restrictions  474
5.3.9.1 Unsupported Features  474
5.3.9.2 Namespace Reservations  474
5.3.9.3 Pointers and Memory Addresses  475
5.3.9.4 Variables  475
5.3.9.5 Functions  481
5.3.9.6 Classes  483
5.3.9.7 Templates  487
5.3.10 C++11 Restrictions  488
5.3.10.1 inline Namespaces  488
5.3.10.2 inline Unnamed Namespaces  488
5.3.10.3 constexpr Functions  488
5.3.10.4 constexpr Variables  491
5.3.10.5 __global__ Variadic Template  492
5.3.10.6 Defaulted Functions = default  492
5.3.10.7 [cuda::]std::initializer_list  493
5.3.10.8 [cuda::]std::move, [cuda::]std::forward  494
5.3.11 C++14 Restrictions  494
5.3.11.1 Functions with Deduced Return Type  494
5.3.11.2 Variable Templates  495
5.3.12 C++17 Restrictions  496
5.3.12.1 inline Variables  496
5.3.12.2 Structured Binding  496
5.3.13 C++20 Restrictions  497
5.3.13.1 Three-way Comparison Operator  497
5.3.13.2 consteval Functions  497
5.4 C/C++ Language Extensions  498
5.4.1 Function and Variable Annotations  498
5.4.1.1 Execution Space Specifiers  498
5.4.1.2 Memory Space Specifiers  498
5.4.1.3 Inlining Specifiers  502
5.4.1.4 __restrict__ Pointers  502
5.4.1.5 __grid_constant__ Parameters  503
5.4.1.6 Annotation Summary  504
5.4.2 Built-in Types and Variables  505
5.4.2.1 Host Compiler Type Extensions  505
5.4.2.2 Built-in Variables  505
5.4.2.3 Built-in Types  506
5.4.3 Kernel Configuration  508
5.4.3.1 Thread Block Cluster  508
5.4.3.2 Launch Bounds  509
5.4.3.3 Maximum Number of Registers per Thread  511
5.4.4 Synchronization Primitives  512
5.4.4.1 Thread Block Synchronization Functions  512
5.4.4.2 Warp Synchronization Function  514
5.4.4.3 Memory Fence Functions  514
5.4.5 Atomic Functions  519
5.4.5.1 Legacy Atomic Functions  520
5.4.5.2 Built-in Atomic Functions  525
5.4.6 Warp Functions  531
5.4.6.1 Warp Active Mask  531
5.4.6.2 Warp Vote Functions  532
5.4.6.3 Warp Match Functions  532
5.4.6.4 Warp Reduce Functions  533
5.4.6.5 Warp Shuffle Functions  534
5.4.6.6 Warp __sync Intrinsic Constraints  539
5.4.7 CUDA-Specific Macros  541
5.4.7.1 __CUDA_ARCH__  541
5.4.7.2 __CUDA_ARCH_SPECIFIC__ and __CUDA_ARCH_FAMILY_SPECIFIC__  543
5.4.7.3 CUDA Feature Testing Macros  544
5.4.7.4 __nv_pure__ Attribute  544
5.4.8 CUDA-Specific Functions  544
5.4.8.1 Address Space Predicate Functions  544
5.4.8.2 Address Space Conversion Functions  545
5.4.8.3 Low-Level Load and Store Functions  546
5.4.8.4 __trap()  546
5.4.8.5 __nanosleep()  547
5.4.8.6 Dynamic Programming eXtension (DPX) Instructions  547
5.4.9 Compiler Optimization Hints  549
5.4.9.1 #pragma unroll  550
5.4.9.2 __builtin_assume_aligned()  551
5.4.9.3 __builtin_assume() and __assume()  551
5.4.9.4 __builtin_expect()  551
5.4.9.5 __builtin_unreachable()  552
5.4.9.6 Custom ABI Pragmas  552
5.4.10 Debugging and Diagnostics  554
5.4.10.1 Assertion  554
5.4.10.2 Breakpoint Function  555
5.4.10.3 Diagnostic Pragmas  555
5.4.11 Warp Matrix Functions  556
5.4.11.1 Description  556
5.4.11.2 Alternate Floating Point  558
5.4.11.3 Double Precision  558
5.4.11.4 Sub-byte Operations  559
5.4.11.5 Restrictions  560
5.4.11.6 Element Types and Matrix Sizes  560
5.4.11.7 Example  562
5.5 Floating-Point Computation  562
5.5.1 Floating-Point Introduction  562
5.5.1.1 Floating-Point Format  562
5.5.1.2 Normal and Subnormal Values . . . 564
5.5.1.3 Special Values . . . 564
5.5.1.4 Associativity . . . 565
5.5.1.5 Fused Multiply-Add (FMA) . . . 566
5.5.1.6 Dot Product Example . . . 568
5.5.1.7 Rounding . . . 568
5.5.1.8 Notes on Host/Device Computation Accuracy . . . 569
5.5.2 Floating-Point Data Types . . . 570
5.5.3 CUDA and IEEE-754 Compliance . . . 572
5.5.4 CUDA and C/C++ Compliance . . . 573
5.5.5 Floating-Point Functionality Exposure . . . 574
5.5.6 Built-In Arithmetic Operators . . . 577
5.5.7 CUDA C++ Mathematical Standard Library Functions . . . 578
5.5.7.1 Basic Operations . . . 578
5.5.7.2 Exponential Functions . . . 579
5.5.7.3 Power Functions . . . 580
5.5.7.4 Trigonometric Functions . . . 581
5.5.7.5 Hyperbolic Functions . . . 581
5.5.7.6 Error and Gamma Functions . . . 582
5.5.7.7 Nearest Integer Floating-Point Operations . . . 583
5.5.7.8 Floating-Point Manipulation Functions . . . 583
5.5.7.9 Classification and Comparison . . . 584
5.5.8 Non-Standard CUDA Mathematical Functions . . . 585
5.5.9 Intrinsic Functions . . . 587
5.5.9.1 Basic Intrinsic Functions . . . 587
5.5.9.2 Single-Precision-Only Intrinsic Functions . . . 588
5.5.9.3 --use_fast_math Effect . . . 589
5.5.10 References . . . 589
5.6 Device-Callable APIs and Intrinsics . . . 590
5.6.1 Memory Barrier Primitives Interface . . . 590
5.6.1.1 Data Types . . . 590
5.6.1.2 Memory Barrier Primitives API . . . 590
5.6.2 Pipeline Primitives Interface . . . 591
5.6.2.1 memcpy_async Primitive . . . 592
5.6.2.2 Commit Primitive . . . 592
5.6.2.3 Wait Primitive . . . 592
5.6.2.4 Arrive On Barrier Primitive . . . 592
5.6.3 Cooperative Groups API . . . 593
5.6.3.1 cooperative_groups.h . . . 593
5.6.3.2 cooperative_groups/async.h . . . 599
5.6.3.3 cooperative_groups/partition.h . . . 602
5.6.3.4 cooperative_groups/reduce.h . . . 603
5.6.3.5 cooperative_groups/scan.h . . . 606
5.6.3.6 cooperative_groups/sync.h . . . 609
5.6.4 CUDA Device Runtime . . . 611
5.6.4.1 Including Device Runtime API in CUDA Code . . . 612
5.6.4.2 Memory in the CUDA Device Runtime . . . 612
5.6.4.3 SM Id and Warp Id . . . 614
5.6.4.4 Launch Setup APIs . . . 614
5.6.4.5 Device Management . . . 615
5.6.4.6 API Reference . . . 615
5.6.4.7 API Errors and Launch Failures . . . 617
5.6.4.8 Device Runtime Streams . . . 617
5.6.4.9 ECC Errors . . . 619
6 Notices 621
6.1 Notice . . . 621
6.2 OpenCL . . . 622
6.3 Trademarks . . . 622
CUDA Programming Guide, Release 13.1
CUDA and the CUDA Programming Guide

CUDA is a parallel computing platform and programming model developed by NVIDIA that enables dramatic increases in computing performance by harnessing the power of the GPU. It allows developers to accelerate compute-intensive applications and is widely used in fields such as deep learning, scientific computing, and high-performance computing (HPC).

This CUDA Programming Guide is the official, comprehensive resource on the CUDA programming model and how to write code that executes on the GPU using the CUDA platform. It covers everything from the CUDA programming model and the CUDA platform to the details of language extensions, and explains how to make use of specific hardware and software features. The guide provides a pathway for developers who are new to CUDA to learn it, and serves as an essential resource for developers as they build applications using CUDA.
Organization of This Guide

Even for developers who primarily use libraries, frameworks, or DSLs, an understanding of the CUDA programming model and how GPUs execute code is valuable for knowing what is happening behind the layers of abstraction. This guide therefore starts with a chapter on the CUDA programming model outside of any specific programming language, which is applicable to anyone interested in understanding how CUDA works, even non-developers.

The guide is broken down into five primary parts:

▶ Part 1: Introduction and Programming Model Abstract
  ▶ A language-agnostic overview of the CUDA programming model as well as a brief tour of the CUDA platform.
  ▶ This section is meant to be read by anyone wanting to understand GPUs and the concepts of executing code on GPUs, even if they are not developers.
▶ Part 2: Programming GPUs in CUDA
  ▶ The basics of programming GPUs using CUDA C++.
  ▶ This section is meant to be read by anyone wanting to get started in GPU programming.
  ▶ This section is meant to be instructional, not complete, and teaches the most important and common parts of CUDA programming, including some common performance considerations.
▶ Part 3: Advanced CUDA
  ▶ Introduces some more advanced features of CUDA that enable both fine-grained control and more opportunities to maximize performance, including the use of multiple GPUs in a single application.
  ▶ This section concludes with a tour of the features covered in Part 4, with a brief introduction to the purpose and function of each, sorted by when and why a developer may find each feature useful.
▶ Part 4: CUDA Features
  ▶ This section contains complete coverage of specific CUDA features such as CUDA graphs, dynamic parallelism, interoperability with graphics APIs, and unified memory.
  ▶ This section should be consulted when the complete picture of a specific CUDA feature is needed. Where possible, care has been taken to introduce and motivate the features covered in this section in earlier sections.
▶ Part 5: Technical Appendices
  ▶ The technical appendices provide reference documentation on CUDA's C++ high-level language support, hardware-specific specifications, and other technical specifications.
  ▶ This section is meant as a technical reference for specific descriptions of the syntax, semantics, and technical behavior of elements of CUDA.

Parts 1-3 provide a guided learning experience for developers new to CUDA, though they also provide insight and updated information useful for CUDA developers of any experience level.

Parts 4 and 5 provide a wealth of information about specific features and detailed topics, and are intended to provide a curated, well-organized reference for developers needing to know more details as they write CUDA applications.
Chapter 1. Introduction to CUDA

1.1. Introduction

1.1.1. The Graphics Processing Unit

Born as a special-purpose processor for 3D graphics, the Graphics Processing Unit (GPU) started out as fixed-function hardware to accelerate parallel operations in real-time 3D rendering. Over successive generations, GPUs became more programmable. By 2003, some stages of the graphics pipeline had become fully programmable, running custom code in parallel for each component of a 3D scene or an image.

In 2006, NVIDIA introduced the Compute Unified Device Architecture (CUDA) to enable any computational workload to use the throughput capability of GPUs independent of graphics APIs.

Since then, CUDA and GPU computing have been used to accelerate computational workloads of nearly every type, from scientific simulations such as fluid dynamics or energy transport to business applications like databases and analytics. Moreover, the capability and programmability of GPUs have been foundational to the advancement of new algorithms and technologies ranging from image classification to generative artificial intelligence such as diffusion or large language models.
1.1.2. The Benefits of Using GPUs

A GPU provides much higher instruction throughput and memory bandwidth than a CPU within a similar price and power envelope. Many applications leverage these capabilities to run significantly faster on the GPU than on the CPU (see GPU Applications). Other computing devices, like FPGAs, are also very energy efficient, but offer much less programming flexibility than GPUs.

GPUs and CPUs are designed with different goals in mind. While a CPU is designed to excel at executing a serial sequence of operations (called a thread) as fast as possible and can execute a few tens of these threads in parallel, a GPU is designed to excel at executing thousands of threads in parallel, trading off lower single-thread performance to achieve much greater total throughput.

GPUs are specialized for highly parallel computations and devote more transistors to data processing units, while CPUs dedicate more transistors to data caching and flow control. Figure 1 shows an example distribution of chip resources for a CPU versus a GPU.
Figure 1: The GPU Devotes More Transistors to Data Processing
1.1.3. Getting Started Quickly

There are many ways to leverage the compute power provided by GPUs. This guide covers programming for the CUDA GPU platform in high-level languages such as C++. However, there are many ways to utilize GPUs in applications that do not require directly writing GPU code.

An ever-growing collection of algorithms and routines from a variety of domains is available through specialized libraries. When a library has already been implemented, especially one provided by NVIDIA, using it is often more productive and performant than reimplementing algorithms from scratch. Libraries like cuBLAS, cuFFT, cuDNN, and CUTLASS are just a few examples of libraries that help developers avoid reimplementing well-established algorithms. These libraries have the added benefit of being optimized for each GPU architecture, providing an ideal mix of productivity, performance, and portability.

There are also frameworks, particularly those used for artificial intelligence, that provide GPU-accelerated building blocks. Many of these frameworks achieve their acceleration by leveraging the GPU-accelerated libraries mentioned above.

Additionally, domain-specific languages (DSLs) such as NVIDIA's Warp or OpenAI's Triton compile to run directly on the CUDA platform. This provides an even higher-level method of programming GPUs than the high-level languages covered in this guide.

The NVIDIA Accelerated Computing Hub contains resources, examples, and tutorials to teach GPU and CUDA computing.
1.2. Programming Model

This chapter introduces the CUDA programming model at a high level and separate from any language. The terminology and concepts introduced here apply to CUDA in any supported programming language. Later chapters will illustrate these concepts in C++.
1.2.1. Heterogeneous Systems

The CUDA programming model assumes a heterogeneous computing system, which means a system that includes both GPUs and CPUs. The CPU and the memory directly connected to it are called the host and host memory, respectively. A GPU and the memory directly connected to it are referred to as the device and device memory, respectively. In some system-on-chip (SoC) systems, these may be part of a single package. In larger systems, there may be multiple CPUs or GPUs.

CUDA applications execute some part of their code on the GPU, but applications always start execution on the CPU. The host code, which is the code that runs on the CPU, can use CUDA APIs to copy data between the host memory and device memory, start code executing on the GPU, and wait for data copies or GPU code to complete. The CPU and GPU can both be executing code simultaneously, and best performance is usually found by maximizing utilization of both CPUs and GPUs.

The code an application executes on the GPU is referred to as device code, and a function that is invoked for execution on the GPU is, for historical reasons, called a kernel. The act of starting a kernel running is called launching the kernel. A kernel launch can be thought of as starting many threads executing the kernel code in parallel on the GPU. GPU threads operate similarly to threads on CPUs, though there are some differences important to both correctness and performance that will be covered in later sections (see Section 3.2.2.1.1).
1.2.2. GPU Hardware Model

Like any programming model, CUDA relies on a conceptual model of the underlying hardware. For the purposes of CUDA programming, the GPU can be considered to be a collection of Streaming Multiprocessors (SMs) which are organized into groups called Graphics Processing Clusters (GPCs). Each SM contains a local register file, a unified data cache, and a number of functional units that perform computations. The unified data cache provides the physical resources for shared memory and L1 cache. The allocation of the unified data cache to L1 and shared memory can be configured at runtime. The sizes of different types of memory and the number of functional units within an SM can vary across GPU architectures.

Note

The actual hardware layout of a GPU or the way it physically carries out the execution of the programming model may vary. These differences do not affect the correctness of software written using the CUDA programming model.
1.2.2.1 Thread Blocks and Grids

When an application launches a kernel, it does so with many threads, often millions of threads. These threads are organized into blocks. A block of threads is referred to, perhaps unsurprisingly, as a thread block. Thread blocks are organized into a grid. All the thread blocks in a grid have the same size and dimensions. Figure 3 shows an illustration of a grid of thread blocks.

Thread blocks and grids may be 1, 2, or 3 dimensional. These dimensions can simplify the mapping of individual threads to units of work or data items.

When a kernel is launched, it is launched using a specific execution configuration which specifies the grid and thread block dimensions. The execution configuration may also include optional parameters such as cluster size, stream, and SM configuration settings, which will be introduced in later sections.

Using built-in variables, each thread executing the kernel can determine its location within its containing block and the location of its block within the containing grid. A thread can also use these
Figure 2: A GPU has many streaming multiprocessors (SMs), each of which contains many functional units. Graphics processing clusters (GPCs) are collections of SMs. A GPU is a set of GPCs connected to the GPU memory. A CPU typically has several cores and a memory controller which connects to the system memory. A CPU and a GPU are connected by an interconnect such as PCIe or NVLINK.
Figure 3: Grid of Thread Blocks. Each arrow represents a thread (the number of arrows is not representative of the actual number of threads).
built-in variables to determine the dimensions of the thread blocks and the grid on which the kernel was launched. This gives each thread a unique identity among all the threads running the kernel. This identity is frequently used to determine what data or operations a thread is responsible for.
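As an illustrative sketch (kernel and helper names are hypothetical), a 2D execution configuration can map each thread to one element of a 2D array using the built-in variables `threadIdx`, `blockIdx`, `blockDim`, and `gridDim`:

```cuda
#include <cuda_runtime.h>

// Each thread computes a unique (x, y) position from the built-in variables
// and uses it to index one element of a row-major 2D array.
__global__ void scale2D(float *data, int width, int height, float factor) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)                  // guard against partial blocks
        data[y * width + x] *= factor;
}

void launchScale2D(float *d_data, int width, int height) {
    dim3 block(16, 16);                           // 256 threads per thread block
    dim3 grid((width  + block.x - 1) / block.x,   // round up so the grid
              (height + block.y - 1) / block.y);  // covers the whole 2D domain
    scale2D<<<grid, block>>>(d_data, width, height, 2.0f);
}
```

The guard against out-of-range indices is needed because the rounded-up grid may contain more threads than data elements.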
All threads of a thread block are executed in a single SM. This allows threads within a thread block to communicate and synchronize with each other efficiently. Threads within a thread block all have access to the on-chip shared memory, which can be used for exchanging information between threads of a thread block.

A grid may consist of millions of thread blocks, while the GPU executing the grid may have only tens or hundreds of SMs. All threads of a thread block are executed by a single SM and, in most cases¹, run to completion on that SM. There is no guarantee of scheduling between thread blocks, so a thread block cannot rely on results from other thread blocks, as they may not be able to be scheduled until that thread block has completed. Figure 4 shows an example of how thread blocks from a grid are assigned to an SM.

The CUDA programming model enables arbitrarily large grids to run on GPUs of any size, whether a GPU has only one SM or thousands of SMs. To achieve this, the CUDA programming model, with some exceptions, requires that there be no data dependencies between threads in different thread blocks. That is, a thread should not depend on results from or synchronize with a thread in a different thread block of the same grid. All the threads within a thread block run on the same SM at the same time. Different thread blocks within the grid are scheduled among the available SMs and may be executed in any order. In short, the CUDA programming model requires that it be possible to execute thread blocks in any order, in parallel or in series.
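One widely used pattern that respects this any-order requirement is the grid-stride loop, sketched below (the kernel name is illustrative). Because no thread reads results produced by another thread block, the kernel is correct for any grid size, on any number of SMs:

```cuda
#include <cuda_runtime.h>

// Grid-stride loop: each thread processes elements i, i + stride, i + 2*stride, ...
// No thread depends on another thread block's results, so the blocks of the grid
// may be executed in any order, in parallel or in series.
__global__ void addArrays(const float *a, const float *b, float *c, int n) {
    int stride = gridDim.x * blockDim.x;   // total threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        c[i] = a[i] + b[i];
}
```

The same kernel works whether it is launched with one block or thousands; the loop simply covers more elements per thread when the grid is small.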
1.2.2.1.1 Thread Block Clusters

In addition to thread blocks, GPUs with compute capability 9.0 and higher have an optional level of grouping called clusters. Clusters are groups of thread blocks which, like thread blocks and grids, can be laid out in 1, 2, or 3 dimensions. Figure 5 illustrates a grid of thread blocks that is also organized into clusters. Specifying clusters does not change the grid dimensions or the indices of a thread block within a grid.

Specifying clusters groups adjacent thread blocks into clusters and provides some additional opportunities for synchronization and communication at the cluster level. Specifically, all thread blocks in a cluster are executed in a single GPC. Figure 6 shows how thread blocks are scheduled to SMs in a GPC when clusters are specified. Because the thread blocks are scheduled simultaneously and within a single GPC, threads in different blocks but within the same cluster can communicate and synchronize with each other using software interfaces provided by Cooperative Groups. Threads in clusters can access the shared memory of all blocks in the cluster, which is referred to as distributed shared memory. The maximum size of a cluster is hardware dependent and varies between devices.

Figure 6 illustrates how thread blocks within a cluster are scheduled simultaneously on SMs within a GPC. Thread blocks within a cluster are always adjacent to each other within the grid.
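As a brief sketch, one way to request clusters is the compile-time `__cluster_dims__` attribute (clusters can also be requested per launch through the extended launch APIs, covered later; the kernel name here is hypothetical):

```cuda
#include <cuda_runtime.h>

// Requests clusters of 2 x 1 x 1 thread blocks at compile time. This requires
// compute capability 9.0 or higher. The grid dimensions and block indices are
// unchanged; adjacent thread blocks are simply grouped into clusters of two.
__global__ void __cluster_dims__(2, 1, 1) clusterKernel(float *data) {
    // Threads here may use the Cooperative Groups cluster interfaces to
    // synchronize with, and access the distributed shared memory of, the
    // other thread block in their cluster.
}
```

With this attribute, the grid size given at launch must be divisible by the cluster dimensions.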
1.2.2.2 Warps and SIMT

Within a thread block, threads are organized into groups of 32 threads called warps. A warp executes the kernel code in a Single-Instruction Multiple-Threads (SIMT) paradigm. In SIMT, all threads in the warp execute the same kernel code, but each thread may follow different branches through the code. That is, though all threads of the program execute the same code, threads do not need to follow the same execution path.

¹ In certain situations when using features such as CUDA Dynamic Parallelism, a thread block may be suspended to memory. This means the state of the SM is stored to a system-managed area of GPU memory and the SM is freed to execute other thread blocks. This is similar to context swapping on CPUs. This is not common.
Figure 4: Each SM has one or more active thread blocks. In this example, each SM has three thread blocks scheduled simultaneously. There are no guarantees about the order in which thread blocks from a grid are assigned to SMs.
Figure 5: When clusters are specified, thread blocks are in the same location in the grid but also have a position within the containing cluster.
When threads are executed by a warp, they are assigned a warp lane. Warp lanes are numbered 0 to 31, and threads from a thread block are assigned to warps in a predictable fashion detailed in Hardware Multithreading.

All threads in the warp execute the same instruction simultaneously. If some threads within a warp follow a control flow branch in execution while others do not, the threads which do not follow the branch will be masked off while the threads which follow the branch are executed. For example, if a conditional is only true for half the threads in a warp, the other half of the warp would be masked off while the active threads execute those instructions. This situation is illustrated in Figure 7. When different threads in a warp follow different code paths, this is sometimes called warp divergence. It follows that utilization of the GPU is maximized when threads within a warp follow the same control flow path.

In the SIMT model, all threads in a warp progress through the kernel in lock step. Hardware execution may differ. See the sections on Independent Thread Execution for more information on where this distinction is important. Exploiting knowledge of how warp execution is actually mapped to real hardware is discouraged. The CUDA programming model and SIMT say that all threads in a warp progress through the code together. Hardware may optimize masked lanes in ways that are transparent to the program so long as the programming model is followed. If the program violates this model, the result is undefined behavior that can differ across GPU hardware.

While it is not necessary to consider warps when writing CUDA code, understanding the warp execution model is helpful in understanding concepts such as global memory coalescing and shared memory bank access patterns. Some advanced programming techniques use specialization of warps within a thread block to limit thread divergence and maximize utilization. This and other optimizations make use of the knowledge that threads are grouped into warps when executing.

One implication of warp execution is that thread blocks are best specified to have a total number of threads which is a multiple of 32. It is legal to use any number of threads, but when the total is not a multiple of 32, the last warp of the thread block will have some lanes that are unused throughout execution. This will likely lead to suboptimal functional unit utilization and memory access for that warp.
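The contrast between a divergent branch and a warp-uniform branch can be sketched as follows (the kernels and helper functions are hypothetical, for illustration only):

```cuda
#include <cuda_runtime.h>

__device__ int workA(int i) { return i * 2; }   // stand-ins for two different
__device__ int workB(int i) { return i * 3; }   // code paths

// Divergent: lanes of the same warp take different branches. While the even
// lanes execute workA, the odd lanes are masked off, and vice versa, so the
// two paths execute one after the other within each warp.
__global__ void divergent(int *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = (i % 2 == 0) ? workA(i) : workB(i);
}

// Uniform: branching on the warp index keeps all 32 lanes of each warp on
// the same path, so no lanes are masked off.
__global__ void uniform(int *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = i / 32;                  // identical for every lane of a warp
    out[i] = (warp % 2 == 0) ? workA(i) : workB(i);
}
```

Both kernels are correct; the second simply avoids the serialized execution of the two branch bodies within each warp.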
SIMT is often compared to Single Instruction Multiple Data (SIMD) parallelism, but there are some important differences. In SIMD, execution follows a single control flow path, while in SIMT, each thread is allowed to follow its own control flow path. Because of this, SIMT does not have a fixed data width like SIMD. A more detailed discussion of SIMT can be found in SIMT Execution Model.
Figure 6: When clusters are specified, the thread blocks in a cluster are arranged in their cluster shape within the grid. The thread blocks of a cluster are scheduled simultaneously on the SMs of a single GPC.
Figure 7: In this example, only threads with an even thread index execute the body of the if statement; the others are masked off while the body is executed.
1.2.3. GPU Memory

In modern computing systems, efficiently utilizing memory is just as important as maximizing the use of the functional units performing computations. Heterogeneous systems have multiple memory spaces, and GPUs contain various types of programmable on-chip memory in addition to caches. The following sections introduce these memory spaces in more detail.
1.2.3.1 DRAM Memory in Heterogeneous Systems

GPUs and CPUs both have directly attached DRAM chips. In systems with more than one GPU, each GPU has its own memory. From the perspective of device code, the DRAM attached to the GPU is called global memory, because it is accessible to all SMs in the GPU. This terminology does not mean it is necessarily accessible everywhere within the system. The DRAM attached to the CPU(s) is called system memory or host memory.

Like CPUs, GPUs use virtual memory addressing. On all currently-supported systems, the CPU and GPU use a single unified virtual memory space. This means that the virtual memory address range for each GPU in the system is unique and distinct from the CPU and every other GPU in the system. For a given virtual memory address, it is possible to determine whether that address is in GPU memory or system memory and, on systems with multiple GPUs, which GPU memory contains that address.

There are CUDA APIs to allocate GPU memory and CPU memory, and to copy between allocations on the CPU and GPU, within a GPU, or between GPUs in multi-GPU systems. The locality of data can be explicitly controlled when desired. Unified Memory, discussed below, allows the placement of memory to be handled automatically by the CUDA runtime or system hardware.
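Explicitly controlled data movement can be sketched as follows (the function name is illustrative): allocate global memory, copy the input from host memory, and copy the result back once the GPU is done with it.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Explicitly staged data movement between host memory and GPU global memory.
void roundTrip(std::vector<float> &host) {
    size_t bytes = host.size() * sizeof(float);
    float *d_data;
    cudaMalloc(&d_data, bytes);                                   // device allocation
    cudaMemcpy(d_data, host.data(), bytes, cudaMemcpyHostToDevice);
    // ... launch kernels that read and write d_data ...
    cudaMemcpy(host.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
}
```

`cudaMemcpy` as used here is synchronous with respect to the host, so no extra synchronization is needed before reading the result.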
1.2.3.2 On-Chip Memory in GPUs

In addition to the global memory, each GPU has some on-chip memory. Each SM has its own register file and shared memory. These memories are part of the SM and can be accessed extremely quickly from threads executing within the SM, but they are not accessible to threads running in other SMs.

The register file stores thread-local variables, which are usually allocated by the compiler. The shared memory is accessible by all threads within a thread block or cluster. Shared memory can be used for exchanging data between threads of a thread block or cluster.
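A small sketch of this exchange (the kernel name is illustrative): each thread of a block stores one element into shared memory, the block synchronizes, and each thread then reads an element written by a different thread.

```cuda
#include <cuda_runtime.h>

// Threads of a block exchange data through on-chip shared memory, reversing
// the block's segment of the array. Assumes blockDim.x <= 256.
__global__ void reverseWithinBlock(float *data) {
    __shared__ float tile[256];          // one element per thread of the block
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    tile[t] = data[base + t];
    __syncthreads();                     // make all writes visible to the block
    data[base + t] = tile[blockDim.x - 1 - t];
}
```

The `__syncthreads()` barrier is essential: without it, a thread could read a shared memory slot before the owning thread has written it.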
The register file and unified data cache in an SM have finite sizes. The size of an SM's register file and unified data cache, and how the unified data cache can be configured for L1 and shared memory
balance can be found in Memory Information per Compute Capability. The register file, shared memory space, and L1 cache are shared among all threads in a thread block.

To schedule a thread block to an SM, the total number of registers needed for each thread multiplied by the number of threads in the thread block must be less than or equal to the available registers in the SM. If the number of registers required for a thread block exceeds the size of the register file, the kernel is not launchable, and the number of threads in the thread block must be decreased to make the thread block launchable.

Shared memory allocations are done at the thread block level. That is, unlike register allocations, which are per thread, allocations of shared memory are common to the entire thread block.
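A per-block shared memory allocation can be sized at launch time using a dynamic (extern) declaration and the third execution-configuration parameter; the sketch below uses hypothetical names:

```cuda
#include <cuda_runtime.h>

// A dynamic shared memory allocation: declared extern in the kernel, its
// per-block size is given by the third execution-configuration parameter.
__global__ void usesDynamicShared(float *data) {
    extern __shared__ float buf[];       // size chosen at launch time
    buf[threadIdx.x] = data[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();
    // ... every thread in the block can now read any element of buf ...
}

void launchIt(float *d_data, int blocks, int threads) {
    size_t sharedBytes = threads * sizeof(float);   // one float per thread, per block
    usesDynamicShared<<<blocks, threads, sharedBytes>>>(d_data);
}
```

Statically sized shared memory (`__shared__ float tile[256];` inside the kernel) serves the same purpose when the size is known at compile time.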
1.2.3.2.1 Caches

In addition to programmable memories, GPUs have both L1 and L2 caches. Each SM has an L1 cache which is part of the unified data cache. A larger L2 cache is shared by all SMs within a GPU. This can be seen in the GPU block diagram in Figure 2. Each SM also has a separate constant cache, which is used to cache values in global memory that have been declared to be constant over the life of a kernel. The compiler may place kernel parameters into constant memory as well. This can improve kernel performance by allowing kernel parameters to be cached in the SM separately from the L1 data cache.
1.2.3.3 Unified Memory

When an application allocates memory explicitly on the GPU or CPU, that memory is only accessible to code running on that device. That is, CPU memory can only be accessed from CPU code, and GPU memory can only be accessed from kernels running on the GPU². CUDA APIs for copying memory between the CPU and GPU are used to explicitly copy data to the correct memory at the right time.

A CUDA feature called unified memory allows applications to make memory allocations which can be accessed from the CPU or GPU. The CUDA runtime or underlying hardware enables access or relocates the data to the correct place when needed. Even with unified memory, optimal performance is attained by keeping the migration of memory to a minimum and accessing data from the processor directly attached to the memory where it resides as much as possible.

The hardware features of the system determine how access and exchange of data between memory spaces is achieved. Section Unified Memory introduces the different categories of unified memory systems. Section Unified Memory contains many more details about the use and behavior of unified memory in all situations.
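A minimal sketch of a unified memory allocation (the kernel name is illustrative): the same pointer is written by host code and by a kernel, with no explicit copies.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void doubleAll(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1024;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));   // accessible from CPU and GPU
    for (int i = 0; i < n; ++i) x[i] = 1.0f;    // initialized directly by host code
    doubleAll<<<4, 256>>>(x, n);
    cudaDeviceSynchronize();                    // wait before touching x on the host
    printf("x[0] = %f\n", x[0]);
    cudaFree(x);
    return 0;
}
```

Note the synchronization before the host reads `x` again: the kernel launch is asynchronous, and the access pattern still determines where the data migrates.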
1.3. The CUDA Platform

The NVIDIA CUDA platform consists of many pieces of software and hardware and many important technologies developed to enable computing on heterogeneous systems. This chapter serves to introduce some of the fundamental concepts and components of the CUDA platform that are important for application developers to understand. This chapter, like Programming Model, is not specific to any programming language, but applies to everything that uses the CUDA platform.
² An exception to this is mapped memory, which is CPU memory allocated with properties that enable it to be directly accessed from the GPU. However, mapped access occurs over the PCIe or NVLINK connection. The GPU is unable to hide the higher latency and lower bandwidth behind parallelism, so mapped memory is not a performant replacement for unified memory or placing data in the appropriate memory space.
1.3.1. Compute Capability and Streaming Multiprocessor Versions

Every NVIDIA GPU has a Compute Capability (CC) number, which indicates what features are supported by that GPU and specifies some hardware parameters for that GPU. These specifications are documented in the Section 5.1 appendix. A list of all NVIDIA GPUs and their compute capabilities is maintained on the CUDA GPU Compute Capability page.

Compute capability is denoted as a major and minor version number in the format X.Y, where X is the major version number and Y is the minor version number. For example, CC 12.0 has a major version of 12 and a minor version of 0. The compute capability directly corresponds to the version number of the SM. For example, the SMs within a GPU of CC 12.0 have SM version sm_120. This version is used to label binaries.

Section 5.1.1 shows how to query and determine the compute capability of the GPU(s) in a system.
1.3.2. CUDA Toolkit and NVIDIA Driver
The NVIDIA Driver can be thought of as the operating system of the GPU. The NVIDIA Driver is a software component which must be installed on the host system's operating system and is necessary for all GPU uses, including display and graphical functionality. The NVIDIA Driver is foundational to the CUDA platform. In addition to CUDA, the NVIDIA Driver provides all other methods of using the GPU, for example Vulkan and Direct3D. The NVIDIA Driver has version numbers such as r580.
The CUDA Toolkit is a set of libraries, headers, and tools for writing, building, and analyzing software which utilizes GPU computing. The CUDA Toolkit is a separate software product from the NVIDIA Driver. The CUDA runtime is a special case of one of the libraries provided by the CUDA Toolkit. The CUDA runtime provides both an API and some language extensions to handle common tasks such as allocating memory, copying data between GPUs and other GPUs or CPUs, and launching kernels. The API components of the CUDA runtime are referred to as the CUDA runtime API.
The CUDA Compatibility document provides full details of compatibility between different GPUs, NVIDIA Drivers, and CUDA Toolkit versions.
1.3.2.1 CUDA Runtime API and CUDA Driver API
The CUDA runtime API is implemented on top of a lower-level API called the CUDA driver API, which is an API exposed by the NVIDIA Driver. This guide focuses on the APIs exposed by the CUDA runtime API. All the same functionality can be achieved using only the driver API if desired. Some features are only available using the driver API. Applications may use either API or both interoperably. Section The CUDA Driver API covers interoperation between the runtime and driver APIs.
The full API reference for the CUDA runtime API functions can be found in the CUDA Runtime API Documentation.
The full API reference for the CUDA driver API can be found in the CUDA Driver API Documentation.
1.3.3. Parallel Thread Execution (PTX)
A fundamental but sometimes invisible layer of the CUDA platform is the Parallel Thread Execution (PTX) virtual instruction set architecture (ISA). PTX is a high-level assembly language for NVIDIA GPUs. PTX provides an abstraction layer over the physical ISA of real GPU hardware. Like other platforms,
applications can be written directly in this assembly language, though doing so can add unnecessary complexity and difficulty to software development.
Domain-specific languages and compilers for high-level languages can generate PTX code as an intermediate representation (IR) and then use NVIDIA's offline or just-in-time (JIT) compilation tools to produce executable binary GPU code. This enables the CUDA platform to be programmable from languages other than just those supported by NVIDIA-provided tools such as NVCC: the NVIDIA CUDA Compiler.
Since GPU capabilities change and grow over time, the PTX virtual ISA specification is versioned. PTX versions, like SM versions, correspond to a compute capability. For example, PTX which supports all the features of compute capability 8.0 is called compute_80.
Full documentation on PTX can be found in the PTX ISA.
1.3.4. Cubins and Fatbins
CUDA applications and libraries are usually written in a higher-level language like C++. That higher-level language is compiled to PTX, and then the PTX is compiled into a real binary for a physical GPU, called a CUDA binary, or cubin for short. A cubin has a specific binary format for a specific SM version, such as sm_120.
Executables and library binaries that use GPU computing contain both CPU and GPU code. The GPU code is stored within a container called a fatbin. Fatbins can contain cubins and PTX for multiple different targets. For example, an application could be built with binaries for multiple different GPU architectures, that is, different SM versions. When an application is run, its GPU code is loaded onto a specific GPU and the best binary for that GPU from the fatbin is used.
Fatbins can also contain one or more PTX versions of GPU code, the use for which is described in PTX Compatibility. Figure 8 shows an example of an application or library binary which contains multiple cubin versions of GPU code as well as one version of PTX code.
1.3.4.1 Binary Compatibility
NVIDIA GPUs guarantee binary compatibility in certain circumstances. Specifically, within a major version of compute capability, GPUs with minor compute capability greater than or equal to the targeted version of a cubin can load and execute that cubin. For example, if an application contains a cubin with code compiled for compute capability 8.6, that cubin can be loaded and executed on GPUs with compute capability 8.6 or 8.9. It cannot, however, be loaded on GPUs with compute capability 8.0, because the GPU's CC minor version, 0, is lower than the code's minor version, 6.
NVIDIA GPUs are not binary compatible between major compute capability versions. That is, cubin code compiled for compute capability 8.6 will not load on GPUs of compute capability 9.0.
When discussing binary code, the binary code is often referred to as having a version such as sm_86 in the above example. This is the same as saying the binary was built for compute capability 8.6. This shorthand is often used because it is how a developer specifies this binary build target to the NVIDIA CUDA compiler, nvcc.
Note
Binary compatibility is promised only for binaries created by NVIDIA tools such as nvcc. Manually editing or generating binary code for NVIDIA GPUs is not supported. Compatibility promises are invalidated if binaries are modified in any way.
Figure 8: The binary for an executable or library contains both CPU binary code and a fatbin container for GPU code. A fatbin can contain both cubin GPU binary code and PTX virtual ISA code. PTX code can be JIT compiled for future targets.
1.3.4.2 PTX Compatibility
GPU code can be stored in executables in binary or PTX form, which is covered in Cubins and Fatbins. When an application stores the PTX version of GPU code, that PTX can be JIT compiled at application runtime for any compute capability equal to or higher than the compute capability of the PTX code. For example, if an application contains PTX for compute_80, that PTX code can be JIT compiled to later SM versions, such as sm_120, at application runtime. This enables forward compatibility with future GPUs without the need to rebuild applications or libraries.
1.3.4.3 Just-in-Time Compilation
PTX code loaded by an application at runtime is compiled to binary code by the device driver. This is called just-in-time (JIT) compilation. Just-in-time compilation increases application load time, but allows the application to benefit from any new compiler improvements coming with each new device driver. It also enables applications to run on devices that did not exist at the time the application was compiled.
When the device driver just-in-time compiles PTX code for an application, it automatically caches a copy of the generated binary code in order to avoid repeating the compilation in subsequent invocations of the application. The cache, called the compute cache, is automatically invalidated when the device driver is upgraded, so that applications can benefit from the improvements in the new just-in-time compiler built into the device driver.
How and when PTX is JIT compiled at runtime has been relaxed since the earliest versions of CUDA, allowing more flexibility for when and if to JIT compile some or all kernels. The section Lazy Loading describes the available options and how to control JIT behavior. There are also a few environment variables which control just-in-time compilation behavior, as described in CUDA Environment Variables.
As an alternative to using nvcc to compile CUDA C++ device code, NVRTC can be used to compile CUDA C++ device code to PTX at runtime. NVRTC is a runtime compilation library for CUDA C++; more information can be found in the NVRTC User Guide.
Chapter 2. Programming GPUs in CUDA
2.1. Intro to CUDA C++
This chapter introduces some of the basic concepts of the CUDA programming model by illustrating how they are exposed in C++.
This programming guide focuses on the CUDA runtime API. The CUDA runtime API is the most commonly used way of using CUDA in C++ and is built on top of the lower-level CUDA driver API.
CUDA Runtime API and CUDA Driver API discusses the difference between the APIs, and CUDA Driver API discusses writing code that mixes the APIs.
This guide assumes the CUDA Toolkit and NVIDIA Driver are installed and that a supported NVIDIA GPU is present. See The CUDA Quickstart Guide for instructions on installing the necessary CUDA components.
2.1.1. Compilation with NVCC
GPU code written in C++ is compiled using the NVIDIA CUDA Compiler, nvcc. nvcc is a compiler driver that simplifies the process of compiling C++ or PTX code: it provides simple and familiar command line options and executes them by invoking the collection of tools that implement the different compilation stages.
This guide will show nvcc command lines which can be used on any Linux system with the CUDA Toolkit installed, at a Windows command line or PowerShell, or on Windows Subsystem for Linux with the CUDA Toolkit. The nvcc chapter of this guide covers common use cases of nvcc, and complete documentation is provided by the nvcc user manual.
2.1.2. Kernels
As mentioned in the introduction to the CUDA Programming Model, functions which execute on the GPU and can be invoked from the host are called kernels. Kernels are written to be run by many parallel threads simultaneously.
2.1.2.1 Specifying Kernels
The code for a kernel is specified using the __global__ declaration specifier. This indicates to the compiler that this function will be compiled for the GPU in a way that allows it to be invoked from a kernel launch. A kernel launch is an operation which starts a kernel running, usually from the CPU. Kernels are functions with a void return type.
// Kernel definition
__global__ void vecAdd(float* A, float* B, float* C)
{
}
2.1.2.2 Launching Kernels
The number of threads that will execute the kernel in parallel is specified as part of the kernel launch. This is called the execution configuration. Different invocations of the same kernel may use different execution configurations, such as a different number of threads or thread blocks.
There are two ways of launching kernels from CPU code: triple chevron notation and cudaLaunchKernelEx. Triple chevron notation, the most common way of launching kernels, is introduced here. An example of launching a kernel using cudaLaunchKernelEx is shown and discussed in detail in Section 3.1.1.
2.1.2.2.1 Triple Chevron Notation
Triple chevron notation is a CUDA C++ Language Extension which is used to launch kernels. It is called triple chevron because it uses three chevron characters to encapsulate the execution configuration for the kernel launch, i.e. <<< >>>. Execution configuration parameters are specified as a comma-separated list inside the chevrons, similar to parameters to a function call. The syntax for a kernel launch of the vecAdd kernel is shown below.
__global__ void vecAdd(float* A, float* B, float* C)
{
}

int main()
{
    ...
    // Kernel invocation
    vecAdd<<<1, 256>>>(A, B, C);
    ...
}
The first two parameters to the triple chevron notation are the grid dimensions and the thread block dimensions, respectively. When using 1-dimensional thread blocks or grids, integers can be used to specify dimensions.
The above code launches a single thread block containing 256 threads. Each thread will execute the exact same kernel code. In Thread and Grid Index Intrinsics, we'll show how each thread can use its index within the thread block and grid to change the data it operates on.
There is a limit to the number of threads per block, since all threads of a block reside on the same streaming multiprocessor (SM) and must share the resources of the SM. On current GPUs, a thread block may contain up to 1024 threads. If resources allow, more than one thread block can be scheduled on an SM simultaneously.
Kernel launches are asynchronous with respect to the host thread. That is, the kernel will be set up for execution on the GPU, but the host code will not wait for the kernel to complete (or even start) executing on the GPU before proceeding. Some form of synchronization between the GPU and CPU
must be used to determine that the kernel has completed. The most basic version, completely synchronizing the entire GPU, is shown in Synchronizing CPU and GPU. More sophisticated methods of synchronization are covered in Asynchronous Execution.
When using 2- or 3-dimensional grids or thread blocks, the CUDA type dim3 is used as the grid and thread block dimension parameters. The code fragment below shows a kernel launch of a MatAdd kernel using a 16 by 16 grid of thread blocks, where each thread block is 8 by 8.
int main()
{
    ...
    dim3 grid(16,16);
    dim3 block(8,8);
    MatAdd<<<grid, block>>>(A, B, C);
    ...
}
2.1.2.3 Thread and Grid Index Intrinsics
Within kernel code, CUDA provides intrinsics to access parameters of the execution configuration and the index of a thread or block.
▶ threadIdx gives the index of a thread within its thread block. Each thread in a thread block will have a different index.
▶ blockDim gives the dimensions of the thread block, which was specified in the execution configuration of the kernel launch.
▶ blockIdx gives the index of a thread block within the grid. Each thread block will have a different index.
▶ gridDim gives the dimensions of the grid, which was specified in the execution configuration when the kernel was launched.
Each of these intrinsics is a 3-component vector with .x, .y, and .z members. Dimensions not specified by a launch configuration will default to 1. threadIdx and blockIdx are zero indexed. That is, threadIdx.x will take on values from 0 up to and including blockDim.x - 1. .y and .z operate the same in their respective dimensions.
Similarly, blockIdx.x will have values from 0 up to and including gridDim.x - 1, and the same for the .y and .z dimensions, respectively.
These allow an individual thread to identify what work it should carry out. Returning to the vecAdd kernel: the kernel takes three parameters, each of which is a vector of floats. The kernel performs an element-wise addition of A and B and stores the result in C. The kernel is parallelized such that each thread will perform one addition. Which element it computes is determined by its thread and grid index.
__global__ void vecAdd(float* A, float* B, float* C)
{
    // calculate which element this thread is responsible for computing
    int workIndex = threadIdx.x + blockDim.x * blockIdx.x;
    // Perform computation
    C[workIndex] = A[workIndex] + B[workIndex];
}

int main()
{
    ...
    // A, B, and C are vectors of 1024 elements
    vecAdd<<<4, 256>>>(A, B, C);
    ...
}
In this example, 4 thread blocks of 256 threads are used to add a vector of 1024 elements. In the first thread block, blockIdx.x will be zero, and so each thread's workIndex will simply be its threadIdx.x. In the second thread block, blockIdx.x will be 1, so blockDim.x * blockIdx.x will be the same as blockDim.x, which is 256 in this case. The workIndex for each thread in the second thread block will be its threadIdx.x + 256. In the third thread block, workIndex will be threadIdx.x + 512.
This computation of workIndex is very common for 1-dimensional parallelizations. Expanding to two or three dimensions often follows the same pattern in each of those dimensions.
2.1.2.3.1 Bounds Checking
The example given above assumes that the length of the vector is a multiple of the thread block size, 256 threads in this case. To make the kernel handle any vector length, we can add checks that the memory access is not exceeding the bounds of the arrays, as shown below, and then launch one thread block which will have some inactive threads.
__global__ void vecAdd(float* A, float* B, float* C, int vectorLength)
{
    // calculate which element this thread is responsible for computing
    int workIndex = threadIdx.x + blockDim.x * blockIdx.x;
    if(workIndex < vectorLength)
    {
        // Perform computation
        C[workIndex] = A[workIndex] + B[workIndex];
    }
}
With the above kernel code, more threads than needed can be launched without causing out-of-bounds accesses to the arrays. When workIndex exceeds vectorLength, threads exit and do not do any work. Launching extra threads in a block that do no work does not incur a large overhead cost; however, launching thread blocks in which no threads do work should be avoided. This kernel can now handle vector lengths which are not a multiple of the block size.
The number of thread blocks which are needed can be calculated as the ceiling of the number of threads needed, the vector length in this case, divided by the number of threads per block. That is, the integer division of the number of threads needed by the number of threads per block, rounded up.
A common way of expressing this as a single integer division is given below. By adding threads - 1 before the integer division, this behaves like a ceiling function, adding another thread block only if the vector length is not divisible by the number of threads per block.

// vectorLength is an integer storing number of elements in the vector
int threads = 256;
int blocks = (vectorLength + threads - 1)/threads;
vecAdd<<<blocks, threads>>>(devA, devB, devC, vectorLength);
The CUDA Core Compute Library (CCCL) provides a convenient utility, cuda::ceil_div, for doing this ceiling divide to calculate the number of blocks needed for a kernel launch. This utility is available by including the header <cuda/cmath>.

// vectorLength is an integer storing number of elements in the vector
int threads = 256;
int blocks = cuda::ceil_div(vectorLength, threads);
vecAdd<<<blocks, threads>>>(devA, devB, devC, vectorLength);
The choice of 256 threads per block here is arbitrary, but this is quite often a good value to start with.
2.1.3. Memory in GPU Computing
In order to use the vecAdd kernel shown above, the arrays A, B, and C must be in memory accessible to the GPU. There are several different ways to do this, two of which will be illustrated here. Other methods will be covered in later sections on unified memory. The memory spaces available to code running on the GPU were introduced in GPU Memory and are covered in more detail in GPU Device Memory Spaces.
2.1.3.1 Unified Memory
Unified memory is a feature of the CUDA runtime which lets the NVIDIA Driver manage movement of data between host and device(s). Memory is allocated using the cudaMallocManaged API or by declaring a variable with the __managed__ specifier. The NVIDIA Driver will make sure that the memory is accessible to the GPU or CPU whenever either tries to access it.
The code below shows a complete function to launch the vecAdd kernel which uses unified memory for the input and output vectors that will be used on the GPU. cudaMallocManaged allocates buffers which can be accessed from either the CPU or the GPU. These buffers are released using cudaFree.
void unifiedMemExample(int vectorLength)
{
    // Pointers to memory vectors
    float* A = nullptr;
    float* B = nullptr;
    float* C = nullptr;
    float* comparisonResult = (float*)malloc(vectorLength*sizeof(float));

    // Use unified memory to allocate buffers
    cudaMallocManaged(&A, vectorLength*sizeof(float));
    cudaMallocManaged(&B, vectorLength*sizeof(float));
    cudaMallocManaged(&C, vectorLength*sizeof(float));

    // Initialize vectors on the host
    initArray(A, vectorLength);
    initArray(B, vectorLength);

    // Launch the kernel. Unified memory will make sure A, B, and C are
    // accessible to the GPU
    int threads = 256;
    int blocks = cuda::ceil_div(vectorLength, threads);
    vecAdd<<<blocks, threads>>>(A, B, C, vectorLength);

    // Wait for the kernel to complete execution
    cudaDeviceSynchronize();
    // Perform computation serially on CPU for comparison
    serialVecAdd(A, B, comparisonResult, vectorLength);

    // Confirm that CPU and GPU got the same answer
    if(vectorApproximatelyEqual(C, comparisonResult, vectorLength))
    {
        printf("Unified Memory: CPU and GPU answers match\n");
    }
    else
    {
        printf("Unified Memory: Error - CPU and GPU answers do not match\n");
    }

    // Clean Up
    cudaFree(A);
    cudaFree(B);
    cudaFree(C);
    free(comparisonResult);
}
Unified memory is supported on all operating systems and GPUs supported by CUDA, though the underlying mechanism and performance may differ based on system architecture. Unified Memory provides more details. On some Linux systems (e.g., those with address translation services or heterogeneous memory management), all system memory is automatically unified memory, and there is no need to use cudaMallocManaged or the __managed__ specifier.
2.1.3.2 Explicit Memory Management
Explicitly managing memory allocation and data migration between memory spaces can help improve application performance, though it does make for more verbose code. The code below explicitly allocates memory on the GPU using cudaMalloc. Memory on the GPU is freed using the same cudaFree API as was used for unified memory in the previous example.
void explicitMemExample(int vectorLength)
{
    // Pointers for host memory
    float* A = nullptr;
    float* B = nullptr;
    float* C = nullptr;
    float* comparisonResult = (float*)malloc(vectorLength*sizeof(float));

    // Pointers for device memory
    float* devA = nullptr;
    float* devB = nullptr;
    float* devC = nullptr;

    // Allocate host memory using the cudaMallocHost API. This is best practice
    // when buffers will be used for copies between CPU and GPU memory
    cudaMallocHost(&A, vectorLength*sizeof(float));
    cudaMallocHost(&B, vectorLength*sizeof(float));
    cudaMallocHost(&C, vectorLength*sizeof(float));

    // Initialize vectors on the host
    initArray(A, vectorLength);
    initArray(B, vectorLength);

    // start-allocate-and-copy
    // Allocate memory on the GPU
    cudaMalloc(&devA, vectorLength*sizeof(float));
    cudaMalloc(&devB, vectorLength*sizeof(float));
    cudaMalloc(&devC, vectorLength*sizeof(float));

    // Copy data to the GPU
    cudaMemcpy(devA, A, vectorLength*sizeof(float), cudaMemcpyDefault);
    cudaMemcpy(devB, B, vectorLength*sizeof(float), cudaMemcpyDefault);
    cudaMemset(devC, 0, vectorLength*sizeof(float));
    // end-allocate-and-copy

    // Launch the kernel
    int threads = 256;
    int blocks = cuda::ceil_div(vectorLength, threads);
    vecAdd<<<blocks, threads>>>(devA, devB, devC, vectorLength);

    // Wait for kernel execution to complete
    cudaDeviceSynchronize();

    // Copy results back to host
    cudaMemcpy(C, devC, vectorLength*sizeof(float), cudaMemcpyDefault);

    // Perform computation serially on CPU for comparison
    serialVecAdd(A, B, comparisonResult, vectorLength);

    // Confirm that CPU and GPU got the same answer
    if(vectorApproximatelyEqual(C, comparisonResult, vectorLength))
    {
        printf("Explicit Memory: CPU and GPU answers match\n");
    }
    else
    {
        printf("Explicit Memory: Error - CPU and GPU answers do not match\n");
    }

    // clean up
    cudaFree(devA);
    cudaFree(devB);
    cudaFree(devC);
    cudaFreeHost(A);
    cudaFreeHost(B);
    cudaFreeHost(C);
    free(comparisonResult);
}
The CUDA API cudaMemcpy is used to copy data from a buffer residing on the CPU to a buffer residing on the GPU. Along with the destination pointer, source pointer, and size in bytes, the final parameter
of cudaMemcpy is a cudaMemcpyKind. This can have values such as cudaMemcpyHostToDevice for copies from the CPU to a GPU, cudaMemcpyDeviceToHost for copies from the GPU to the CPU, or cudaMemcpyDeviceToDevice for copies within a GPU or between GPUs.
In this example, cudaMemcpyDefault is passed as the last argument to cudaMemcpy. This causes CUDA to use the values of the source and destination pointers to determine the type of copy to perform.
The cudaMemcpy API is synchronous. That is, it does not return until the copy has completed. Asynchronous copies are introduced in Launching Memory Transfers in CUDA Streams.
The code uses cudaMallocHost to allocate memory on the CPU. This allocates page-locked memory on the host, which can improve copy performance and is necessary for asynchronous memory transfers. In general, it is good practice to use page-locked memory for CPU buffers that will be used in data transfers to and from GPUs. Performance can degrade on some systems if too much host memory is page-locked. Best practice is to page-lock only buffers which will be used for sending or receiving data from the GPU.
2.1.3.3 Memory Management and Application Performance
As can be seen in the above example, explicit memory management is more verbose, requiring the programmer to specify copies between the host and device. This is both the advantage and the disadvantage of explicit memory management: it affords more control of when data is copied between host and devices, where memory is resident, and exactly what memory is allocated where. Explicit memory management can provide performance opportunities by controlling memory transfers and overlapping them with other computations.
When using unified memory, there are CUDA APIs (which will be covered in Memory Advise and Prefetch) that provide hints to the NVIDIA driver managing the memory; these can enable some of the performance benefits of explicit memory management while still using unified memory.
2.1.4. Synchronizing CPU and GPU
As mentioned in Launching Kernels, kernel launches are asynchronous with respect to the CPU thread which called them. This means the control flow of the CPU thread will continue executing before the kernel has completed, and possibly even before it has launched. In order to guarantee that a kernel has completed execution before proceeding in host code, some synchronization mechanism is necessary.
The simplest way to synchronize the GPU and a host thread is with the use of cudaDeviceSynchronize, which blocks the host thread until all previously issued work on the GPU has completed. In the examples of this chapter this is sufficient because only single operations are being executed on the GPU. In larger applications, there may be multiple streams executing work on the GPU, and cudaDeviceSynchronize will wait for work in all streams to complete. In these applications, using Stream Synchronization APIs to synchronize only with a specific stream or CUDA Events is recommended. These will be covered in detail in the Asynchronous Execution chapter.
2.1.5. Putting it All Together
The following listings show the entire code for the simple vector addition kernel introduced in this chapter, along with all host code and utility functions for verifying that the answer obtained is correct. These examples default to using a vector length of 1024, but accept a different vector length as a command line argument to the executable.
Chapter 2. Programming GPUs in CUDA
CUDA Programming Guide, Release 13.1
Unified Memory

#include <cuda_runtime_api.h>
#include <memory.h>
#include <cstdlib>
#include <ctime>
#include <stdio.h>
#include <cuda/cmath>

__global__ void vecAdd(float* A, float* B, float* C, int vectorLength)
{
    int workIndex = threadIdx.x + blockIdx.x * blockDim.x;
    if (workIndex < vectorLength)
    {
        C[workIndex] = A[workIndex] + B[workIndex];
    }
}

void initArray(float* A, int length)
{
    std::srand(std::time({}));
    for (int i = 0; i < length; i++)
    {
        A[i] = rand() / (float)RAND_MAX;
    }
}

void serialVecAdd(float* A, float* B, float* C, int length)
{
    for (int i = 0; i < length; i++)
    {
        C[i] = A[i] + B[i];
    }
}

bool vectorApproximatelyEqual(float* A, float* B, int length, float epsilon = 0.00001)
{
    for (int i = 0; i < length; i++)
    {
        if (fabs(A[i] - B[i]) > epsilon)
        {
            printf("Index %d mismatch: %f != %f", i, A[i], B[i]);
            return false;
        }
    }
    return true;
}

//unified-memory-begin
void unifiedMemExample(int vectorLength)
{
    // Pointers to memory vectors
    float* A = nullptr;
    float* B = nullptr;
    float* C = nullptr;
    float* comparisonResult = (float*)malloc(vectorLength * sizeof(float));

    // Use unified memory to allocate buffers
    cudaMallocManaged(&A, vectorLength * sizeof(float));
    cudaMallocManaged(&B, vectorLength * sizeof(float));
    cudaMallocManaged(&C, vectorLength * sizeof(float));

    // Initialize vectors on the host
    initArray(A, vectorLength);
    initArray(B, vectorLength);

    // Launch the kernel. Unified memory will make sure A, B, and C are
    // accessible to the GPU
    int threads = 256;
    int blocks = cuda::ceil_div(vectorLength, threads);
    vecAdd<<<blocks, threads>>>(A, B, C, vectorLength);

    // Wait for the kernel to complete execution
    cudaDeviceSynchronize();

    // Perform computation serially on CPU for comparison
    serialVecAdd(A, B, comparisonResult, vectorLength);

    // Confirm that CPU and GPU got the same answer
    if (vectorApproximatelyEqual(C, comparisonResult, vectorLength))
    {
        printf("Unified Memory: CPU and GPU answers match\n");
    }
    else
    {
        printf("Unified Memory: Error - CPU and GPU answers do not match\n");
    }

    // Clean up
    cudaFree(A);
    cudaFree(B);
    cudaFree(C);
    free(comparisonResult);
}
//unified-memory-end

int main(int argc, char** argv)
{
    int vectorLength = 1024;
    if (argc >= 2)
    {
        vectorLength = std::atoi(argv[1]);
    }
    unifiedMemExample(vectorLength);
    return 0;
}
Explicit Memory Management

#include <cuda_runtime_api.h>
#include <memory.h>
#include <cstdlib>
#include <ctime>
#include <stdio.h>
#include <cuda/cmath>

__global__ void vecAdd(float* A, float* B, float* C, int vectorLength)
{
    int workIndex = threadIdx.x + blockIdx.x * blockDim.x;
    if (workIndex < vectorLength)
    {
        C[workIndex] = A[workIndex] + B[workIndex];
    }
}

void initArray(float* A, int length)
{
    std::srand(std::time({}));
    for (int i = 0; i < length; i++)
    {
        A[i] = rand() / (float)RAND_MAX;
    }
}

void serialVecAdd(float* A, float* B, float* C, int length)
{
    for (int i = 0; i < length; i++)
    {
        C[i] = A[i] + B[i];
    }
}

bool vectorApproximatelyEqual(float* A, float* B, int length, float epsilon = 0.00001)
{
    for (int i = 0; i < length; i++)
    {
        if (fabs(A[i] - B[i]) > epsilon)
        {
            printf("Index %d mismatch: %f != %f", i, A[i], B[i]);
            return false;
        }
    }
    return true;
}

//explicit-memory-begin
void explicitMemExample(int vectorLength)
{
    // Pointers for host memory
    float* A = nullptr;
    float* B = nullptr;
    float* C = nullptr;
    float* comparisonResult = (float*)malloc(vectorLength * sizeof(float));

    // Pointers for device memory
    float* devA = nullptr;
    float* devB = nullptr;
    float* devC = nullptr;

    // Allocate host memory using the cudaMallocHost API. This is best practice
    // when buffers will be used for copies between CPU and GPU memory
    cudaMallocHost(&A, vectorLength * sizeof(float));
    cudaMallocHost(&B, vectorLength * sizeof(float));
    cudaMallocHost(&C, vectorLength * sizeof(float));

    // Initialize vectors on the host
    initArray(A, vectorLength);
    initArray(B, vectorLength);

    // start-allocate-and-copy
    // Allocate memory on the GPU
    cudaMalloc(&devA, vectorLength * sizeof(float));
    cudaMalloc(&devB, vectorLength * sizeof(float));
    cudaMalloc(&devC, vectorLength * sizeof(float));

    // Copy data to the GPU
    cudaMemcpy(devA, A, vectorLength * sizeof(float), cudaMemcpyDefault);
    cudaMemcpy(devB, B, vectorLength * sizeof(float), cudaMemcpyDefault);
    cudaMemset(devC, 0, vectorLength * sizeof(float));
    // end-allocate-and-copy

    // Launch the kernel
    int threads = 256;
    int blocks = cuda::ceil_div(vectorLength, threads);
    vecAdd<<<blocks, threads>>>(devA, devB, devC, vectorLength);

    // Wait for kernel execution to complete
    cudaDeviceSynchronize();

    // Copy results back to host
    cudaMemcpy(C, devC, vectorLength * sizeof(float), cudaMemcpyDefault);

    // Perform computation serially on CPU for comparison
    serialVecAdd(A, B, comparisonResult, vectorLength);

    // Confirm that CPU and GPU got the same answer
    if (vectorApproximatelyEqual(C, comparisonResult, vectorLength))
    {
        printf("Explicit Memory: CPU and GPU answers match\n");
    }
    else
    {
        printf("Explicit Memory: Error - CPU and GPU answers do not match\n");
    }

    // Clean up
    cudaFree(devA);
    cudaFree(devB);
    cudaFree(devC);
    cudaFreeHost(A);
    cudaFreeHost(B);
    cudaFreeHost(C);
    free(comparisonResult);
}
//explicit-memory-end

int main(int argc, char** argv)
{
    int vectorLength = 1024;
    if (argc >= 2)
    {
        vectorLength = std::atoi(argv[1]);
    }
    explicitMemExample(vectorLength);
    return 0;
}
These can be built and run using nvcc as follows:

$ nvcc vecAdd_unifiedMemory.cu -o vecAdd_unifiedMemory
$ ./vecAdd_unifiedMemory
Unified Memory: CPU and GPU answers match
$ ./vecAdd_unifiedMemory 4096
Unified Memory: CPU and GPU answers match
$ nvcc vecAdd_explicitMemory.cu -o vecAdd_explicitMemory
$ ./vecAdd_explicitMemory
Explicit Memory: CPU and GPU answers match
$ ./vecAdd_explicitMemory 4096
Explicit Memory: CPU and GPU answers match
In these examples, all threads are doing independent work and do not need to coordinate or synchronize with each other. Frequently, threads will need to cooperate and communicate with other threads to carry out their work. Threads within a block can share data through shared memory and synchronize to coordinate memory accesses.
The most basic mechanism for synchronization at the block level is the __syncthreads() intrinsic, which acts as a barrier at which all threads in the block must wait before any threads are allowed to proceed. Shared Memory gives an example of using shared memory.
For efficient cooperation, shared memory is expected to be a low-latency memory near each processor core (much like an L1 cache), and __syncthreads() is expected to be lightweight. __syncthreads() only synchronizes the threads within a single thread block. Synchronization between blocks is not supported by the CUDA programming model. Cooperative Groups provides mechanisms to set synchronization domains other than a single thread block.
Best performance is usually achieved when synchronization is kept within a thread block. Thread blocks can still work on common results using atomic memory functions, which will be covered in coming sections.
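As a sketch of these two mechanisms together, the following kernel forms a per-block sum in shared memory, separating phases with __syncthreads(), and then combines the per-block results with atomicAdd. The kernel name blockSum is illustrative; it assumes a launch with 256 threads per block and *total initialized to zero.

__global__ void blockSum(const float* in, float* total, int n)
{
    __shared__ float partial[256];               // one slot per thread in the block
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    partial[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                             // all slots written before any thread reads

    // Tree reduction within the block; each step is separated by a barrier.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
    {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    // Blocks do not synchronize with each other; they combine results atomically.
    if (threadIdx.x == 0)
        atomicAdd(total, partial[0]);
}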
Section 3.2.4 covers CUDA synchronization primitives that provide very fine-grained control for maximizing performance and resource utilization.
2.1.6. Runtime Initialization
The CUDA runtime creates a CUDA context for each device in the system. This context is the primary context for the device and is initialized at the first runtime function which requires an active context on the device. The context is shared among all the host threads of the application. As part of context creation, the device code is just-in-time compiled if necessary and loaded into device memory. This all happens transparently. The primary context created by the CUDA runtime can be accessed from the driver API for interoperability, as described in Interoperability between Runtime and Driver APIs.
As of CUDA 12.0, the cudaInitDevice and cudaSetDevice calls initialize the runtime and the primary context associated with the specified device. The runtime will implicitly use device 0 and self-initialize as needed to process runtime API requests if they occur before these calls. This is important when timing runtime function calls and when interpreting the error code from the first call into the runtime. Prior to CUDA 12.0, cudaSetDevice would not initialize the runtime.
cudaDeviceReset destroys the primary context of the current device. If CUDA runtime APIs are called after the primary context has been destroyed, a new primary context for that device will be created.
Note
The CUDA interfaces use global state that is initialized during host program initiation and destroyed during host program termination. Using any of these interfaces (implicitly or explicitly) during program initiation, or during termination after main, will result in undefined behavior.
As of CUDA 12.0, cudaSetDevice explicitly initializes the runtime, if it has not already been initialized, after changing the current device for the host thread. Previous versions of CUDA delayed runtime initialization on the new device until the first runtime call was made after cudaSetDevice. Because of this, it is very important to check the return value of cudaSetDevice for initialization errors.
The runtime functions from the error handling and version management sections of the reference manual do not initialize the runtime.
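A minimal sketch of checking cudaSetDevice for initialization errors (the device index 1 is illustrative):

// As of CUDA 12.0, this call may perform full runtime initialization for
// the selected device, so its return value must be checked.
cudaError_t err = cudaSetDevice(1);
if (err != cudaSuccess)
{
    fprintf(stderr, "cudaSetDevice failed: %s\n", cudaGetErrorString(err));
    return 1;
}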
2.1.7. Error Checking in CUDA
Every CUDA API returns a value of an enumerated type, cudaError_t. In example code these errors are often not checked. In production applications, it is best practice to always check and manage the return value of every CUDA API call. When there are no errors, the value returned is cudaSuccess. Many applications choose to implement a utility macro such as the one shown below:
#define CUDA_CHECK(expr_to_check) do {                 \
    cudaError_t result = expr_to_check;                \
    if(result != cudaSuccess)                          \
    {                                                  \
        fprintf(stderr,                                \
                "CUDA Runtime Error: %s:%i:%d = %s\n", \
                __FILE__,                              \
                __LINE__,                              \
                result,                                \
                cudaGetErrorString(result));           \
    }                                                  \
} while(0)
This macro uses the cudaGetErrorString API, which returns a human-readable string describing the meaning of a specific cudaError_t value. Using the above macro, an application would wrap CUDA runtime API calls in a CUDA_CHECK(expression) macro, as shown below:
| CUDA_CHECK(cudaMalloc(&devA, vectorLength*sizeof(float))); | |
| CUDA_CHECK(cudaMalloc(&devB, vectorLength*sizeof(float))); | |
| CUDA_CHECK(cudaMalloc(&devC, vectorLength*sizeof(float))); | |
If any of these calls detects an error, it will be printed to stderr using this macro. This macro is common for smaller projects, but can be adapted to a logging system or other error handling mechanism in larger applications.
Note
It is important to note that the error state returned from any CUDA API call can also indicate an error from a previously issued asynchronous operation. Section Asynchronous Error Handling covers this in more detail.
2.1.7.1 Error State
The CUDA runtime maintains a cudaError_t state for each host thread. The value defaults to cudaSuccess and is overwritten whenever an error occurs. cudaGetLastError returns the current error state and then resets it to cudaSuccess. Alternatively, cudaPeekAtLastError returns the error state without resetting it.
Kernel launches using triple chevron notation do not return a cudaError_t. It is good practice to check the error state immediately after kernel launches to detect immediate errors in the kernel launch, or asynchronous errors from before the kernel launch. A value of cudaSuccess when checking the error state immediately after a kernel launch does not mean the kernel has executed successfully, or even started execution. It only verifies that the kernel launch parameters and execution configuration passed to the runtime did not trigger any errors, and that the error state does not hold a previous or asynchronous error from before the kernel started.
2.1.7.2 Asynchronous Errors
CUDA kernel launches and many runtime APIs are asynchronous. Asynchronous CUDA runtime APIs will be discussed in detail in Asynchronous Execution. The CUDA error state is set and overwritten whenever an error occurs. This means that errors which occur during the execution of asynchronous operations will only be reported when the error state is next examined. As noted, this may be a call to cudaGetLastError or cudaPeekAtLastError, or it could be any CUDA API which returns cudaError_t.
When errors are returned by CUDA runtime API functions, the error state is not cleared. This means that the error code from an asynchronous error, such as an invalid memory access by a kernel, will be returned by every CUDA runtime API until the error state has been cleared by calling cudaGetLastError.
vecAdd<<<blocks, threads>>>(devA, devB, devC, vectorLength);
// check error state after kernel launch
CUDA_CHECK(cudaGetLastError());
// wait for kernel execution to complete
// The CUDA_CHECK will report errors that occurred during execution of the kernel
CUDA_CHECK(cudaDeviceSynchronize());
Note
The cudaError_t value cudaErrorNotReady, which may be returned by cudaStreamQuery and cudaEventQuery, is not considered an error and is not reported by cudaPeekAtLastError or cudaGetLastError.
2.1.7.3 CUDA_LOG_FILE
Another good way to identify CUDA errors is with the CUDA_LOG_FILE environment variable. When this environment variable is set, the CUDA driver will write any error messages encountered to a file whose path is specified in the environment variable. For example, take the following incorrect CUDA code, which attempts to launch a thread block which is larger than the maximum supported by any architecture.
__global__ void k()
{ }

int main()
{
    k<<<8192, 4096>>>(); // Invalid block size
    CUDA_CHECK(cudaGetLastError());
    return 0;
}
Building and running this, the check after the kernel launch detects and reports the error using the macro illustrated in Section 2.1.7.
$ nvcc errorLogIllustration.cu -o errlog
$ ./errlog
CUDA Runtime Error: /home/cuda/intro-cpp/errorLogIllustration.cu:24:1 = invalid argument
However, when the application is run with CUDA_LOG_FILE set to a text file, that file contains a bit more information about the error.
$ env CUDA_LOG_FILE=cudaLog.txt ./errlog
CUDA Runtime Error: /home/cuda/intro-cpp/errorLogIllustration.cu:24:1 = invalid argument
$ cat cudaLog.txt
[12:46:23.854][137216133754880][CUDA][E] One or more of block dimensions of (4096,1,1) exceeds corresponding maximum value of (1024,1024,64)
[12:46:23.854][137216133754880][CUDA][E] Returning 1 (CUDA_ERROR_INVALID_VALUE) from cuLaunchKernel
Setting CUDA_LOG_FILE to stdout or stderr will print to standard out and standard error, respectively. Using the CUDA_LOG_FILE environment variable, it is possible to capture and identify CUDA errors even if the application does not implement proper error checking on CUDA return values. This approach can be extremely powerful for debugging, but the environment variable alone does not allow an application to handle and recover from CUDA errors at runtime. The error log management feature of CUDA also allows a callback function to be registered with the driver, which will be called whenever an error is detected. This can be used to capture and handle errors at runtime, and also to integrate CUDA error logging seamlessly into an application's existing logging system.
Section 4.8 shows more examples of the error log management feature of CUDA. Error log management and CUDA_LOG_FILE are available with NVIDIA Driver version r570 and later.
2.1.8. Device and Host Functions
The __global__ specifier is used to indicate the entry point for a kernel, that is, a function which will be invoked for parallel execution on the GPU. Most often, kernels are launched from the host; however, it is possible to launch a kernel from within another kernel using dynamic parallelism.
The specifier __device__ indicates that a function should be compiled for the GPU and be callable from other __device__ or __global__ functions. A function, including class member functions, functors, and lambdas, can be specified as both __device__ and __host__, as in the example below.
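A minimal sketch of such a dual-specified function (the helper name clampf is illustrative, not a CUDA API):

// Compiled twice: once for the host, once for the device. Callable from
// ordinary host code as well as from __global__ and __device__ functions.
__host__ __device__ float clampf(float x, float lo, float hi)
{
    return x < lo ? lo : (x > hi ? hi : x);
}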
2.1.9. Variable Specifiers
CUDA specifiers can be used on static variable declarations to control placement.
▶ __device__ specifies that a variable is stored in Global Memory
▶ __constant__ specifies that a variable is stored in Constant Memory
▶ __managed__ specifies that a variable is stored as Unified Memory
▶ __shared__ specifies that a variable is stored in Shared Memory
When a variable is declared with no specifier inside a __device__ or __global__ function, it is allocated to registers when possible, and local memory when necessary. Any variable declared with no specifier outside a __device__ or __global__ function will be allocated in system memory.
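The placements above can be sketched as follows (the variable and kernel names are illustrative):

__device__   float scaleFactor;       // resides in global memory
__constant__ float coeffs[16];        // resides in constant memory
__managed__  int   totalCount;        // unified memory, accessible from host and device

__global__ void scale(float* data)
{
    __shared__ float tile[256];       // shared memory, one copy per thread block
    float x = data[threadIdx.x];      // no specifier: register, or local memory if needed
    tile[threadIdx.x] = x * scaleFactor + coeffs[threadIdx.x % 16];
    __syncthreads();
    data[threadIdx.x] = tile[threadIdx.x];
}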
2.1.9.1 Detecting Device Compilation
When a function is specified with __host__ __device__, the compiler is instructed to generate both GPU and CPU code for the function. In such functions, it may be desirable to use the preprocessor to specify code only for the GPU or the CPU copy of the function. Checking whether __CUDA_ARCH__ is defined is the most common way of doing this, as illustrated in the example below.
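A minimal sketch of this pattern (the function name is illustrative):

__host__ __device__ const char* whereAmI()
{
#ifdef __CUDA_ARCH__
    return "device";  // this branch is compiled into the GPU copy
#else
    return "host";    // this branch is compiled into the CPU copy
#endif
}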
2.1.10. Thread Block Clusters
From compute capability 9.0 onward, the CUDA programming model includes an optional level of hierarchy called thread block clusters, which are made up of thread blocks. Similar to how threads in a thread block are guaranteed to be co-scheduled on a streaming multiprocessor, thread blocks in a cluster are also guaranteed to be co-scheduled on a GPU Processing Cluster (GPC) in the GPU.
Similar to thread blocks, clusters are also organized into a one-dimensional, two-dimensional, or three-dimensional grid of thread block clusters, as illustrated by Figure 5.
The number of thread blocks in a cluster can be user-defined; a maximum of 8 thread blocks in a cluster is supported as a portable cluster size in CUDA. Note that on GPU hardware or MIG configurations which are too small to support 8 multiprocessors, the maximum cluster size will be reduced accordingly. Identification of these smaller configurations, as well as of larger configurations supporting a thread block cluster size beyond 8, is architecture-specific and can be queried using the cudaOccupancyMaxPotentialClusterSize API.
All the thread blocks in the cluster are guaranteed to be co-scheduled to execute simultaneously on a single GPU Processing Cluster (GPC), which allows thread blocks in the cluster to perform hardware-supported synchronization using the cooperative groups API cluster.sync(). The cluster group also provides member functions to query the cluster group size in terms of number of threads or number of blocks using the num_threads() and num_blocks() APIs respectively. The rank of a thread or block in the cluster group can be queried using the thread_rank() and block_rank() APIs, and the cluster dimensions using dim_threads() and dim_blocks().
Thread blocks that belong to a cluster have access to distributed shared memory, which is the combined shared memory of all thread blocks in the cluster. Thread blocks in a cluster have the ability to read, write, and perform atomics to any address in the distributed shared memory. Distributed Shared Memory gives an example of performing histograms in distributed shared memory.
Note
In a kernel launched using cluster support, the gridDim variable still denotes the size in terms of number of thread blocks, for compatibility purposes. The rank of a block in a cluster can be found using the Cooperative Groups API.
2.1.10.1 Launching with Clusters in Triple Chevron Notation
A thread block cluster can be enabled in a kernel either at compile time with the kernel attribute __cluster_dims__(X,Y,Z), or at launch time using the CUDA kernel launch API cudaLaunchKernelEx. The example below shows how to launch a cluster using a compile-time kernel attribute. The cluster size set via the kernel attribute is fixed at compile time, and the kernel can then be launched using the classical <<< , >>> notation. If a kernel uses a compile-time cluster size, the cluster size cannot be modified when launching the kernel.
// Kernel definition
// Compile time cluster size 2 in X-dimension and 1 in Y and Z dimension
__global__ void __cluster_dims__(2, 1, 1) cluster_kernel(float *input, float* output)
{
}

int main()
{
    float *input, *output;
    // Kernel invocation with compile time cluster size
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);

    // The grid dimension is not affected by cluster launch, and is still
    // enumerated using number of blocks.
    // The grid dimension must be a multiple of cluster size.
    cluster_kernel<<<numBlocks, threadsPerBlock>>>(input, output);
}
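The launch-time alternative, cudaLaunchKernelEx, passes the cluster dimension as a launch attribute instead; a sketch, assuming cluster_kernel is compiled without the __cluster_dims__ attribute:

// Launch-time cluster configuration via cudaLaunchKernelEx.
cudaLaunchConfig_t config = {0};
config.gridDim  = numBlocks;         // still expressed in thread blocks
config.blockDim = threadsPerBlock;

cudaLaunchAttribute attrs[1];
attrs[0].id = cudaLaunchAttributeClusterDimension;
attrs[0].val.clusterDim.x = 2;       // 2 thread blocks per cluster in X
attrs[0].val.clusterDim.y = 1;
attrs[0].val.clusterDim.z = 1;
config.attrs    = attrs;
config.numAttrs = 1;

cudaLaunchKernelEx(&config, cluster_kernel, input, output);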
2.2. Writing CUDA SIMT Kernels
CUDA C++ kernels can largely be written in the same way that traditional CPU code would be written for a given problem. However, there are some unique features of the GPU that can be used to improve performance. Additionally, some understanding of how threads on the GPU are scheduled, how they access memory, and how their execution proceeds can help developers write kernels that maximize utilization of the available computing resources.
2.2.1. Basics of SIMT
From the developer's perspective, the CUDA thread is the fundamental unit of parallelism. Warps and SIMT describes the basic SIMT model of GPU execution, and SIMT Execution Model provides additional details of the SIMT model. The SIMT model allows each thread to maintain its own state and control flow. From a functional perspective, each thread can execute a separate code path. However, substantial performance improvements can be realized by taking care that kernel code minimizes the situations where threads in the same warp take divergent code paths.
2.2.2. Thread Hierarchy
Threads are organized into thread blocks, which are then organized into a grid. Grids may be 1, 2, or 3 dimensional, and the size of the grid can be queried inside a kernel with the gridDim built-in variable. Thread blocks may also be 1, 2, or 3 dimensional. The size of the thread block can be queried inside a kernel with the blockDim built-in variable. The index of the thread block can be queried with the blockIdx built-in variable. Within a thread block, the index of the thread is obtained using the threadIdx built-in variable. These built-in variables are used to compute a unique global thread index for each thread, thereby enabling each thread to load/store specific data from global memory and execute a unique code path as needed.
▶ gridDim.[x|y|z]: Size of the grid in the x, y and z dimension respectively. These values are set at kernel launch.
▶ blockDim.[x|y|z]: Size of the block in the x, y and z dimension respectively. These values are set at kernel launch.
▶ blockIdx.[x|y|z]: Index of the block in the x, y and z dimension respectively. These values change depending on which block is executing.
▶ threadIdx.[x|y|z]: Index of the thread in the x, y and z dimension respectively. These values change depending on which thread is executing.
The use of multi-dimensional thread blocks and grids is for convenience only and does not affect performance. The threads of a block are linearized predictably: the first index x moves the fastest, followed by y and then z. This means that in the linearization of thread indices, consecutive values of threadIdx.x indicate consecutive threads, threadIdx.y has a stride of blockDim.x, and threadIdx.z has a stride of blockDim.x * blockDim.y. This affects how threads are assigned to warps, as detailed in Hardware Multithreading.
Figure 9 shows a simple example of a 2D grid, with 1D thread blocks.

Figure 9: Grid of Thread Blocks
| | 2.2.3. | GPU Device | Memory | Spaces | | | | |
CUDA devices have several memory spaces that can be accessed by CUDA threads within kernels. Table 1 shows a summary of the common memory types, their thread scopes, and their lifetimes. The following sections explain each of these memory types in more detail.
Table 1: Memory Types, Scopes and Lifetimes

| Memory Type | Scope  | Lifetime    | Location |
| ----------- | ------ | ----------- | -------- |
| Global      | Grid   | Application | Device   |
| Constant    | Grid   | Application | Device   |
| Shared      | Block  | Kernel      | SM       |
| Local       | Thread | Kernel      | Device   |
| Register    | Thread | Kernel      | SM       |
2.2.3.1 Global Memory
Global memory (also called device memory) is the primary memory space for storing data that is accessible by all threads in a kernel. It is similar to RAM in a CPU system. Kernels running on the GPU have direct access to global memory in the same way code running on the CPU has access to system memory.
| | 36 | | | | Chapter2. | ProgrammingGPUsinCUDA | | |
| | --- | --- | --- | --- | --------- | --------------------- | | |
| CUDAProgrammingGuide,Release13.1 | |
Global memory is persistent. That is, an allocation made in global memory and the data stored in it persist until the allocation is freed or until the application is terminated. cudaDeviceReset also frees all allocations.
Global memory is allocated with CUDA API calls such as cudaMalloc and cudaMallocManaged. Data can be copied into global memory from CPU memory using CUDA runtime API calls such as cudaMemcpy. Global memory allocations made with CUDA APIs are freed using cudaFree.
Prior to a kernel launch, global memory is allocated and initialized by CUDA API calls. During kernel execution, data from global memory can be read by the CUDA threads, and the result from operations carried out by CUDA threads can be written back to global memory. Once a kernel has completed execution, the results it wrote to global memory can be copied back to the host or used by other kernels on the GPU.
Because global memory is accessible by all threads in a grid, care must be taken to avoid data races between threads. Since CUDA kernels launched from the host have the return type void, the only way for numerical results computed by a kernel to be returned to the host is by writing those results to global memory.
A simple example illustrating the use of global memory is the vecAdd kernel below, where the three arrays A, B, and C are in global memory and are being accessed by this vector add kernel.
__global__ void vecAdd(float* A, float* B, float* C, int vectorLength)
{
    int workIndex = threadIdx.x + blockIdx.x*blockDim.x;
    if(workIndex < vectorLength)
    {
        C[workIndex] = A[workIndex] + B[workIndex];
    }
}
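On the host side, the allocate, copy, launch, copy-back, free sequence described above might look like the following sketch. Error checking is omitted for brevity; the block size of 256 and the helper name runVecAdd are illustrative assumptions, not part of the example above:

```cpp
#include <cuda_runtime.h>

// Hypothetical host-side driver for the vecAdd kernel shown above.
void runVecAdd(const float* hostA, const float* hostB, float* hostC, int n)
{
    size_t bytes = n * sizeof(float);
    float *A, *B, *C;

    // Allocate global memory on the device.
    cudaMalloc(&A, bytes);
    cudaMalloc(&B, bytes);
    cudaMalloc(&C, bytes);

    // Copy input data from host memory into global memory.
    cudaMemcpy(A, hostA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(B, hostB, bytes, cudaMemcpyHostToDevice);

    // Launch enough blocks of 256 threads to cover all n elements.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(A, B, C, n);

    // Copy the results written to global memory back to the host.
    cudaMemcpy(hostC, C, bytes, cudaMemcpyDeviceToHost);

    cudaFree(A);
    cudaFree(B);
    cudaFree(C);
}
```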
2.2.3.2 Shared Memory
Shared memory is a memory space that is accessible by all threads in a thread block. It is physically located on each SM and uses the same physical resource as the L1 cache, the unified data cache. The data in shared memory persists throughout the kernel execution. Shared memory can be considered a user-managed scratchpad for use during kernel execution. While small in size compared to global memory, because shared memory is located on each SM, the bandwidth is higher and the latency is lower than accessing global memory.
Since shared memory is accessible by all threads in a thread block, care must be taken to avoid data races between threads in the same thread block. Synchronization between threads in the same thread block can be achieved using the __syncthreads() function. This function blocks all threads in the thread block until all threads have reached the call to __syncthreads().
// assuming blockDim.x is 128
__global__ void example_syncthreads(int* input_data, int* output_data) {
    __shared__ int shared_data[128];

    // Every thread writes to a distinct element of 'shared_data':
    shared_data[threadIdx.x] = input_data[threadIdx.x];

    // All threads synchronize, guaranteeing all writes to 'shared_data' are ordered
    // before any thread is unblocked from '__syncthreads()':
    __syncthreads();

    // A single thread safely reads 'shared_data':
    if (threadIdx.x == 0) {
        int sum = 0;
        for (int i = 0; i < blockDim.x; ++i) {
            sum += shared_data[i];
        }
        output_data[blockIdx.x] = sum;
    }
}
The size of shared memory varies depending on the GPU architecture being used. Because shared memory and L1 cache share the same physical space, using shared memory reduces the size of the usable L1 cache for a kernel. Additionally, if no shared memory is used by the kernel, the entire physical space will be utilized by L1 cache. The CUDA runtime API provides functions to query the shared memory size on a per-SM basis and a per-thread-block basis, using the cudaGetDeviceProperties function and investigating the cudaDeviceProp.sharedMemPerMultiprocessor and cudaDeviceProp.sharedMemPerBlock device properties.
The CUDA runtime API provides a function cudaFuncSetCacheConfig to tell the runtime whether to allocate more space to shared memory, or more space to L1 cache. This function specifies a preference to the runtime, but is not guaranteed to be honored. The runtime is free to make decisions based on the available resources and the needs of the kernel.
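Putting the query and the cache-configuration call together, a host-side sketch might look like the following. The kernel name myKernel is a hypothetical placeholder, and the cudaFuncCachePreferShared preference is one of the options the runtime may or may not honor:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() { }  // hypothetical kernel, for illustration only

void configureSharedMemory()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    printf("Shared memory per SM:    %zu bytes\n", prop.sharedMemPerMultiprocessor);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);

    // Ask the runtime to favor shared memory over L1 cache for this kernel.
    // This is a preference only; the runtime is not required to honor it.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
}
```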
Shared memory can be allocated both statically and dynamically.

2.2.3.2.1 Static Allocation of Shared Memory
To allocate shared memory statically, the programmer must declare a variable inside the kernel using the __shared__ specifier. The variable will be allocated in shared memory and will persist for the duration of the kernel execution. The size of the shared memory declared in this way must be specified at compile time. For example, the following code snippet, located in the body of the kernel, declares a shared memory array of type float with 1024 elements.

__shared__ float sharedArray[1024];
After this declaration, all the threads in the thread block will have access to this shared memory array. Care must be taken to avoid data races between threads in the same thread block, typically with the use of __syncthreads().
2.2.3.2.2 Dynamic Allocation of Shared Memory
To allocate shared memory dynamically, the programmer can specify the desired amount of shared memory per thread block in bytes as the third (and optional) argument to the kernel launch in the triple chevron notation, like this: functionName<<<grid, block, sharedMemoryBytes>>>(). Then, inside the kernel, the programmer can use the extern __shared__ specifier to declare a variable that will be allocated dynamically at kernel launch.

extern __shared__ float sharedArray[];
One caveat is that if one wants multiple dynamically allocated shared memory arrays, the single extern __shared__ array must be partitioned manually using pointer arithmetic. For example, if one wants the equivalent of the following,
short array0[128];
float array1[64];
int array2[256];

in dynamically allocated shared memory, one could declare and initialize the arrays in the following way:
extern __shared__ float array[];
short* array0 = (short*)array;
float* array1 = (float*)&array0[128];
int* array2 = (int*)&array1[64];
Note that pointers need to be aligned to the type they point to, so the following code, for example, does not work, since array1 is not aligned to 4 bytes.

extern __shared__ float array[];
short* array0 = (short*)array;
float* array1 = (float*)&array0[127];
2.2.3.3 Registers
Registers are located on the SM and have thread local scope. Register usage is managed by the compiler, and registers are used for thread local storage during the execution of a kernel. The number of registers per SM and the number of registers per thread block can be queried using the regsPerMultiprocessor and regsPerBlock device properties of the GPU.
NVCC allows the developer to specify a maximum number of registers to be used by a kernel via the -maxrregcount option. Using this option to reduce the number of registers a kernel can use may result in more thread blocks being scheduled on the SM concurrently, but may also result in more register spilling.
2.2.3.4 Local Memory
Local memory is thread local storage similar to registers and managed by NVCC, but the physical location of local memory is in the global memory space. The 'local' label refers to its logical scope, not its physical location. Local memory is used for thread local storage during the execution of a kernel. Automatic variables that the compiler is likely to place in local memory are:
▶ Arrays for which it cannot determine that they are indexed with constant quantities,
▶ Large structures or arrays that would consume too much register space,
▶ Any variable if the kernel uses more registers than available, that is, register spilling.
Because the local memory space resides in device memory, local memory accesses have the same latency and bandwidth as global memory accesses and are subject to the same requirements for memory coalescing as described in Coalesced Global Memory Access. Local memory is, however, organized such that consecutive 32-bit words are accessed by consecutive thread IDs. Accesses are therefore fully coalesced as long as all threads in a warp access the same relative address, such as the same index in an array variable or the same member in a structure variable.
2.2.3.5 Constant Memory
Constant memory has a grid scope and is accessible for the lifetime of the application. The constant memory resides on the device and is read-only to the kernel. As such, it must be declared and initialized on the host with the __constant__ specifier, outside any function.
The __constant__ memory space specifier declares a variable that:
▶ Resides in constant memory space,
▶ Has the lifetime of the CUDA context in which it is created,
▶ Has a distinct object per device,
▶ Is accessible from all the threads within the grid and from the host through the runtime library (cudaGetSymbolAddress() / cudaGetSymbolSize() / cudaMemcpyToSymbol() / cudaMemcpyFromSymbol()).
The total amount of constant memory can be queried with the totalConstMem device property element.

Constant memory is useful for small amounts of data that each thread will use in a read-only fashion. Constant memory is small relative to other memories, typically 64 KB per device.
An example snippet of declaring and using constant memory follows.
// In your .cu file
__constant__ float coeffs[4];

__global__ void compute(float *out) {
    int idx = threadIdx.x;
    out[idx] = coeffs[0] * idx + coeffs[1];
}

// In your host code
float h_coeffs[4] = {1.0f, 2.0f, 3.0f, 4.0f};
cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));
compute<<<1, 10>>>(device_out);
2.2.3.6 Caches
GPU devices have a multi-level cache structure which includes L2 and L1 caches.

The L2 cache is located on the device and is shared among all the SMs. The size of the L2 cache can be queried with the l2CacheSize device property element from the function cudaGetDeviceProperties.
As described above in Shared Memory, L1 cache is physically located on each SM and is the same physical space used by shared memory. If no shared memory is utilized by a kernel, the entire physical space will be utilized by the L1 cache.
The L2 and L1 caches can be controlled via functions that allow the developer to specify various caching behaviors. The details of these functions are found in Configuring L1/Shared Memory Balance, L2 Cache Control, and Low-Level Load and Store Functions.

If these hints are not used, the compiler and runtime will do their best to utilize the caches efficiently.
| | 40 | | | | Chapter2. | ProgrammingGPUsinCUDA | | |
| | --- | --- | --- | --- | --------- | --------------------- | | |
| CUDAProgrammingGuide,Release13.1 | |
2.2.3.7 Texture and Surface Memory
Note

Some older CUDA code may use texture memory because, in older NVIDIA GPUs, doing so provided performance benefits in some scenarios. On all currently supported GPUs, these scenarios may be handled using direct load and store instructions, and use of texture and surface memory instructions no longer provides any performance benefit.
A GPU may have specialized instructions for loading data from an image to be used as textures in 3D rendering. CUDA exposes these instructions and the machinery to use them in the texture object API and the surface object API.

Texture and surface memory are not discussed further in this guide, as there is no advantage to using them in CUDA on any currently supported NVIDIA GPU. CUDA developers should feel free to ignore these APIs. For developers working on existing codebases which still use them, explanations of these APIs can still be found in the legacy CUDA C++ Programming Guide.
2.2.3.8 Distributed Shared Memory
Thread Block Clusters, introduced in compute capability 9.0 and facilitated by Cooperative Groups, provide the ability for threads in a thread block cluster to access shared memory of all the participating thread blocks in that cluster. This partitioned shared memory is called Distributed Shared Memory, and the corresponding address space is called the Distributed Shared Memory address space. Threads that belong to a thread block cluster can read, write or perform atomics in the distributed address space, regardless of whether the address belongs to the local thread block or a remote thread block. Whether a kernel uses distributed shared memory or not, the shared memory size specification, static or dynamic, is still per thread block. The size of distributed shared memory is just the number of thread blocks per cluster multiplied by the size of shared memory per thread block.
Accessing data in distributed shared memory requires all the thread blocks to exist. A user can guarantee that all thread blocks have started executing using cluster.sync() from class cluster_group. The user also needs to ensure that all distributed shared memory operations happen before the exit of a thread block, e.g., if a remote thread block is trying to read a given thread block's shared memory, the program needs to ensure that the shared memory read by the remote thread block is completed before it can exit.
Let's look at a simple histogram computation and how to optimize it on the GPU using a thread block cluster. A standard way of computing histograms is to perform the computation in the shared memory of each thread block and then perform global memory atomics. A limitation of this approach is the shared memory capacity. Once the histogram bins no longer fit in the shared memory, a user needs to directly compute histograms, and hence the atomics, in the global memory. With distributed shared memory, CUDA provides an intermediate step, where depending on the histogram bin count, the histogram can be computed in shared memory, distributed shared memory or global memory directly.

The CUDA kernel example below shows how to compute histograms in shared memory or distributed shared memory, depending on the number of histogram bins.
#include <cooperative_groups.h>

// Distributed Shared memory histogram kernel
__global__ void clusterHist_kernel(int *bins, const int nbins, const int bins_per_block,
                                   const int *__restrict__ input, size_t array_size)
{
    extern __shared__ int smem[];
    namespace cg = cooperative_groups;
    int tid = cg::this_grid().thread_rank();

    // Cluster initialization, size and calculating local bin offsets.
    cg::cluster_group cluster = cg::this_cluster();
    unsigned int clusterBlockRank = cluster.block_rank();
    int cluster_size = cluster.dim_blocks().x;

    for (int i = threadIdx.x; i < bins_per_block; i += blockDim.x)
    {
        smem[i] = 0; // Initialize shared memory histogram to zeros
    }

    // cluster synchronization ensures that shared memory is initialized to zero in
    // all thread blocks in the cluster. It also ensures that all thread blocks
    // have started executing and they exist concurrently.
    cluster.sync();

    for (int i = tid; i < array_size; i += blockDim.x * gridDim.x)
    {
        int ldata = input[i];

        // Find the right histogram bin.
        int binid = ldata;
        if (ldata < 0)
            binid = 0;
        else if (ldata >= nbins)
            binid = nbins - 1;

        // Find destination block rank and offset for computing
        // distributed shared memory histogram
        int dst_block_rank = (int)(binid / bins_per_block);
        int dst_offset = binid % bins_per_block;

        // Pointer to target block shared memory
        int *dst_smem = cluster.map_shared_rank(smem, dst_block_rank);

        // Perform atomic update of the histogram bin
        atomicAdd(dst_smem + dst_offset, 1);
    }

    // cluster synchronization is required to ensure all distributed shared
    // memory operations are completed and no thread block exits while
    // other thread blocks are still accessing distributed shared memory
    cluster.sync();

    // Perform global memory histogram, using the local distributed memory histogram
    int *lbins = bins + cluster.block_rank() * bins_per_block;
    for (int i = threadIdx.x; i < bins_per_block; i += blockDim.x)
    {
        atomicAdd(&lbins[i], smem[i]);
    }
}
The above kernel can be launched at runtime with a cluster size depending on the amount of distributed shared memory required. If the histogram is small enough to fit in shared memory of just one block, the user can launch the kernel with cluster size 1. The code snippet below shows how to launch a cluster kernel dynamically based on shared memory requirements.
// Launch via extensible launch
{
    cudaLaunchConfig_t config = {0};
    config.gridDim = array_size / threads_per_block;
    config.blockDim = threads_per_block;

    // cluster_size depends on the histogram size.
    // ( cluster_size == 1 ) implies no distributed shared memory, just thread
    // block local shared memory
    int cluster_size = 2; // size 2 is an example here
    int nbins_per_block = nbins / cluster_size;

    // dynamic shared memory size is per block.
    // Distributed shared memory size = cluster_size * nbins_per_block * sizeof(int)
    config.dynamicSmemBytes = nbins_per_block * sizeof(int);

    CUDA_CHECK(::cudaFuncSetAttribute((void *)clusterHist_kernel,
        cudaFuncAttributeMaxDynamicSharedMemorySize, config.dynamicSmemBytes));

    cudaLaunchAttribute attribute[1];
    attribute[0].id = cudaLaunchAttributeClusterDimension;
    attribute[0].val.clusterDim.x = cluster_size;
    attribute[0].val.clusterDim.y = 1;
    attribute[0].val.clusterDim.z = 1;
    config.numAttrs = 1;
    config.attrs = attribute;

    cudaLaunchKernelEx(&config, clusterHist_kernel, bins, nbins, nbins_per_block,
        input, array_size);
}
| | 2.2.4. | Memory | | Performance | | | | | | |
| | ------ | ------ | --- | ----------- | --- | --- | --- | --- | | |
Ensuring proper memory usage is key to achieving high performance in CUDA kernels. This section discusses some general principles and examples for achieving high memory throughput in CUDA kernels.
2.2.4.1 Coalesced Global Memory Access
Global memory is accessed via 32-byte memory transactions. When a CUDA thread requests a word of data from global memory, the relevant warp coalesces the memory requests from all the threads in that warp into the number of memory transactions necessary to satisfy the request, depending on the size of the word accessed by each thread and the distribution of the memory addresses across the threads. For example, if a thread requests a 4-byte word, the actual memory transaction the warp will generate to global memory will be 32 bytes in total. To use the memory system most efficiently, the warp should use all the memory that is fetched in a single memory transaction. That is, if a thread is requesting a 4-byte word from global memory, and the transaction size is 32 bytes, if other threads in that warp can use other 4-byte words of data from that 32-byte request, this will result in the most efficient use of the memory system.
As a simple example, if consecutive threads in the warp request consecutive 4-byte words in memory, then the warp will request 128 bytes of memory total, and this 128 bytes required will be fetched in four 32-byte memory transactions. This results in 100% utilization of the memory system. That is, 100% of the memory traffic is utilized by the warp. Figure 10 illustrates this example of perfectly coalesced memory access.

Figure 10: Coalesced memory access
Conversely, the pathologically worst case scenario is when consecutive threads access data elements that are 32 bytes or more apart from each other in memory. In this case, the warp will be forced to issue a 32-byte memory transaction for each thread, and the total number of bytes of memory traffic will be 32 bytes times 32 threads/warp = 1024 bytes. However, the amount of memory used will be 128 bytes only (4 bytes for each thread in the warp), so the memory utilization will only be 128/1024 = 12.5%. This is a very inefficient use of the memory system. Figure 11 illustrates this example of uncoalesced memory access.

Figure 11: Uncoalesced memory access
The most straightforward way to achieve coalesced memory access is for consecutive threads to access consecutive elements in memory. For example, for a kernel launched with 1D thread blocks, the following vecAdd kernel will achieve coalesced memory access. Notice how thread workIndex accesses the three arrays, and consecutive threads (indicated by consecutive values of workIndex) access consecutive elements in the arrays.
__global__ void vecAdd(float* A, float* B, float* C, int vectorLength)
{
    int workIndex = threadIdx.x + blockIdx.x*blockDim.x;
    if(workIndex < vectorLength)
    {
        C[workIndex] = A[workIndex] + B[workIndex];
    }
}
There is no requirement that consecutive threads access consecutive elements of memory to achieve coalesced memory access; it is merely the typical way coalescing is achieved. Coalesced memory access occurs provided all the threads in the warp access elements from the same 32-byte segments of memory in some linear or permuted way. Stated another way, the best way to achieve coalesced memory access is to maximize the ratio of bytes used to bytes transferred.
Note

Ensuring proper coalescing of global memory accesses is one of the most important performance considerations for writing performant CUDA kernels. It is imperative that applications use the memory system as efficiently as possible.
2.2.4.1.1 Matrix Transpose Example Using Global Memory
As a simple example, consider an out-of-place matrix transpose kernel that transposes a 32-bit float square matrix of size N x N, from matrix a to matrix c. This example uses a 2D grid, and assumes a launch of 2D thread blocks of size 32 x 32 threads, that is, blockDim.x = 32 and blockDim.y = 32, so each 2D thread block will operate on a 32 x 32 tile of the matrix. Each thread operates on a unique element of the matrix, so no explicit synchronization of threads is necessary. Figure 12 illustrates this matrix transpose operation. The kernel source code follows the figure.

Figure 12: Matrix Transpose using Global memory
The labels on the top and left of each matrix are the 2D thread block indices and also can be considered the tile indices, where each small square indicates a tile of the matrix that will be operated on by a 2D thread block. In this example, the tile size is 32 x 32 elements, so each of the small squares represents a 32 x 32 tile of the matrix. The green shaded square shows the location of an example tile before and after the transpose operation.
/* macro to index a 1D memory array with 2D indices in row-major order */
/* ld is the leading dimension, i.e. the number of columns in the matrix */
#define INDX( row, col, ld ) ( ( (row) * (ld) ) + (col) )

/* CUDA kernel for naive matrix transpose */
__global__ void naive_cuda_transpose(int m, float *a, float *c )
{
    int myCol = blockDim.x * blockIdx.x + threadIdx.x;
    int myRow = blockDim.y * blockIdx.y + threadIdx.y;

    if( myRow < m && myCol < m )
    {
        c[INDX( myCol, myRow, m )] = a[INDX( myRow, myCol, m )];
    } /* end if */
    return;
} /* end naive_cuda_transpose */
To determine whether this kernel is achieving coalesced memory access, one needs to determine whether consecutive threads are accessing consecutive elements of memory. In a 2D thread block, the x index moves the fastest, so consecutive values of threadIdx.x should be accessing consecutive elements of memory. threadIdx.x appears in myCol, and one can observe that when myCol is the second argument to the INDX macro, consecutive threads are reading consecutive values of a, so the read of a is perfectly coalesced.
However, the writing of c is not coalesced, because consecutive values of threadIdx.x (again, examine myCol) are writing elements to c that are ld (leading dimension) elements apart from each other. This is observed because now myCol is the first argument to the INDX macro, and as the first argument to INDX increments by 1, the memory location changes by ld. When ld is larger than 32 (which occurs whenever the matrix sizes are larger than 32), this is equivalent to the pathological case shown in Figure 11.
To alleviate these uncoalesced writes, shared memory can be employed, as described in the next section.
2.2.4.2 Shared Memory Access Patterns
| Sharedmemoryhas32banksthatareorganizedsuchthatsuccessive32-bitwordsmaptosuccessive | |
| banks. Eachbankhasabandwidthof32bitsperclockcycle. | |
| When multiple threads in the same warp attempt to access different elements in the same bank, a | |
| bankconflictoccurs. Inthiscase,theaccesstothedatainthatbankwillbeserializeduntilthedata | |
| inthatbankhasbeenobtainedbyallthethreadsthathaverequestedit. Thisserializationofaccess | |
| resultsinaperformancepenalty. | |
| The two exceptions to this scenario happen when multiple threads in the same warp are accessing | |
| (eitherreadingorwriting)thesamesharedmemorylocation. Forreadaccesses,thewordisbroadcast | |
| totherequestingthreads. Forwriteaccesses,eachsharedmemoryaddressiswrittenbyonlyoneof | |
| thethreads(whichthreadperformsthewriteisundefined). | |
Figure 13 shows some examples of strided access. The red box inside the bank indicates a unique location in shared memory.

Figure 13: Strided Shared Memory Accesses in 32-bit bank size mode.

Left
Linear addressing with a stride of one 32-bit word (no bank conflict).

Middle
Linear addressing with a stride of two 32-bit words (two-way bank conflict).

Right
Linear addressing with a stride of three 32-bit words (no bank conflict).

Figure 14 shows some examples of memory read accesses that involve the broadcast mechanism. The red box inside the bank indicates a unique location in shared memory. If multiple arrows point to the same location, the data is broadcast to all threads that requested it.
Note

Avoiding bank conflicts is an important performance consideration for writing performant CUDA kernels that use shared memory.
2.2.4.2.1 Matrix Transpose Example Using Shared Memory

In the previous example Matrix Transpose Example Using Global Memory, a naive implementation of matrix transpose was illustrated that was functionally correct, but not optimized for efficient use of global memory because the write of the c matrix was not coalesced properly. In this example, shared memory will be treated as a user-managed cache to stage loads and stores from global memory, resulting in coalesced global memory access for both reads and writes.
Example

/* definitions of thread block size in X and Y directions */

#define THREADS_PER_BLOCK_X 32
#define THREADS_PER_BLOCK_Y 32

/* macro to index a 1D memory array with 2D indices in row-major order */
/* ld is the leading dimension, i.e. the number of columns in the matrix */

#define INDX( row, col, ld ) ( ( (row) * (ld) ) + (col) )

/* CUDA kernel for shared memory matrix transpose */
__global__ void smem_cuda_transpose( int m, float *a, float *c )
{
    /* declare a statically allocated shared memory array */
    __shared__ float smemArray[THREADS_PER_BLOCK_X][THREADS_PER_BLOCK_Y];

    /* determine my row tile and column tile index */
    const int tileCol = blockDim.x * blockIdx.x;
    const int tileRow = blockDim.y * blockIdx.y;

    /* read from global memory into shared memory array */
    smemArray[threadIdx.x][threadIdx.y] = a[INDX( tileRow + threadIdx.y, tileCol + threadIdx.x, m )];

    /* synchronize the threads in the thread block */
    __syncthreads();

    /* write the result from shared memory to global memory */
    c[INDX( tileCol + threadIdx.y, tileRow + threadIdx.x, m )] = smemArray[threadIdx.y][threadIdx.x];

    return;
} /* end smem_cuda_transpose */

Figure 14: Irregular Shared Memory Accesses.

Left
Conflict-free access via random permutation.

Middle
Conflict-free access since threads 3, 4, 6, 7, and 9 access the same word within bank 5.

Right
Conflict-free broadcast access (threads access the same word within a bank).
Example with array checks

/* definitions of thread block size in X and Y directions */

#define THREADS_PER_BLOCK_X 32
#define THREADS_PER_BLOCK_Y 32

/* macro to index a 1D memory array with 2D indices in column-major order */
/* ld is the leading dimension, i.e. the number of rows in the matrix */

#define INDX( row, col, ld ) ( ( (col) * (ld) ) + (row) )

/* CUDA kernel for shared memory matrix transpose */
__global__ void smem_cuda_transpose( int m,
                                     float *a,
                                     float *c )
{
    /* declare a statically allocated shared memory array */
    __shared__ float smemArray[THREADS_PER_BLOCK_X][THREADS_PER_BLOCK_Y];

    /* determine my row and column indices for the error checking code */
    const int myRow = blockDim.x * blockIdx.x + threadIdx.x;
    const int myCol = blockDim.y * blockIdx.y + threadIdx.y;

    /* determine my row tile and column tile index */
    const int tileX = blockDim.x * blockIdx.x;
    const int tileY = blockDim.y * blockIdx.y;

    if( myRow < m && myCol < m )
    {
        /* read from global memory into shared memory array */
        smemArray[threadIdx.x][threadIdx.y] = a[INDX( tileX + threadIdx.x, tileY + threadIdx.y, m )];
    } /* end if */

    /* synchronize the threads in the thread block */
    __syncthreads();

    if( myRow < m && myCol < m )
    {
        /* write the result from shared memory to global memory */
        c[INDX( tileY + threadIdx.x, tileX + threadIdx.y, m )] = smemArray[threadIdx.y][threadIdx.x];
    } /* end if */
    return;
} /* end smem_cuda_transpose */
The fundamental performance optimization illustrated in this example is to ensure that when accessing global memory, the memory accesses are coalesced properly. Prior to the execution of the copy, each thread computes its tileRow and tileCol indices. These are the indices for the specific tile that will be operated on, and these tile indices are based on which thread block is executing. Each thread in the same thread block has the same tileRow and tileCol values, so it can be thought of as the starting position of the tile that this specific thread block will operate on.

The kernel then proceeds with each thread block copying a 32 x 32 tile of the matrix from global memory to shared memory with the following statement. Since the size of a warp is 32 threads, this copy operation will be executed by 32 warps, with no guaranteed order between the warps.

smemArray[threadIdx.x][threadIdx.y] = a[INDX( tileRow + threadIdx.y, tileCol + threadIdx.x, m )];

Note that because threadIdx.x appears in the second argument to INDX, consecutive threads are accessing consecutive elements in memory, and the read of a is perfectly coalesced.
The next step in the kernel is the call to the __syncthreads() function. This ensures that all threads in the thread block have completed their execution of the previous code before proceeding, and therefore that the copy of a into shared memory is completed before the next step. This is critically important because the next step will involve threads reading from shared memory. Without the __syncthreads() call, the read of a into shared memory would not be guaranteed to be completed by all the warps in the thread block before some warps advance further in the code.

At this point in the kernel, for each thread block, the smemArray has a 32 x 32 tile of the matrix, arranged in the same order as the original matrix. To ensure that the elements within the tile are transposed properly, threadIdx.x and threadIdx.y are swapped when they read smemArray. To ensure that the overall tile is placed in the correct place in c, the tileRow and tileCol indices are also swapped when they write to c. To ensure proper coalescing, threadIdx.x is used in the second argument to INDX, as shown by the statement below.

c[INDX( tileCol + threadIdx.y, tileRow + threadIdx.x, m )] = smemArray[threadIdx.y][threadIdx.x];
This kernel illustrates two common uses of shared memory.

▶ Shared memory is used to stage data from global memory to ensure that reads from and writes to global memory are both coalesced properly.

▶ Shared memory is used to allow threads in the same thread block to share data among themselves.
2.2.4.2.2 Shared Memory Bank Conflicts

In Section 2.2.4.2, the bank structure of shared memory was described. In the previous matrix transpose example, proper coalesced memory access to/from global memory was achieved, but no consideration was given to whether shared memory bank conflicts were present. Consider the following 2D shared memory declaration,

__shared__ float smemArray[32][32];

Since a warp is 32 threads, each thread in the same warp will have a fixed value for threadIdx.y and will have 0 <= threadIdx.x < 32.
The left panel of Figure 15 illustrates the situation when the threads in a warp access the data in a column of smemArray. Warp 0 is accessing memory locations smemArray[0][0] through smemArray[31][0]. In C++ multi-dimensional array ordering, the last index moves the fastest, so consecutive threads in warp 0 are accessing memory locations that are 32 elements apart. As illustrated in the figure, the colors denote the banks, and this access down the entire column by warp 0 results in a 32-way bank conflict.

The right panel of Figure 15 illustrates the situation when the threads in a warp access the data across a row of smemArray. Warp 0 is accessing memory locations smemArray[0][0] through smemArray[0][31]. In this case, consecutive threads in warp 0 are accessing memory locations that are adjacent. As illustrated in the figure, the colors denote the banks, and this access across the entire row by warp 0 results in no bank conflicts. The ideal scenario is for each thread in a warp to access a shared memory location with a different color.

Figure 15: Bank structure in a 32 x 32 shared memory array.
The numbers in the boxes indicate the warp index. The colors indicate which bank is associated with that shared memory location.
Returning to the example from Section 2.2.4.2.1, one can examine the usage of shared memory to determine whether bank conflicts are present. The first usage of shared memory is when data from global memory is stored to shared memory:

smemArray[threadIdx.x][threadIdx.y] = a[INDX( tileRow + threadIdx.y, tileCol + threadIdx.x, m )];

Because C++ arrays are stored in row-major order, consecutive threads in the same warp, as indicated by consecutive values of threadIdx.x, will access smemArray with a stride of 32 elements, because threadIdx.x is the first index into the array. This results in a 32-way bank conflict and is illustrated by the left panel of Figure 15.

The second usage of shared memory is when data from shared memory is written back to global memory:

c[INDX( tileCol + threadIdx.y, tileRow + threadIdx.x, m )] = smemArray[threadIdx.y][threadIdx.x];

In this case, because threadIdx.x is the second index into the smemArray array, consecutive threads in the same warp will access smemArray with a stride of 1 element. This results in no bank conflicts and is illustrated by the right panel of Figure 15.
The matrix transpose kernel as illustrated in Section 2.2.4.2.1 has one access of shared memory that has no bank conflicts and one access that has a 32-way bank conflict. A common fix to avoid bank conflicts is to pad the shared memory by adding one to the column dimension of the array as follows:

__shared__ float smemArray[THREADS_PER_BLOCK_X][THREADS_PER_BLOCK_Y+1];

This minor adjustment to the declaration of smemArray will eliminate the bank conflicts. To illustrate this, consider Figure 16 where the shared memory array has been declared with a size of 32 x 33. One observes that whether the threads in the same warp access the shared memory array down an entire column or across an entire row, the bank conflicts have been eliminated, i.e., the threads in the same warp access locations with different colors.

Figure 16: Bank structure in a 32 x 33 shared memory array.
The numbers in the boxes indicate the warp index. The colors indicate which bank is associated with that shared memory location.
2.2.5. Atomics

Performant CUDA kernels rely on expressing as much algorithmic parallelism as possible. The asynchronous nature of GPU kernel execution requires that threads operate as independently as possible. It's not always possible to have complete independence of threads, and as we saw in Shared Memory, there exists a mechanism for threads in the same thread block to exchange data and synchronize.

On the level of an entire grid there is no such mechanism to synchronize all threads in a grid. There is, however, a mechanism to provide synchronous access to global memory locations via the use of atomic functions. Atomic functions allow a thread to obtain a lock on a global memory location and perform a read-modify-write operation on that location. No other thread can access the same location while the lock is held. CUDA provides atomics with the same behavior as the C++ standard library atomics as cuda::std::atomic and cuda::std::atomic_ref. CUDA also provides extended C++ atomics cuda::atomic and cuda::atomic_ref which allow the user to specify the thread scope of the atomic operation. The details of atomic functions are covered in Atomic Functions.

An example usage of cuda::atomic_ref to perform a device-wide atomic addition is as follows, where array is an array of floats, and result is a float pointer to a location in global memory which is the location where the sum of the array will be stored.
__global__ void sumReduction(int n, float *array, float *result) {
    ...
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    cuda::atomic_ref<float, cuda::thread_scope_device> result_ref(*result);
    result_ref.fetch_add(array[tid]);
    ...
}
Atomic functions should be used sparingly as they enforce thread synchronization that can impact performance.
2.2.6. Cooperative Groups

Cooperative groups is a software tool available in CUDA C++ that allows applications to define groups of threads which can synchronize with each other, even if that group of threads spans multiple thread blocks, multiple grids on a single GPU, or even multiple GPUs. The CUDA programming model in general allows threads within a thread block or thread block cluster to synchronize efficiently, but does not provide a mechanism for specifying thread groups smaller than a thread block or cluster. Similarly, the CUDA programming model does not provide mechanisms or guarantees that enable synchronization across thread blocks.

Cooperative groups provide both of these capabilities through software. Cooperative groups allow the application to create thread groups that cross the boundary of thread blocks and clusters, though doing so comes with some semantic limitations and performance implications which are described in detail in the feature section covering cooperative groups.
2.2.7. Kernel Launch and Occupancy

When a CUDA kernel is launched, CUDA threads are grouped into thread blocks and a grid based on the execution configuration specified at kernel launch. Once the kernel is launched, the scheduler assigns thread blocks to SMs. The details of which thread blocks are scheduled to execute on which SMs cannot be controlled or queried by the application, and no ordering guarantees are made by the scheduler, so programs cannot rely on a specific scheduling order or scheme for correct execution.

The number of blocks that can be scheduled on an SM depends on the hardware resources a given thread block requires, and the hardware resources available on the SM. When a kernel is first launched, the scheduler begins assigning thread blocks to SMs. As long as SMs have sufficient hardware resources unoccupied by other thread blocks, the scheduler will continue assigning thread blocks to SMs. If at some point no SM has the capacity to accept another thread block, the scheduler will wait until the SMs complete previously assigned thread blocks. Once this happens, SMs are free to accept more work, and the scheduler assigns thread blocks to them. This process continues until all thread blocks have been scheduled and executed.
The cudaGetDeviceProperties function allows an application to query the limits of each SM via device properties. Note that there are limits per SM and per thread block.

▶ maxBlocksPerMultiProcessor: The maximum number of resident blocks per SM.

▶ sharedMemPerMultiprocessor: The amount of shared memory available per SM in bytes.

▶ regsPerMultiprocessor: The number of 32-bit registers available per SM.

▶ maxThreadsPerMultiProcessor: The maximum number of resident threads per SM.

▶ sharedMemPerBlock: The maximum amount of shared memory that can be allocated by a thread block in bytes.

▶ regsPerBlock: The maximum number of 32-bit registers that can be allocated by a thread block.

▶ maxThreadsPerBlock: The maximum number of threads per thread block.
The occupancy of a CUDA kernel is the ratio of the number of active warps to the maximum number of active warps supported by the SM. In general, it's a good practice to have occupancy as high as possible, which hides latency and increases performance.

To calculate occupancy, one needs to know the resource limits of the SM, which were just described, and one needs to know what resources are required by the CUDA kernel in question. To determine resource usage on a per-kernel basis, during program compilation one can use the --resource-usage option to nvcc, which will show the number of registers and the shared memory required by the kernel.

To illustrate, consider a device such as compute capability 10.0 with the device properties enumerated in Table 2.
Table 2: SM Resource Example

| Resource                    | Value  |
| --------------------------- | ------ |
| maxBlocksPerMultiProcessor  | 32     |
| sharedMemPerMultiprocessor  | 233472 |
| regsPerMultiprocessor       | 65536  |
| maxThreadsPerMultiProcessor | 2048   |
| sharedMemPerBlock           | 49152  |
| regsPerBlock                | 65536  |
| maxThreadsPerBlock          | 1024   |
If a kernel was launched as testKernel<<<512, 768>>>(), i.e., 768 threads per block, each SM would only be able to execute 2 thread blocks at a time. The scheduler cannot assign more than 2 thread blocks per SM because the maxThreadsPerMultiProcessor is 2048. So the occupancy would be (768*2)/2048, or 75%.
If a kernel was launched as testKernel<<<512, 32>>>(), i.e., 32 threads per block, each SM would not run into a limit on maxThreadsPerMultiProcessor, but since the maxBlocksPerMultiProcessor is 32, the scheduler would only be able to assign 32 thread blocks to each SM. Since the number of threads in the block is 32, the total number of threads resident on the SM would be 32 blocks * 32 threads per block, or 1024 total threads. Since a compute capability 10.0 SM has a maximum value of 2048 resident threads per SM, the occupancy in this case is 1024/2048, or 50%.

The same analysis can be done with shared memory. If a kernel uses 100 KB of shared memory, for example, the scheduler would only be able to assign 2 thread blocks to each SM, because the third thread block on that SM would require another 100 KB of shared memory for a total of 300 KB, which is more than the 233472 bytes available per SM.
Threads per block and shared memory usage per block are explicitly controlled by the programmer and can be adjusted to achieve the desired occupancy. The programmer has limited control over register usage as the compiler and runtime will attempt to optimize register usage. However, the programmer can specify a maximum number of registers per thread block via the --maxrregcount option to nvcc. If the kernel needs more registers than this specified amount, the kernel is likely to spill to local memory, which will change the performance characteristics of the kernel. In some cases, even though spilling occurs, limiting registers allows more thread blocks to be scheduled, which in turn increases occupancy and may result in a net increase in performance.
2.3. Asynchronous Execution

2.3.1. What is Asynchronous Concurrent Execution?

CUDA allows concurrent, or overlapping, execution of multiple tasks, specifically:

▶ computation on the host

▶ computation on the device

▶ memory transfers from the host to the device

▶ memory transfers from the device to the host

▶ memory transfers within the memory of a given device

▶ memory transfers among devices
The concurrency is expressed via an asynchronous interface, where a dispatching function call or kernel launch returns immediately. Asynchronous calls usually return before the dispatched operation has completed and may return before the asynchronous operation has started. The application is then free to perform other tasks at the same time as the originally dispatched operation. When the final results of the initially dispatched operation are needed, the application must perform some form of synchronization to ensure that the operation in question has completed. A typical example of a concurrent execution pattern is the overlapping of host and device memory transfers with computation, thus reducing or eliminating their overhead.

Figure 17: Asynchronous Concurrent Execution with CUDA streams
In general, asynchronous interfaces typically provide three main ways to synchronize with the dispatched operation:

▶ a blocking approach, where the application calls a function that blocks, or waits, until the operation has completed

▶ a non-blocking approach, or polling approach, where the application calls a function that returns immediately and supplies information about the status of the operation

▶ a callback approach, where a pre-registered function is executed when the operation has completed.
While the programming interfaces are asynchronous, the actual ability to carry out various operations concurrently will depend on the version of CUDA and the compute capability of the hardware being used; these details will be left to a later section of this guide (see Compute Capabilities).

In Synchronizing CPU and GPU, the CUDA runtime function cudaDeviceSynchronize() was introduced, which is a blocking call which waits for all previously issued work to complete. The reason the cudaDeviceSynchronize() call was needed is because the kernel launch is asynchronous and returns immediately. CUDA provides an API for both blocking and non-blocking approaches to synchronization and even supports the use of host-side callback functions.

The core API components for asynchronous execution in CUDA are CUDA Streams and CUDA Events. In the rest of this section we will explain how these elements can be used to express asynchronous execution in CUDA.

A related topic is that of CUDA Graphs, which allow a graph of asynchronous operations to be defined upfront, which can then be executed repeatedly with minimal overhead. We cover CUDA Graphs at a very introductory level in section 2.4.9.2 Introduction to CUDA Graphs with Stream Capture, and a more comprehensive discussion is provided in section 4.1 CUDA Graphs.
2.3.2. CUDA Streams

At the most basic level, a CUDA stream is an abstraction which allows the programmer to express a sequence of operations. A stream operates like a work queue into which programs can add operations, such as memory copies or kernel launches, to be executed in order. Operations at the front of the queue for a given stream are executed and then dequeued, allowing the next queued operation to come to the front and be considered for execution. The order of execution of operations in a stream is sequential, and the operations are executed in the order they are enqueued into the stream.

An application may use multiple streams simultaneously. In such cases, the runtime will select a task to execute from the streams that have work available depending on the state of the GPU resources. Streams may be assigned a priority which acts as a hint to the runtime to influence the scheduling, but does not guarantee a specific order of execution.

The API function calls and kernel launches operating in a stream are asynchronous with respect to the host thread. Applications can synchronize with a stream by waiting for it to be empty of tasks, or they can also synchronize at the device level.

CUDA has a default stream, and operations and kernel launches without a specific stream are queued into this default stream. Code examples which do not specify a stream are using this default stream implicitly. The default stream has some specific semantics which are discussed in subsection Blocking and non-blocking streams and the default stream.
2.3.2.1 Creating and Destroying CUDA Streams

CUDA streams can be created using the cudaStreamCreate() function. The function call initializes the stream handle which can be used to identify the stream in subsequent function calls.

cudaStream_t stream;        // Stream handle
cudaStreamCreate(&stream);  // Create a new stream

// stream based operations ...

cudaStreamDestroy(stream);  // Destroy the stream

If the device is still doing work in stream stream when the application calls cudaStreamDestroy(), the stream will complete all the work in the stream before being destroyed.
2.3.2.2 Launching Kernels in CUDA Streams

The usual triple-chevron syntax for launching a kernel can also be used to launch a kernel into a specific stream. The stream is specified as an extra parameter to the kernel launch. In the following example the kernel named kernel is launched into the stream with handle stream, which is of type cudaStream_t and has been assumed to have been created previously:

kernel<<<grid, block, shared_mem_size, stream>>>(...);

The kernel launch is asynchronous and the function call returns immediately. Assuming that the kernel launch is successful, the kernel will execute in the stream stream and the application is free to perform other tasks on the CPU or in other streams on the GPU while the kernel is executing.
| 2.3.2.3 LaunchingMemoryTransfersinCUDAStreams | |
To launch a memory transfer into a stream, we can use the function cudaMemcpyAsync(). This function is similar to the cudaMemcpy() function, but it takes an additional parameter specifying the stream to use for the memory transfer. The function call in the code block below copies size bytes from the host memory pointed to by src to the device memory pointed to by dst in the stream stream.

// Copy `size` bytes from `src` to `dst` in stream `stream`
cudaMemcpyAsync(dst, src, size, cudaMemcpyHostToDevice, stream);
Like other asynchronous function calls, this function call returns immediately, whereas the cudaMemcpy() function blocks until the memory transfer is complete. In order to access the results of the transfer safely, the application must determine that the operation has completed using some form of synchronization.

Other CUDA memory transfer functions such as cudaMemcpy2D() also have asynchronous variants.
Note

In order for memory copies involving CPU memory to be carried out asynchronously, the host buffers must be pinned (page-locked). cudaMemcpyAsync() will function correctly if host memory which is not pinned is used, but it will revert to synchronous behavior which will not overlap with other work. This can inhibit the performance benefits of using asynchronous memory transfers. It is recommended programs use cudaMallocHost() to allocate buffers which will be used to send or receive data from GPUs.
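As a minimal sketch of that recommendation (assuming a device buffer d_buf, a byte count size, and a stream stream already exist):

```cuda
float *h_buf;
// Allocate pinned (page-locked) host memory so that async copies
// can truly overlap with other work
cudaMallocHost(&h_buf, size);

// ... fill h_buf, then copy asynchronously ...
cudaMemcpyAsync(d_buf, h_buf, size, cudaMemcpyHostToDevice, stream);

// Pinned memory must be released with cudaFreeHost(), not free()
cudaStreamSynchronize(stream);
cudaFreeHost(h_buf);
```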
2.3.2.4 Stream Synchronization
The simplest way to synchronize with a stream is to wait for the stream to be empty of tasks. This can be done in two ways, using the cudaStreamSynchronize() function or the cudaStreamQuery() function.
The cudaStreamSynchronize() function will block until all the work in the stream has completed.

// Wait for the stream to be empty of tasks
cudaStreamSynchronize(stream);
// At this point the stream is done
// and we can access the results of stream operations safely
If we prefer not to block, but just need a quick check to see if the stream is empty, we can use the cudaStreamQuery() function.
// Have a peek at the stream
// returns cudaSuccess if the stream is empty
// returns cudaErrorNotReady if the stream is not empty
cudaError_t status = cudaStreamQuery(stream);
switch (status) {
case cudaSuccess:
    // The stream is empty
    std::cout << "The stream is empty" << std::endl;
    break;
case cudaErrorNotReady:
    // The stream is not empty
    std::cout << "The stream is not empty" << std::endl;
    break;
default:
    // An error occurred - we should handle this
    break;
}
2.3.3. CUDA Events
CUDA events are a mechanism for inserting markers into a CUDA stream. They are essentially like tracer particles that can be used to track the progress of tasks in a stream. Imagine launching two kernels into a stream. Without such tracking events, we would only be able to determine whether the stream is empty or not. If we had an operation that was dependent on the output of the first kernel, we would not be able to start that operation safely until we knew the stream was empty, by which time both kernels would have completed.
Using CUDA events we can do better. By enqueuing an event into a stream directly after the first kernel, but before the second kernel, we can wait for this event to come to the front of the stream. Then, we can safely start our dependent operation knowing that the first kernel has completed, but before the second kernel has started. Using CUDA events in this way can build up a graph of dependencies between operations and streams. This graph analogy translates directly into the later discussion of CUDA graphs.

CUDA events also keep time information which can be used to time kernel launches and memory transfers.
2.3.3.1 Creating and Destroying CUDA Events

CUDA events can be created and destroyed using the cudaEventCreate() and cudaEventDestroy() functions.
cudaEvent_t event;
// Create the event
cudaEventCreate(&event);

// do some work involving the event ...

// Once the work is done and the event is no longer needed
// we can destroy the event
cudaEventDestroy(event);
The application is responsible for destroying events when they are no longer needed.
2.3.3.2 Inserting Events into CUDA Streams

CUDA events can be inserted into a stream using the cudaEventRecord() function.
cudaEvent_t event;
cudaStream_t stream;
// Create the event
cudaEventCreate(&event);
// Insert the event into the stream
cudaEventRecord(event, stream);
2.3.3.3 Timing Operations in CUDA Streams

CUDA events can be used to time the execution of various stream operations including kernels. When an event reaches the front of a stream it records a timestamp. By surrounding a kernel in a stream with two events we can get an accurate timing of the duration of the kernel execution as is shown in the code snippet below:
cudaStream_t stream;
cudaStreamCreate(&stream);

cudaEvent_t start;
cudaEvent_t stop;
// create the events
cudaEventCreate(&start);
cudaEventCreate(&stop);

// record the start event
cudaEventRecord(start, stream);
// launch the kernel
kernel<<<grid, block, 0, stream>>>(...);
// record the stop event
cudaEventRecord(stop, stream);

// wait for the stream to complete
// both events will have been triggered
cudaStreamSynchronize(stream);

// get the timing
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);
std::cout << "Kernel execution time: " << elapsedTime << " ms" << std::endl;

// clean up
cudaEventDestroy(start);
cudaEventDestroy(stop);
cudaStreamDestroy(stream);
2.3.3.4 Checking the Status of CUDA Events

Like in the case of checking the status of streams, we can check the status of events in either a blocking or a non-blocking way.

The cudaEventSynchronize() function will block until the event has completed. In the code snippet below we launch a kernel into a stream, followed by an event and then by a second kernel. We can use the cudaEventSynchronize() function to wait for the event after the first kernel to complete and in principle launch a dependent task immediately, potentially before kernel2 finishes.
cudaEvent_t event;
cudaStream_t stream;
// create the stream
cudaStreamCreate(&stream);
// create the event
cudaEventCreate(&event);

// launch a kernel into the stream
kernel<<<grid, block, 0, stream>>>(...);
// Record the event
cudaEventRecord(event, stream);
// launch a kernel into the stream
kernel2<<<grid, block, 0, stream>>>(...);

// Wait for the event to complete
// Kernel 1 will be guaranteed to have completed
// and we can launch the dependent task.
cudaEventSynchronize(event);
dependentCPUtask();

// Wait for the stream to be empty
// Kernel 2 is guaranteed to have completed
cudaStreamSynchronize(stream);

// destroy the event
cudaEventDestroy(event);
// destroy the stream
cudaStreamDestroy(stream);
CUDA events can be checked for completion in a non-blocking way using the cudaEventQuery() function. In the example below we launch two kernels into a stream. The first kernel, kernel1, generates some data which we would like to copy to the host; however, we also have some CPU side work to do. In the code below, we enqueue kernel1 followed by an event (event) and then kernel2 into stream stream1. We then go into a CPU work loop, but occasionally take a peek to see if the event has completed, indicating that kernel1 is done. If so, we launch a device to host copy into stream stream2. This approach allows the overlap of the CPU work with the GPU kernel execution and the device to host copy.
cudaEvent_t event;
cudaStream_t stream1;
cudaStream_t stream2;
size_t size = LARGE_NUMBER;
float *d_data;

// Create some data
cudaMalloc(&d_data, size);
float *h_data = (float *)malloc(size);

// create the streams
cudaStreamCreate(&stream1);   // Processing stream
cudaStreamCreate(&stream2);   // Copying stream

bool copyStarted = false;

// create the event
cudaEventCreate(&event);

// launch kernel1 into the stream
kernel1<<<grid, block, 0, stream1>>>(d_data, size);
// enqueue an event following kernel1
cudaEventRecord(event, stream1);
// launch kernel2 into the stream
kernel2<<<grid, block, 0, stream1>>>();

// while the kernels are running do some work on the CPU
// but check if kernel1 has completed because then we will start
// a device to host copy in stream2
while ( not allCPUWorkDone() || not copyStarted ) {
    doNextChunkOfCPUWork();
    // peek to see if kernel 1 has completed
    // if so enqueue a non-blocking copy into stream2
    if ( not copyStarted ) {
        if ( cudaEventQuery(event) == cudaSuccess ) {
            cudaMemcpyAsync(h_data, d_data, size, cudaMemcpyDeviceToHost, stream2);
            copyStarted = true;
        }
    }
}

// wait for both streams to be done
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);

// destroy the event
cudaEventDestroy(event);
// destroy the streams and free the data
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
cudaFree(d_data);
free(h_data);
2.3.4. Callback Functions from Streams

CUDA provides a mechanism for launching functions on the host from within a stream. There are currently two functions available for this purpose: cudaLaunchHostFunc() and cudaStreamAddCallback(). However, cudaStreamAddCallback() is slated for deprecation, so applications should use cudaLaunchHostFunc().
Using cudaLaunchHostFunc()

The signature of the cudaLaunchHostFunc() function is as follows:

cudaError_t cudaLaunchHostFunc(cudaStream_t stream, void (*func)(void *), void *data);
where

▶ stream: The stream to launch the callback function into.
▶ func: The callback function to launch.
▶ data: A pointer to the data to pass to the callback function.
The host function itself is a simple C function with the signature:

void hostFunction(void *data);

with the data parameter pointing to a user defined data structure which the function can interpret.

There are some caveats to keep in mind when using callback functions like this. In particular, the host function may not call any CUDA APIs.
For the purposes of being used with unified memory, the following execution guarantees are provided:

▶ The stream is considered idle for the duration of the function’s execution. Thus, for example, the function may always use memory attached to the stream it was enqueued in.
▶ The start of execution of the function has the same effect as synchronizing an event recorded in the same stream immediately prior to the function. It thus synchronizes streams which have been “joined” prior to the function.
▶ Adding device work to any stream does not have the effect of making the stream active until all preceding host functions and stream callbacks have executed. Thus, for example, a function might use global attached memory even if work has been added to another stream, if the work has been ordered behind the function call with an event.
▶ Completion of the function does not cause a stream to become active except as described above. The stream will remain idle if no device work follows the function, and will remain idle across consecutive host functions or stream callbacks without device work in between. Thus, for example, stream synchronization can be done by signaling from a host function at the end of the stream.
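A minimal usage sketch may help tie the pieces together. Here Payload, myHostFn and the commented-out preceding kernel launch are illustrative placeholders, not part of the CUDA API:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

struct Payload { int step; };   // user-defined data passed via the void* parameter

// Matches the host-function signature described above; must not call CUDA APIs
void myHostFn(void *data) {
    Payload *p = static_cast<Payload *>(data);
    std::printf("host function ran after step %d\n", p->step);
}

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    static Payload p{1};        // must remain valid until the function runs
    // kernel<<<grid, block, 0, stream>>>(...);  // some preceding device work
    cudaLaunchHostFunc(stream, myHostFn, &p);    // runs after preceding work completes

    cudaStreamSynchronize(stream);  // by here the host function has executed
    cudaStreamDestroy(stream);
    return 0;
}
```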
2.3.4.1 Using cudaStreamAddCallback()

Note

The cudaStreamAddCallback() function is slated for deprecation and removal and is discussed here for completeness and because it may still appear in existing code. Applications should use or switch to using cudaLaunchHostFunc().
The signature of the cudaStreamAddCallback() function is as follows:

cudaError_t cudaStreamAddCallback(cudaStream_t stream, cudaStreamCallback_t callback, void* userData, unsigned int flags);
where

▶ stream: The stream to launch the callback function into.
▶ callback: The callback function to launch.
▶ userData: A pointer to the data to pass to the callback function.
▶ flags: Currently, this parameter must be 0 for future compatibility.
The signature of the callback function is a little different from the case when we used the cudaLaunchHostFunc() function. In this case the callback function is a C function with the signature:

void callbackFunction(cudaStream_t stream, cudaError_t status, void *userData);

where the function is now passed

▶ stream: The stream handle from which the callback function was launched.
▶ status: The status of the stream operation that triggered the callback.
▶ userData: A pointer to the data that was passed to the callback function.
In particular the status parameter will contain the current error status of the stream, which may have been set by previous operations. Similarly to the cudaLaunchHostFunc() case, the stream will not become active and advance to subsequent tasks until the host function has completed, and no CUDA functions may be called from within the callback function.
2.3.4.2 Asynchronous Error Handling

In a CUDA stream, errors may originate from any operation in the stream, including kernel launches and memory transfers. These errors may not be propagated back to the user at run-time until the stream is synchronized, for example, by waiting for an event or calling cudaStreamSynchronize(). There are two ways to find out about errors which may have occurred in a stream.
▶ Using the function cudaGetLastError() - this function returns and clears the last error encountered in any stream in the current context. An immediate second call to cudaGetLastError() would return cudaSuccess if no other error occurred between the two calls.
▶ Using the function cudaPeekAtLastError() - this function returns the last error in the current context, but does not clear it.
Both of these functions return the error as a value of type cudaError_t. Printable names of the errors can be generated using the functions cudaGetErrorName() and cudaGetErrorString(). An example of using these functions is shown below:

Listing 1: Example of using cudaGetLastError() and cudaPeekAtLastError()
// Some work occurs in streams.
cudaStreamSynchronize(stream);

// Look at the last error but do not clear it
cudaError_t err = cudaPeekAtLastError();
if (err != cudaSuccess) {
    printf("Error with name: %s\n", cudaGetErrorName(err));
    printf("Error description: %s\n", cudaGetErrorString(err));
}

// Look at the last error and clear it
cudaError_t err2 = cudaGetLastError();
if (err2 != cudaSuccess) {
    printf("Error with name: %s\n", cudaGetErrorName(err2));
    printf("Error description: %s\n", cudaGetErrorString(err2));
}

if (err2 == err) {
    printf("As expected, cudaPeekAtLastError() did not clear the error\n");
}

// Check again
cudaError_t err3 = cudaGetLastError();
if (err3 == cudaSuccess) {
    printf("As expected, cudaGetLastError() cleared the error\n");
}
Tip

When an error appears at a synchronization, especially in a stream with many operations, it is often difficult to pinpoint exactly where in the stream the error may have occurred. To debug such a situation a useful trick may be to set the environment variable CUDA_LAUNCH_BLOCKING=1 and then run the application. The effect of this environment variable is to synchronize after every single kernel launch. This can aid in tracking down which kernel or transfer caused the error. Synchronization can be expensive; applications may run substantially slower when this environment variable is set.
2.3.5. CUDA Stream Ordering

Now that we have discussed the basic mechanisms of streams, events and callback functions it is important to consider the ordering semantics of asynchronous operations in a stream. These semantics allow application programmers to think about the ordering of operations in a stream in a safe way. There are some special cases where these semantics may be relaxed for purposes of performance optimization, such as in the case of a Programmatic Dependent Kernel Launch scenario, which allows the overlap of two kernels through the use of special attributes and kernel launch mechanisms, or in the case of batching memory transfers using the cudaMemcpyBatchAsync() function when the runtime can perform non-overlapping batch copies concurrently. We will discuss these optimizations later on [link needed].
Most importantly, CUDA streams are what are known as in-order streams. This means that the order of execution of the operations in a stream is the same as the order in which those operations were enqueued. An operation in a stream cannot leap-frog other operations. Memory operations (such as copies) are tracked by the runtime and will always complete before the next operation in order to allow dependent kernels safe access to the data being transferred.
2.3.6. Blocking and non-blocking streams and the default stream

In CUDA there are two types of streams: blocking and non-blocking. The name can be a little misleading as the blocking and non-blocking semantics refer only to how the streams synchronize with the default stream. By default, streams created with cudaStreamCreate() are blocking streams. In order to create a non-blocking stream, the cudaStreamCreateWithFlags() function must be used with the cudaStreamNonBlocking flag:

cudaStream_t stream;
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

and non-blocking streams can be destroyed in the usual way with cudaStreamDestroy().
2.3.6.1 Legacy Default Stream

The key difference between the blocking and non-blocking streams is how they synchronize with the default stream. CUDA provides a legacy default stream (also known as the NULL stream or the stream with stream ID 0) which is used when no stream is specified in kernel launches or in blocking cudaMemcpy() calls. This default stream, which is shared amongst all host threads, is a blocking stream. When an operation is launched into this default stream, it will synchronize with all other blocking streams, in other words it will wait for all other blocking streams to complete before it can execute.

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

kernel1<<<grid, block, 0, stream1>>>(...);
kernel2<<<grid, block>>>(...);
kernel3<<<grid, block, 0, stream2>>>(...);

cudaDeviceSynchronize();
The default stream behavior means that in the code snippet above, kernel2 will wait for kernel1 to complete, and kernel3 will wait for kernel2 to complete, even if in principle all three kernels could execute concurrently. By creating non-blocking streams we can avoid this synchronization behavior. In the code snippet below we create two non-blocking streams. The default stream will no longer synchronize with these streams and in principle all three kernels could execute concurrently. As such we cannot assume any ordering of execution of the kernels and should perform explicit synchronization (such as with the rather heavy handed cudaDeviceSynchronize() call) in order to ensure that the kernels have completed.

cudaStream_t stream1, stream2;
cudaStreamCreateWithFlags(&stream1, cudaStreamNonBlocking);
cudaStreamCreateWithFlags(&stream2, cudaStreamNonBlocking);

kernel1<<<grid, block, 0, stream1>>>(...);
kernel2<<<grid, block>>>(...);
kernel3<<<grid, block, 0, stream2>>>(...);

cudaDeviceSynchronize();
2.3.6.2 Per-thread Default Stream

Starting in CUDA 7, CUDA allows each host thread to have its own independent default stream, rather than the shared legacy default stream. In order to enable this behavior one must either use the nvcc compiler option --default-stream per-thread or define the CUDA_API_PER_THREAD_DEFAULT_STREAM preprocessor macro. When this behavior is enabled, each host thread will have its own independent default stream which will not synchronize with other streams in the same way the legacy default stream does. In such a situation the legacy default stream example will now exhibit the same synchronization behavior as the non-blocking stream example.
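The two enabling mechanisms can be sketched as compile commands (app.cu and app are placeholder names):

```shell
# Option 1: the nvcc compiler flag
nvcc --default-stream per-thread app.cu -o app

# Option 2: define the macro before any CUDA header is included
nvcc -DCUDA_API_PER_THREAD_DEFAULT_STREAM app.cu -o app
```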
2.3.7. Explicit Synchronization

There are various ways to explicitly synchronize streams with each other.

cudaDeviceSynchronize() waits until all preceding commands in all streams of all host threads have completed.

cudaStreamSynchronize() takes a stream as a parameter and waits until all preceding commands in the given stream have completed. It can be used to synchronize the host with a specific stream, allowing other streams to continue executing on the device.

cudaStreamWaitEvent() takes a stream and an event as parameters (see CUDA Events for a description of events) and makes all the commands added to the given stream after the call to cudaStreamWaitEvent() delay their execution until the given event has completed.

cudaStreamQuery() provides applications with a way to know if all preceding commands in a stream have completed.
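A minimal sketch of cudaStreamWaitEvent() ordering work across two streams (kernelA, kernelB, grid, block, stream1 and stream2 are assumed placeholders):

```cuda
// kernelB in stream2 must not start before kernelA in stream1 has finished
cudaEvent_t done;
cudaEventCreate(&done);

kernelA<<<grid, block, 0, stream1>>>(...);
cudaEventRecord(done, stream1);      // marks the completion point of kernelA

// Everything added to stream2 after this call waits for `done`
cudaStreamWaitEvent(stream2, done, 0);
kernelB<<<grid, block, 0, stream2>>>(...);

// The host is not blocked by cudaStreamWaitEvent(); only stream2 waits
cudaEventDestroy(done);
```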
2.3.8. Implicit Synchronization

Two operations from different streams cannot run concurrently if any CUDA operation on the NULL stream is submitted in-between them, unless the streams are non-blocking streams (created with the cudaStreamNonBlocking flag).

Applications should follow these guidelines to improve their potential for concurrent kernel execution:

▶ All independent operations should be issued before dependent operations,
▶ Synchronization of any kind should be delayed as long as possible.
2.3.9. Miscellaneous and Advanced Topics

2.3.9.1 Stream Prioritization

As mentioned previously, developers can assign priorities to CUDA streams. Prioritized streams need to be created using the cudaStreamCreateWithPriority() function. The function takes two parameters: the stream handle and the priority level. The general scheme is that lower numbers correspond to higher priorities. The priority range for a given device and context can be queried using the cudaDeviceGetStreamPriorityRange() function. The default priority of a stream is 0.
int minPriority, maxPriority;
// Query the priority range for the device
cudaDeviceGetStreamPriorityRange(&minPriority, &maxPriority);

// Create two streams with different priorities
// cudaStreamDefault indicates the stream should be created with default flags
// in other words they will be blocking streams with respect to the legacy default stream
// One could also use the option `cudaStreamNonBlocking` here to create non-blocking streams
cudaStream_t stream1, stream2;
cudaStreamCreateWithPriority(&stream1, cudaStreamDefault, minPriority); // Lowest priority
cudaStreamCreateWithPriority(&stream2, cudaStreamDefault, maxPriority); // Highest priority
We should note that the priority of a stream is only a hint to the runtime. It generally applies primarily to kernel launches, and may not be respected for memory transfers. Stream priorities will not preempt already executing work, or guarantee any specific execution order.
2.3.9.2 Introduction to CUDA Graphs with Stream Capture

CUDA streams allow programs to specify a sequence of operations, kernels or memory copies, in order. Using multiple streams and cross-stream dependencies with cudaStreamWaitEvent, an application can specify a full directed acyclic graph (DAG) of operations. Some applications may have a sequence or DAG of operations that needs to be run many times throughout execution.

For this situation, CUDA provides a feature known as CUDA graphs. This section introduces CUDA graphs and one mechanism of creating them called stream capture. A more detailed discussion of CUDA graphs is presented in CUDA Graphs. Capturing or creating a graph can help reduce latency and CPU overhead of repeatedly invoking the same chain of API calls from the host thread. Instead, the APIs to specify the graph operations can be called once, and then the resulting graph executed many times.
CUDA Graphs work in the following way:

i) The graph is captured by the application. This step is done once, the first time the graph is executed. The graph can also be manually composed using the CUDA graph API.

ii) The graph is instantiated. This step is done one time, after the graph is captured. This step can set up all the various runtime structures needed to execute the graph, in order to make launching its components as fast as possible.

iii) In the remaining steps, the pre-instantiated graph is executed as many times as required. Since all the runtime structures needed to execute the graph operations are already in place, the CPU overheads of the graph execution are minimized.
Listing 2: The stages of capturing, instantiating and executing a simple linear graph using CUDA Graphs (from CUDA Developer Technical Blog, A. Gray, 2019)
#define N 500000 // tuned such that kernel takes a few microseconds

// A very lightweight kernel
__global__ void shortKernel(float * out_d, float * in_d){
    int idx=blockIdx.x*blockDim.x+threadIdx.x;
    if(idx<N) out_d[idx]=1.23*in_d[idx];
}

bool graphCreated=false;
cudaGraph_t graph;
cudaGraphExec_t instance;
// The graph will be executed NSTEP times
for(int istep=0; istep<NSTEP; istep++){
    if(!graphCreated){
        // Capture the graph
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        // Launch NKERNEL kernels
        for(int ikrnl=0; ikrnl<NKERNEL; ikrnl++){
            shortKernel<<<blocks, threads, 0, stream>>>(out_d, in_d);
        }
        // End the capture
        cudaStreamEndCapture(stream, &graph);
        // Instantiate the graph
        cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);
        graphCreated=true;
    }
    // Launch the graph
    cudaGraphLaunch(instance, stream);
    // Synchronize the stream
    cudaStreamSynchronize(stream);
}
Much more detail on CUDA graphs is provided in CUDA Graphs.
2.3.10. Summary of Asynchronous Execution

The key points of this section are:

▶ Asynchronous APIs allow us to express concurrent execution of tasks, providing the way to express overlapping of various operations. The actual concurrency achieved depends on available hardware resources and compute capabilities.
▶ The key abstractions in CUDA for asynchronous execution are streams, events, and callback functions.
▶ Synchronization is possible at the event, stream, and device level.
▶ The default stream is a blocking stream which synchronizes with all other blocking streams, but does not synchronize with non-blocking streams.
▶ The default stream behavior can be avoided using per-thread default streams via the --default-stream per-thread compiler option or the CUDA_API_PER_THREAD_DEFAULT_STREAM preprocessor macro.
▶ Streams can be created with different priorities, which are hints to the runtime and may not be respected for memory transfers.
▶ CUDA provides API functions to reduce, or overlap, the overheads of kernel launches and memory transfers, such as CUDA Graphs, Batched Memory Transfers, and Programmatic Dependent Kernel Launch.
2.4. Unified and System Memory

Heterogeneous systems have multiple physical memories where data can be stored. The host CPU has attached DRAM, and every GPU in a system has its own attached DRAM. Performance is best when data is resident in the memory of the processor accessing it. CUDA provides APIs to explicitly manage memory placement, but this can be verbose and complicate software design. CUDA provides features and capabilities aimed at easing allocation, placement, and migration of data between different physical memories.

The purpose of this chapter is to introduce and explain these features and what they mean to application developers for both functionality and performance. Unified memory has several different manifestations which depend upon the OS, driver version, and GPU used. This chapter will show how to determine which unified memory paradigm applies and how the features of unified memory behave in each. The later chapter on unified memory explains unified memory in more detail.
The following concepts will be defined and explained in this chapter:

▶ Unified Virtual Address Space - CPU memory and each GPU's memory have a distinct range within a single virtual address space
▶ Unified Memory - A CUDA feature that enables managed memory which can be automatically migrated between CPU and GPUs
▶ Limited Unified Memory - A unified memory paradigm with some limitations
▶ Full Unified Memory - Full support for unified memory features
▶ Full Unified Memory with Hardware Coherency - Full support for unified memory using hardware capabilities
▶ Unified memory hints - APIs to guide unified memory behavior for specific allocations
▶ Page-locked Host Memory - Non-pageable system memory, which is necessary for some CUDA operations
▶ Mapped memory - A mechanism (different from unified memory) for accessing host memory directly from a kernel
Additionally, the following terms used when discussing unified and system memory are introduced here:

▶ Heterogeneous Memory Management (HMM) - A feature of the Linux kernel that enables software coherency for full unified memory
▶ Address Translation Services (ATS) - A hardware feature, available when GPUs are connected to the CPU by the NVLink Chip-to-Chip (C2C) interconnect, which provides hardware coherency for full unified memory
2.4.1. Unified Virtual Address Space

A single virtual address space is used for all host memory and all global memory on all GPUs in the system within a single OS process. All memory allocations on the host and on all devices lie in this virtual address space. This is true whether allocations are made with CUDA APIs (e.g. cudaMalloc, cudaMallocHost) or with system allocation APIs (e.g. new, malloc, mmap). The CPU and each GPU have a unique range within the unified virtual address space.
This means:

▶ The location of any memory (that is, CPU or which GPU's memory it lies in) can be determined from the value of a pointer using cudaPointerGetAttributes()
▶ The cudaMemcpyKind parameter of cudaMemcpy*() can be set to cudaMemcpyDefault to automatically determine the copy type from the pointers
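As an illustrative sketch of both points (not from the guide; error checking omitted), the following queries a pointer's location and lets cudaMemcpyDefault infer the copy direction:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    size_t bytes = 1024 * sizeof(float);
    float *host_p = (float*)malloc(bytes);
    float *dev_p = nullptr;
    cudaMalloc(&dev_p, bytes);

    // Determine where dev_p points from its value alone.
    cudaPointerAttributes attr;
    cudaPointerGetAttributes(&attr, dev_p);
    printf("dev_p is device memory on device %d\n", attr.device);

    // The copy kind (host-to-device here) is inferred from the pointers.
    cudaMemcpy(dev_p, host_p, bytes, cudaMemcpyDefault);

    cudaFree(dev_p);
    free(host_p);
    return 0;
}
```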
2.4.2. Unified Memory

Unified memory is a CUDA memory feature which allows memory allocations called managed memory to be accessed from code running on either the CPU or the GPU. Unified memory was shown in the intro to CUDA in C++. Unified memory is available on all systems supported by CUDA.
On some systems, managed memory must be explicitly allocated. Managed memory can be explicitly allocated in CUDA in a few different ways:

▶ The CUDA API cudaMallocManaged
▶ The CUDA API cudaMallocFromPoolAsync with a pool created with allocType set to cudaMemAllocationTypeManaged
▶ Global variables with the __managed__ specifier (see Memory Space Specifiers)

On systems with HMM or ATS, all system memory is implicitly managed memory, regardless of how it is allocated. No special allocation is needed.
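As a minimal sketch of the first and last of these mechanisms (not from the guide; error checking omitted), both allocations below are visible to host and device code:

```cuda
#include <cuda_runtime.h>

__managed__ int counter = 0;   // managed global variable

__global__ void touch(float *data) { data[0] += counter; }

int main() {
    float *data = nullptr;
    cudaMallocManaged(&data, 1024 * sizeof(float));

    data[0] = 1.0f;   // first touch on the CPU
    counter = 41;
    touch<<<1, 1>>>(data);   // the same pointer is valid on the GPU
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}
```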
2.4.2.1 Unified Memory Paradigms

The features and behavior of unified memory vary between operating systems, kernel versions on Linux, GPU hardware, and the GPU-CPU interconnect. The form of unified memory available can be determined by using cudaDeviceGetAttribute to query a few attributes:

▶ cudaDevAttrConcurrentManagedAccess - 1 for full unified memory support, 0 for limited support
▶ cudaDevAttrPageableMemoryAccess - 1 means all system memory is fully-supported unified memory, 0 means only memory explicitly allocated as managed memory is fully-supported unified memory
▶ cudaDevAttrPageableMemoryAccessUsesHostPageTables - Indicates the mechanism of CPU/GPU coherence: 1 is hardware, 0 is software.

Figure 18 illustrates how to determine the unified memory paradigm visually and is followed by a code sample implementing the same logic.
There are four paradigms of unified memory operation:

▶ Full support for explicit managed memory allocations
▶ Full support for all allocations with software coherence
▶ Full support for all allocations with hardware coherence
▶ Limited unified memory support

When full support is available, it can either require explicit allocations, or all system memory may implicitly be unified memory. When all memory is implicitly unified, the coherence mechanism can either be software or hardware. Windows and some Tegra devices have limited support for unified memory.
Figure 18: All current GPUs use a unified virtual address space and have unified memory available. When cudaDevAttrConcurrentManagedAccess is 1, full unified memory support is available; otherwise only limited support is available. When full support is available, if cudaDevAttrPageableMemoryAccess is also 1, then all system memory is unified memory. Otherwise, only memory allocated with CUDA APIs (such as cudaMallocManaged) is unified memory. When all system memory is unified, cudaDevAttrPageableMemoryAccessUsesHostPageTables indicates whether coherence is provided by hardware (when the value is 1) or software (when the value is 0).

Table 3 shows the same information as Figure 18 as a table with links to the relevant sections of this chapter and more complete documentation in a later section of this guide.
Table 3: Overview of Unified Memory Paradigms

▶ Limited unified memory support
  Device attributes: cudaDevAttrConcurrentManagedAccess is 0
  Full documentation: Unified Memory on Windows, WSL, and Tegra; CUDA for Tegra Memory Management; Unified memory on Tegra

▶ Full support for explicit managed memory allocations
  Device attributes: cudaDevAttrPageableMemoryAccess is 0 and cudaDevAttrConcurrentManagedAccess is 1
  Full documentation: Unified Memory on Devices with only CUDA Managed Memory Support

▶ Full support for all allocations with software coherence
  Device attributes: cudaDevAttrPageableMemoryAccessUsesHostPageTables is 0, cudaDevAttrPageableMemoryAccess is 1, and cudaDevAttrConcurrentManagedAccess is 1
  Full documentation: Unified Memory on Devices with Full CUDA Unified Memory Support

▶ Full support for all allocations with hardware coherence
  Device attributes: cudaDevAttrPageableMemoryAccessUsesHostPageTables is 1, cudaDevAttrPageableMemoryAccess is 1, and cudaDevAttrConcurrentManagedAccess is 1
  Full documentation: Unified Memory on Devices with Full CUDA Unified Memory Support
2.4.2.1.1 Unified Memory Paradigm: Code Example

The following code example demonstrates querying the device attributes and determining the unified memory paradigm, following the logic of Figure 18, for each GPU in a system.

void queryDevices()
{
    int numDevices = 0;
    cudaGetDeviceCount(&numDevices);
    for(int i=0; i<numDevices; i++)
    {
        cudaSetDevice(i);
        cudaInitDevice(0, 0, 0);
        int deviceId = i;
        int concurrentManagedAccess = -1;
        cudaDeviceGetAttribute(&concurrentManagedAccess,
            cudaDevAttrConcurrentManagedAccess, deviceId);
        int pageableMemoryAccess = -1;
        cudaDeviceGetAttribute(&pageableMemoryAccess,
            cudaDevAttrPageableMemoryAccess, deviceId);
        int pageableMemoryAccessUsesHostPageTables = -1;
        cudaDeviceGetAttribute(&pageableMemoryAccessUsesHostPageTables,
            cudaDevAttrPageableMemoryAccessUsesHostPageTables, deviceId);

        printf("Device %d has ", deviceId);
        if(concurrentManagedAccess){
            if(pageableMemoryAccess){
                printf("full unified memory support");
                if(pageableMemoryAccessUsesHostPageTables)
                    { printf(" with hardware coherency\n"); }
                else
                    { printf(" with software coherency\n"); }
            }
            else
                { printf("full unified memory support for CUDA-made managed allocations\n"); }
        }
        else
            { printf("limited unified memory support: Windows, WSL, or Tegra\n"); }
    }
}
2.4.2.2 Full Unified Memory Feature Support

Most Linux systems have full unified memory support. If device attribute cudaDevAttrPageableMemoryAccess is 1, then all system memory, whether allocated by CUDA APIs or system APIs, operates as unified memory with full feature support. This includes file-backed memory allocations created with mmap.

If cudaDevAttrPageableMemoryAccess is 0, then only memory allocated as managed memory by CUDA behaves as unified memory. Memory allocated with system APIs is not managed and is not necessarily accessible from GPU kernels.
In general, for unified allocations with full support:

▶ Managed memory is usually allocated in the memory space of the processor where it is first touched
▶ Managed memory is usually migrated when it is used by a processor other than the processor where it currently resides
▶ Managed memory is migrated or accessed at the granularity of memory pages (software coherence) or cache lines (hardware coherence)
▶ Oversubscription is allowed: an application may allocate more managed memory than is physically available on the GPU

Allocation and migration behavior can deviate from the above. This can be influenced by the programmer using hints and prefetches. Full coverage of full unified memory support can be found in Unified Memory on Devices with Full CUDA Unified Memory Support.
2.4.2.2.1 Full Unified Memory with Hardware Coherency

On hardware such as Grace Hopper and Grace Blackwell, where an NVIDIA CPU is used and the interconnect between the CPU and GPU is NVLink Chip-to-Chip (C2C), address translation services (ATS) are available. cudaDevAttrPageableMemoryAccessUsesHostPageTables is 1 when ATS is available.

With ATS, in addition to full unified memory support for all host allocations:

▶ GPU allocations (e.g. cudaMalloc) can be accessed from the CPU (cudaDevAttrDirectManagedMemAccessFromHost will be 1)
▶ The link between CPU and GPU supports native atomics (cudaDevAttrHostNativeAtomicSupported will be 1)
▶ Hardware support for coherence can improve performance compared to software coherence

ATS provides all capabilities of HMM. When ATS is available, HMM is automatically disabled. Further discussion of hardware vs. software coherency is found in CPU and GPU Page Tables: Hardware Coherency vs. Software Coherency.
2.4.2.2.2 HMM - Full Unified Memory with Software Coherency

Heterogeneous Memory Management (HMM) is a feature available on Linux operating systems (with appropriate kernel versions) which enables software-coherent full unified memory support. Heterogeneous memory management brings some of the capabilities and convenience provided by ATS to PCIe-connected GPUs.

On Linux with at least kernel 6.1.24, 6.2.11, or 6.3 or later, heterogeneous memory management (HMM) may be available. The following command can be used to check whether the addressing mode is HMM:
$ nvidia-smi -q | grep Addressing
        Addressing Mode : HMM
When HMM is available, full unified memory is supported and all system allocations are implicitly unified memory. If a system also has ATS, HMM is disabled and ATS is used, since ATS provides all the capabilities of HMM and more.
2.4.2.3 Limited Unified Memory Support

On Windows, including Windows Subsystem for Linux (WSL), and on some Tegra systems, a limited subset of unified memory functionality is available. On these systems, managed memory is available, but migration between CPU and GPUs behaves differently:

▶ Managed memory is first allocated in the CPU's physical memory
▶ Managed memory is migrated in larger granularity than virtual memory pages
▶ Managed memory is migrated to the GPU when the GPU begins executing
▶ The CPU must not access managed memory while the GPU is active
▶ Managed memory is migrated back to the CPU when the GPU is synchronized
▶ Oversubscription of GPU memory is not allowed
▶ Only memory explicitly allocated by CUDA as managed memory is unified

Full coverage of this paradigm can be found in Unified Memory on Windows, WSL, and Tegra.
2.4.2.4 Memory Advise and Prefetch

The programmer can provide hints to the NVIDIA driver managing unified memory to help it maximize application performance. The CUDA API cudaMemAdvise allows the programmer to specify properties of allocations that affect where they are placed and whether or not the memory is migrated when accessed from another device.

cudaMemPrefetchAsync allows the programmer to request that an asynchronous migration of a specific allocation to a different location be started. A common use is starting the transfer of data a kernel will use before the kernel is launched. This enables the copy of data to occur while other GPU kernels are executing.

The section on Performance Hints covers the different hints that can be passed to cudaMemAdvise and shows examples of using cudaMemPrefetchAsync.
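As a minimal sketch of both APIs together (not from the guide; error checking omitted, and the advice/stream choices are illustrative):

```cuda
#include <cuda_runtime.h>

int main() {
    int device = 0;
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float *data = nullptr;
    size_t bytes = 1 << 20;
    cudaMallocManaged(&data, bytes);

    // Hint: this data will mostly be read, so read-only copies may be
    // kept on several processors at once.
    cudaMemAdvise(data, bytes, cudaMemAdviseSetReadMostly, device);

    // Start migrating the data to the GPU before the kernel that reads
    // it is launched, overlapping the transfer with other work.
    cudaMemPrefetchAsync(data, bytes, device, stream);

    // ... launch kernels that read `data` on `stream` here ...

    cudaStreamSynchronize(stream);
    cudaFree(data);
    cudaStreamDestroy(stream);
    return 0;
}
```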
2.4.3. Page-Locked Host Memory

In introductory code examples, cudaMallocHost was used to allocate memory on the CPU. This allocates page-locked memory (also known as pinned memory) on the host. Host allocations made through traditional allocation mechanisms like malloc, new, or mmap are not page-locked, which means they may be swapped to disk or physically relocated by the operating system.

Page-locked host memory is required for asynchronous copies between the CPU and GPU. Page-locked host memory also improves performance of synchronous copies. Page-locked memory can be mapped to the GPU for direct access from GPU kernels.

The CUDA runtime provides APIs to allocate page-locked host memory or to page-lock existing allocations:

▶ cudaMallocHost allocates page-locked host memory
▶ cudaHostAlloc defaults to the same behavior as cudaMallocHost, but also takes flags to specify other memory parameters
▶ cudaFreeHost frees memory allocated with cudaMallocHost or cudaHostAlloc
▶ cudaHostRegister page-locks a range of existing memory allocated outside the CUDA API, such as with malloc or mmap

cudaHostRegister enables host memory allocated by 3rd-party libraries or other code outside of a developer's control to be page-locked so that it can be used in asynchronous copies or mapped.
Note

Page-locked host memory can be used for asynchronous copies and mapped memory by all GPUs in the system.

Page-locked host memory is not cached on non-I/O-coherent Tegra devices. Also, cudaHostRegister() is not supported on non-I/O-coherent Tegra devices.
2.4.3.1 Mapped Memory

On systems with HMM or ATS, all host memory is directly accessible from the GPU using the host pointers. When ATS or HMM are not available, host allocations can be made accessible to the GPU by mapping the memory into the GPU's memory space. Mapped memory is always page-locked.

The code examples which follow illustrate the following array copy kernel operating directly on mapped host memory.
__global__ void copyKernel(float* a, float* b)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    a[idx] = b[idx];
}
While mapped memory may be useful in some cases where certain data which is not copied to the GPU needs to be accessed from a kernel, accessing mapped memory in a kernel requires transactions across the CPU-GPU interconnect, PCIe or NVLink C2C. These operations have higher latency and lower bandwidth compared to accessing device memory. Mapped memory should not be considered a performant alternative to unified memory or explicit memory management for the majority of a kernel's memory needs.
2.4.3.1.1 cudaMallocHost and cudaHostAlloc

Host memory allocated with cudaMallocHost or cudaHostAlloc is automatically mapped. The pointers returned by these APIs can be directly used in kernel code to access the memory on the host. The host memory is accessed over the CPU-GPU interconnect.
cudaMallocHost

void usingMallocHost() {
    float* a = nullptr;
    float* b = nullptr;
    CUDA_CHECK(cudaMallocHost(&a, vLen*sizeof(float)));
    CUDA_CHECK(cudaMallocHost(&b, vLen*sizeof(float)));

    initVector(b, vLen);
    memset(a, 0, vLen*sizeof(float));

    int threads = 256;
    int blocks = vLen/threads;
    copyKernel<<<blocks, threads>>>(a, b);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    printf("Using cudaMallocHost: ");
    checkAnswer(a, b);
}
cudaHostAlloc

void usingCudaHostAlloc() {
    float* a = nullptr;
    float* b = nullptr;
    CUDA_CHECK(cudaHostAlloc(&a, vLen*sizeof(float), cudaHostAllocMapped));
    CUDA_CHECK(cudaHostAlloc(&b, vLen*sizeof(float), cudaHostAllocMapped));

    initVector(b, vLen);
    memset(a, 0, vLen*sizeof(float));

    int threads = 256;
    int blocks = vLen/threads;
    copyKernel<<<blocks, threads>>>(a, b);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    printf("Using cudaHostAlloc: ");
    checkAnswer(a, b);
}
2.4.3.1.2 cudaHostRegister

When ATS and HMM are not available, allocations made by system allocators can still be mapped for access directly from GPU kernels using cudaHostRegister. Unlike memory created with CUDA APIs, however, the memory cannot be accessed from the kernel using the host pointer. A pointer in the device's memory region must be obtained using cudaHostGetDevicePointer(), and that pointer must be used for accesses in kernel code.
void usingRegister() {
    float* a = nullptr;
    float* b = nullptr;
    float* devA = nullptr;
    float* devB = nullptr;
    a = (float*)malloc(vLen*sizeof(float));
    b = (float*)malloc(vLen*sizeof(float));
    CUDA_CHECK(cudaHostRegister(a, vLen*sizeof(float), 0));
    CUDA_CHECK(cudaHostRegister(b, vLen*sizeof(float), 0));
    CUDA_CHECK(cudaHostGetDevicePointer((void**)&devA, (void*)a, 0));
    CUDA_CHECK(cudaHostGetDevicePointer((void**)&devB, (void*)b, 0));

    initVector(b, vLen);
    memset(a, 0, vLen*sizeof(float));

    int threads = 256;
    int blocks = vLen/threads;
    copyKernel<<<blocks, threads>>>(devA, devB);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());
    printf("Using cudaHostRegister: ");
    checkAnswer(a, b);
}
2.4.3.1.3 Comparing Unified Memory and Mapped Memory

Mapped memory makes CPU memory accessible from the GPU, but does not guarantee that all types of access, for example atomics, are supported on all systems. Unified memory guarantees that all access types are supported.

Mapped memory remains in CPU memory, which means all GPU accesses must go through the connection between the CPU and GPU: PCIe or NVLink. The latency of accesses made across these links is significantly higher than access to GPU memory, and total available bandwidth is lower. As such, using mapped memory for all kernel memory accesses is unlikely to fully utilize GPU computing resources.

Unified memory is most often migrated to the physical memory of the processor accessing it. After the first migration, repeated access to the same memory page or cache line by a kernel can utilize the full GPU memory bandwidth.
Note

Mapped memory has also been referred to as zero-copy memory in previous documents.

Prior to all CUDA applications using a unified virtual address space, additional APIs were needed to enable memory mapping (cudaSetDeviceFlags with cudaDeviceMapHost). These APIs are no longer needed.

Atomic functions (see Atomic Functions) operating on mapped host memory are not atomic from the point of view of the host or other GPUs.

The CUDA runtime requires that 1-byte, 2-byte, 4-byte, 8-byte, and 16-byte naturally aligned loads and stores to host memory initiated from the device are preserved as single accesses from the point of view of the host and other devices. On some platforms, atomics to memory may be broken by the hardware into separate load and store operations. These component load and store operations have the same requirements on preservation of naturally aligned accesses. The CUDA runtime does not support a PCI Express bus topology where a PCI Express bridge splits 8-byte naturally aligned operations, and NVIDIA is not aware of any topology that splits 16-byte naturally aligned operations.
2.4.4. Summary

▶ On Linux platforms with heterogeneous memory management (HMM) or address translation services (ATS), all system-allocated memory is managed memory
▶ On Linux platforms without HMM or ATS, on Tegra processors, and on all Windows platforms, managed memory must be allocated using CUDA:
  ▶ cudaMallocManaged or
  ▶ cudaMallocFromPoolAsync with a pool created with allocType=cudaMemAllocationTypeManaged
  ▶ Global variables with the __managed__ specifier
▶ On Windows and Tegra processors, unified memory has limitations
▶ On NVLink C2C connected systems with ATS, device memory allocated with cudaMalloc can be directly accessed from the CPU or other GPUs
2.5. NVCC: The NVIDIA CUDA Compiler

The NVIDIA CUDA Compiler nvcc is a toolchain from NVIDIA for compiling CUDA C/C++ as well as PTX code. The toolchain is part of the CUDA Toolkit and consists of several tools, including the compiler, linker, and the PTX and Cubin assemblers. The top-level nvcc tool coordinates the compilation process, invoking the appropriate tool for each stage of compilation.

nvcc drives offline compilation of CUDA code, in contrast to online or Just-in-Time (JIT) compilation driven by the CUDA runtime compiler nvrtc.

This chapter covers the most common uses and details of nvcc needed for building applications. Full coverage of nvcc is found in the nvcc documentation.
2.5.1. CUDA Source Files and Headers

Source files compiled with nvcc may contain a combination of host code, which executes on the CPU, and device code that executes on the GPU. nvcc accepts the common C/C++ source file extensions .c, .cpp, .cc, .cxx for host-only code and .cu for files that contain device code or a mix of host and device code. Headers containing device code typically adopt the .cuh extension to distinguish them from host-only code headers (.h, .hpp, .hh, .hxx, etc.).
| File Extension      | Description       | Content                                         |
| ------------------- | ----------------- | ----------------------------------------------- |
| .c                  | C source file     | Host-only code                                  |
| .cpp, .cc, .cxx     | C++ source file   | Host-only code                                  |
| .h, .hpp, .hh, .hxx | C/C++ header file | Device code, host code, mix of host/device code |
| .cu                 | CUDA source file  | Device code, host code, mix of host/device code |
| .cuh                | CUDA header file  | Device code, host code, mix of host/device code |
2.5.2. NVCC Compilation Workflow

In the initial phase, nvcc separates the device code from the host code and dispatches their compilation to the GPU and the host compilers, respectively.

To compile the host code, the CUDA compiler nvcc requires a compatible host compiler to be available. The CUDA Toolkit defines the host compiler support policy for Linux and Windows platforms.

Files containing only host code can be built using either nvcc or the host compiler directly. The resulting object files can be combined with object files from nvcc which contain GPU code at link time.

The GPU compiler compiles the C/C++ device code to PTX assembly code. The GPU compiler is run
| foreachvirtualmachineinstructionsetarchitecture(e.g. compute_90)specifiedinthecompilation | |
| commandline. | |
| IndividualPTXcodeisthenpassedtotheptxastool,whichgeneratesCubinforthetargethardware | |
| ISAs. ThehardwareISAisidentifiedbyitsSMversion. | |
| ItispossibletoembedmultiplePTXandCubintargetsintoasinglebinaryFatbincontainerwithinan | |
| applicationorlibrarysothatasinglebinarycansupportmultiplevirtualandtargethardwareISAs. | |
The invocation and coordination of the tools described above are done automatically by nvcc. The -v option can be used to display the full compilation workflow and tool invocation. The -keep option can be used to save the intermediate files generated during the compilation in the current directory, or in the directory specified by --keep-dir instead.
The following example illustrates the compilation workflow for a CUDA source file example.cu:
// ----- example.cu -----
#include <stdio.h>

__global__ void kernel() {
    printf("Hello from kernel\n");
}

void kernel_launcher() {
    kernel<<<1, 1>>>();
    cudaDeviceSynchronize();
}

int main() {
    kernel_launcher();
    return 0;
}
nvcc basic compilation workflow:
nvcc compilation workflow with multiple PTX and Cubin architectures:
A more detailed description of the nvcc compilation workflow can be found in the compiler documentation.
2.5.3. NVCC Basic Usage
The basic command to compile a CUDA source file with nvcc is:

nvcc <source_file>.cu -o <output_file>

nvcc accepts common compiler flags used for specifying include directories (-I <path>) and library paths (-L <path>), linking against other libraries (-l<library>), and defining macros (-D<macro>=<value>).

nvcc example.cu -I path_to_include/ -L path_to_library/ -lcublas -o <output_file>
2.5.3.1 NVCC PTX and Cubin Generation
By default, nvcc generates PTX and Cubin for the earliest GPU architecture (lowest compute_XY and sm_XY version) supported by the CUDA Toolkit to maximize compatibility.
▶ The -arch option can be used to generate PTX and Cubin for a specific GPU architecture.
▶ The -gencode option can be used to generate PTX and Cubin for multiple GPU architectures.
The complete list of supported real and virtual GPU architectures can be obtained by passing the --list-gpu-code and --list-gpu-arch flags, respectively, or by referring to the Virtual Architecture List and the GPU Architecture List sections within the nvcc documentation.
nvcc --list-gpu-code                 # list all supported real GPU architectures
nvcc --list-gpu-arch                 # list all supported virtual GPU architectures

nvcc example.cu -arch=compute_<XY>   # e.g. -arch=compute_80 for NVIDIA Ampere GPUs and later
                                     # PTX-only, GPU forward compatible
nvcc example.cu -arch=sm_<XY>        # e.g. -arch=sm_80 for NVIDIA Ampere GPUs and later
                                     # PTX and Cubin, GPU forward compatible
nvcc example.cu -arch=native         # automatically detects and generates Cubin for the current GPU
                                     # no PTX, no GPU forward compatibility
nvcc example.cu -arch=all            # generate Cubin for all supported GPU architectures
                                     # also includes the latest PTX for GPU forward compatibility
nvcc example.cu -arch=all-major      # generate Cubin for all major supported GPU architectures, e.g. sm_80, sm_90
                                     # also includes the latest PTX for GPU forward compatibility
More advanced usage allows PTX and Cubin targets to be specified individually:

# generate PTX for virtual architecture compute_80 and compile it to Cubin for
# real architecture sm_86, keep compute_80 PTX
nvcc example.cu -arch=compute_80 -gpu-code=sm_86,compute_80    # (PTX and Cubin)

# generate PTX for virtual architecture compute_80 and compile it to Cubin for
# real architectures sm_86, sm_89
nvcc example.cu -arch=compute_80 -gpu-code=sm_86,sm_89         # (no PTX)
nvcc example.cu -gencode=arch=compute_80,code=[sm_86,sm_89]    # same as above

# (1) generate PTX for virtual architecture compute_80 and compile it to Cubin
# for real architectures sm_86, sm_89
# (2) generate PTX for virtual architecture compute_90 and compile it to Cubin
# for real architecture sm_90
nvcc example.cu -gencode=arch=compute_80,code=[sm_86,sm_89] -gencode=arch=compute_90,code=sm_90
The full reference of nvcc command-line options for steering GPU code generation can be found in the nvcc documentation.
2.5.3.2 Host Code Compilation Notes
Compilation units, namely a source file and its headers, that do not contain device code or symbols can be compiled directly with a host compiler. If any compilation unit uses CUDA runtime API functions, the application must be linked with the CUDA runtime library. The CUDA runtime is available as both a static and a shared library, libcudart_static and libcudart, respectively. By default, nvcc links against the static CUDA runtime library. To use the shared library version of the CUDA runtime, pass the flag --cudart=shared to nvcc on the compile or link command.
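For illustration, a host-only translation unit that calls CUDA runtime API functions could be built with the host compiler and linked against the shared runtime directly; the include and library paths below are typical Linux install locations and may differ on your system:

```shell
# build with the host compiler, linking the shared CUDA runtime
g++ main.cpp -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcudart -o app

# or let nvcc drive the link and select the shared runtime explicitly
nvcc main.o --cudart=shared -o app
```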
nvcc allows the host compiler used for host functions to be specified via the -ccbin <compiler> argument. The environment variable NVCC_CCBIN can also be defined to specify the host compiler used by nvcc. The -Xcompiler argument to nvcc passes through arguments to the host compiler. For example, in the example below, the -O3 argument is passed to the host compiler by nvcc.

nvcc example.cu -ccbin=clang++
export NVCC_CCBIN='gcc'
nvcc example.cu -Xcompiler=-O3
2.5.3.3 Separate Compilation of GPU Code
nvcc defaults to whole-program compilation, which expects all GPU code and symbols to be present in the compilation unit that uses them. CUDA device functions may call device functions or access device variables defined in other compilation units, but either the -rdc=true flag or its alias -dc must be specified on the nvcc command line to enable linking of device code from different compilation units. The ability to link device code and symbols from different compilation units is called separate compilation.
Separate compilation allows more flexible code organization, can improve compile time, and can lead to smaller binaries. Separate compilation may involve some build-time complexity compared to whole-program compilation. Performance can be affected by the use of device code linking, which is why it is not used by default. Link-Time Optimization (LTO) can help reduce the performance overhead of separate compilation.
Separate compilation requires the following conditions:
▶ Non-const device variables defined in one compilation unit must be referred to with the extern keyword in other compilation units.
▶ All const device variables must be defined and referred to with the extern keyword.
▶ All CUDA source files (.cu) must be compiled with the -dc or -rdc=true flags.
Host and device functions have external linkage by default and do not require the extern keyword. Note that starting from CUDA 13, __global__ functions and __managed__/__device__/__constant__ variables have internal linkage by default.
In the following example, definition.cu defines a variable and a function, while example.cu refers to them. Both files are compiled separately and linked into the final binary.

// ----- definition.cu -----
__device__ int device_variable = 5;
__device__ int device_function() { return 10; }

// ----- example.cu -----
extern __device__ int device_variable;
__device__ int device_function();

__global__ void kernel(int* ptr) {
    device_variable = 0;
    *ptr = device_function();
}

nvcc -dc definition.cu -o definition.o
nvcc -dc example.cu -o example.o
nvcc definition.o example.o -o program
2.5.4. Common Compiler Options
This section presents the most relevant compiler options that can be used with nvcc, covering language features, optimization, debugging, profiling, and build aspects. The full description of all options can be found in the nvcc documentation.
2.5.4.1 Language Features
nvcc supports the C++ core language features, from C++03 to C++20. The -std flag can be used to specify the language standard to use:
▶ --std={c++03|c++11|c++14|c++17|c++20}
In addition, nvcc supports the following language extensions:
▶ -restrict: Assert that all kernel pointer parameters are restrict pointers.
▶ -extended-lambda: Allow __host__, __device__ annotations in lambda declarations.
▶ -expt-relaxed-constexpr: (Experimental flag) Allow host code to invoke __device__ constexpr functions, and device code to invoke __host__ constexpr functions.
More detail on these features can be found in the extended lambda and constexpr sections.
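As a brief sketch of the -extended-lambda extension (an illustrative example, compiled with nvcc -extended-lambda; the kernel and lambda names are hypothetical), a lambda annotated __host__ __device__ can be called on the host and also passed to a kernel:

```cpp
#include <cstdio>

// generic kernel that applies a callable on the device
template <typename F>
__global__ void apply_kernel(F f, int x) {
    printf("device: %d\n", f(x));
}

int main() {
    // the __host__ __device__ annotations require -extended-lambda
    auto square = [] __host__ __device__ (int x) { return x * x; };
    printf("host: %d\n", square(3));    // callable on the host
    apply_kernel<<<1, 1>>>(square, 3);  // and passable to a kernel
    cudaDeviceSynchronize();
    return 0;
}
```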
2.5.4.2 Debugging Options
nvcc supports the following options to generate debug information:
▶ -g: Generate debug information for host code. gdb/lldb and similar tools rely on such information for host-code debugging.
▶ -G: Generate debug information for device code. cuda-gdb relies on such information for device-code debugging. The flag also defines the __CUDACC_DEBUG__ macro.
▶ -lineinfo: Generate line-number information for device code. This option does not affect execution performance and is useful in conjunction with the compute-sanitizer tool to trace the kernel execution.
nvcc uses the highest optimization level -O3 for GPU code by default. The -G debug flag prevents some compiler optimizations, and so debug code is expected to have lower performance than non-debug code. The -DNDEBUG flag can be defined to disable runtime assertions, as these can also slow down execution.
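Putting these options together, a typical debug build and a release-style build might look like the following (illustrative command lines):

```shell
# debug build: host (-g) and device (-G) debug info, assertions enabled
nvcc -g -G example.cu -o example_debug

# release-style build: line info for tools, runtime assertions disabled
nvcc -lineinfo -DNDEBUG example.cu -o example_release
```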
2.5.4.3 Optimization Options
nvcc provides many options for optimizing performance. This section aims to provide a brief survey of some of the options available that developers may find useful, as well as links to further information. Complete coverage can be found in the nvcc documentation.
▶ -Xptxas passes arguments to the PTX assembler tool ptxas. The nvcc documentation provides a list of useful arguments for ptxas. For example, -Xptxas=-maxrregcount=N specifies the maximum number of registers to use, per thread.
▶ -extra-device-vectorization: Enables more aggressive device code vectorization.
▶ Additional flags which provide fine-grained control over floating point behavior are covered in the Floating-Point Computation section and in the nvcc documentation.
The following flags get output from the compiler which can be useful in more advanced code optimization:
▶ -res-usage: Print a resource usage report after compilation. It includes the number of registers, shared memory, constant memory, and local memory allocated for each kernel function.
▶ -opt-info=inline: Print information about inlined functions.
▶ -Xptxas=-warn-lmem-usage: Warn if local memory is used.
▶ -Xptxas=-warn-spills: Warn if registers are spilled to local memory.
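For example, the diagnostic flags above can be combined in a single compile to surface register pressure and local-memory issues (illustrative command line):

```shell
# report per-kernel resource usage and warn about spills and local memory use
nvcc example.cu -res-usage -opt-info=inline \
     -Xptxas=-warn-spills -Xptxas=-warn-lmem-usage -o example
```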
2.5.4.4 Link-Time Optimization (LTO)
Separate compilation can result in lower performance than whole-program compilation due to limited cross-file optimization opportunities. Link-Time Optimization (LTO) addresses this by performing optimizations across separately compiled files at link time, at the cost of increased compilation time. LTO can recover much of the performance of whole-program compilation while maintaining the flexibility of separate compilation.
nvcc requires the -dlto flag or lto_<SM version> link-time optimization targets to enable LTO:

nvcc -dc -dlto -arch=sm_100 definition.cu -o definition.o
nvcc -dc -dlto -arch=sm_100 example.cu -o example.o
nvcc -dlto definition.o example.o -o program

nvcc -dc -arch=lto_100 definition.cu -o definition.o
nvcc -dc -arch=lto_100 example.cu -o example.o
nvcc -dlto definition.o example.o -o program
2.5.4.5 Profiling Options
It is possible to directly profile a CUDA application using the Nsight Compute and Nsight Systems tools without the need for additional flags during the compilation process. However, additional information which can be generated by nvcc can assist profiling by correlating source files with the generated code:
▶ -lineinfo: Generate line-number information for device code; this allows viewing the source code in the profiling tools. Profiling tools require the original source code to be available in the same location where the code was compiled.
▶ -src-in-ptx: Keep the original source code in the PTX, avoiding the limitations of -lineinfo mentioned above. Requires -lineinfo.
| | 86 | | | | Chapter2. | ProgrammingGPUsinCUDA | | |
| CUDAProgrammingGuide,Release13.1 | |
2.5.4.6 Fatbin Compression
nvcc compresses the fatbins stored in application or library binaries by default. Fatbin compression can be controlled using the following options:
▶ -no-compress: Disable the compression of the fatbin.
▶ --compress-mode={default|size|speed|balance|none}: Set the compression mode. speed focuses on fast decompression time, while size aims at reducing the fatbin size. balance provides a trade-off between speed and size. The default mode is speed. none disables compression.
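For instance, a fatbin that embeds many architectures can trade decompression speed for a smaller binary (illustrative command line):

```shell
# embed all supported architectures, optimizing the fatbin for size
nvcc example.cu -arch=all --compress-mode=size -o example
```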
2.5.4.7 Compiler Performance Controls
nvcc provides options to analyze and accelerate the compilation process itself:
▶ -t <N>: The number of CPU threads used to parallelize the compilation of a single compilation unit for multiple GPU architectures.
▶ -split-compile <N>: The number of CPU threads used to parallelize the optimization phase.
▶ -split-compile-extended <N>: More aggressive form of split compilation. Requires link-time optimization.
▶ -Ofc <N>: Level of device code compilation speed.
▶ -time <filename>: Generate a comma-separated value (CSV) table with the time taken by each compilation phase.
▶ -fdevice-time-trace: Generate a time trace for device code compilation.
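For example (illustrative command line; -t 0 asks nvcc to use as many threads as there are CPUs on the machine):

```shell
# compile for several architectures in parallel and record per-phase timings
nvcc example.cu -arch=all-major -t 0 -time compile_times.csv -o example
```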
Chapter 3. Advanced CUDA
3.1. Advanced CUDA APIs and Features
This section will cover use of more advanced CUDA APIs and features. These topics cover techniques or features that do not usually require CUDA kernel modifications, but can still influence, from the host side, application-level behavior, both in terms of GPU work execution and performance as well as CPU-side performance.
3.1.1. cudaLaunchKernelEx
When the triple chevron notation was introduced in the first versions of CUDA, the kernel configuration of a kernel launch had only four programmable parameters:
▶ thread block dimensions
▶ grid dimensions
▶ dynamic shared memory (optional, 0 if unspecified)
▶ stream (default stream used if unspecified)
Some CUDA features can benefit from additional attributes and hints provided with a kernel launch. The cudaLaunchKernelEx API enables a program to set the above-mentioned execution configuration parameters via the cudaLaunchConfig_t structure. In addition, the cudaLaunchConfig_t structure allows the program to pass in zero or more cudaLaunchAttributes to control or suggest other parameters for the kernel launch. For example, the cudaLaunchAttributePreferredSharedMemoryCarveout attribute discussed later in this chapter (see Configuring L1/Shared Memory Balance) is specified using cudaLaunchKernelEx. The cudaLaunchAttributeClusterDimension attribute, discussed later in this chapter, is used to specify the desired cluster size for the kernel launch.
The complete list of supported attributes and their meaning is captured in the CUDA Runtime API Reference Documentation.
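As a minimal sketch (the kernel name my_kernel and the launch geometry are hypothetical), a launch through cudaLaunchKernelEx that attaches a shared-memory carveout hint might look like:

```cpp
__global__ void my_kernel(float* data) { /* ... */ }

void launch(float* data, cudaStream_t stream) {
    cudaLaunchConfig_t config = {0};
    config.gridDim = dim3(128);
    config.blockDim = dim3(256);
    config.dynamicSmemBytes = 0;  // no dynamic shared memory
    config.stream = stream;

    // Hint: prefer 50% of the combined L1/shared storage as shared memory
    cudaLaunchAttribute attrs[1];
    attrs[0].id = cudaLaunchAttributePreferredSharedMemoryCarveout;
    attrs[0].val.sharedMemCarveout = 50;
    config.attrs = attrs;
    config.numAttrs = 1;

    cudaLaunchKernelEx(&config, my_kernel, data);
}
```

Kernel arguments follow the function pointer directly, so the call site stays close to a triple-chevron launch while gaining the attribute mechanism.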
3.1.2. Launching Clusters
Thread block clusters, introduced in previous sections, are an optional level of thread block organization available in compute capability 9.0 and higher which enables applications to guarantee that the thread blocks of a cluster are simultaneously executed on a single GPC. This enables larger groups of threads than those that fit in a single SM to exchange data and synchronize with each other.
Section 2.1.10.1 showed how a kernel which uses clusters can be specified and launched using triple chevron notation. In that section, the __cluster_dims__ annotation was used to specify the dimensions of the cluster which must be used to launch the kernel. When using triple chevron notation, the size of the clusters is determined implicitly.
3.1.2.1 Launching with Clusters using cudaLaunchKernelEx
Unlike launching kernels using clusters with triple chevron notation, the size of the thread block cluster can be configured on a per-launch basis. The code example below shows how to launch a cluster kernel using cudaLaunchKernelEx.

// Kernel definition
// No compile time attribute attached to the kernel
__global__ void cluster_kernel(float *input, float* output)
{
}

int main()
{
    float *input, *output;
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);

    // Kernel invocation with runtime cluster size
    {
        cudaLaunchConfig_t config = {0};
        // The grid dimension is not affected by cluster launch, and is still
        // enumerated using number of blocks.
        // The grid dimension should be a multiple of cluster size.
        config.gridDim = numBlocks;
        config.blockDim = threadsPerBlock;

        cudaLaunchAttribute attribute[1];
        attribute[0].id = cudaLaunchAttributeClusterDimension;
        attribute[0].val.clusterDim.x = 2; // Cluster size in X-dimension
        attribute[0].val.clusterDim.y = 1;
        attribute[0].val.clusterDim.z = 1;
        config.attrs = attribute;
        config.numAttrs = 1;

        cudaLaunchKernelEx(&config, cluster_kernel, input, output);
    }
}
There are two cudaLaunchAttribute types which are relevant to thread block clusters: cudaLaunchAttributeClusterDimension and cudaLaunchAttributePreferredClusterDimension.
The attribute id cudaLaunchAttributeClusterDimension specifies the required dimensions with which to execute the cluster. The value for this attribute, clusterDim, is a 3-dimensional value. The corresponding dimensions of the grid (x, y, and z) must be divisible by the respective dimensions of the specified cluster dimension. Setting this is similar to using the __cluster_dims__ attribute on the kernel definition at compile time as shown in Launching with Clusters in Triple Chevron Notation, but can be changed at runtime for different launches of the same kernel.
On GPUs with compute capability 10.0 and higher, another attribute id, cudaLaunchAttributePreferredClusterDimension, allows the application to additionally specify a preferred dimension for the cluster. The preferred dimension must be an integer multiple of the minimum cluster dimensions specified by the __cluster_dims__ attribute on the kernel or the cudaLaunchAttributeClusterDimension attribute to cudaLaunchKernelEx. That is, a minimum cluster dimension must be specified in addition to the preferred cluster dimension. The corresponding dimensions of the grid (x, y, and z) must be divisible by the respective dimension of the specified preferred cluster dimension.
All thread blocks will execute in clusters of at least the minimum cluster dimension. Where possible, clusters of the preferred dimension will be used, but not all clusters are guaranteed to execute with the preferred dimensions. All thread blocks will execute in clusters with either the minimum or preferred cluster dimension. Kernels which use a preferred cluster dimension must be written to operate correctly in either the minimum or the preferred cluster dimension.
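A sketch of combining a required and a preferred cluster dimension in one launch (assumes a GPU of compute capability 10.0 or higher; the kernel name and sizes are hypothetical, and the preferredClusterDim value field should be checked against the CUDA Runtime API reference for your toolkit version):

```cpp
__global__ void cluster_kernel(float* data) { /* ... */ }

void launch(float* data) {
    cudaLaunchConfig_t config = {0};
    config.gridDim = dim3(32);   // divisible by both cluster sizes below
    config.blockDim = dim3(256);

    cudaLaunchAttribute attrs[2];
    // minimum (required) cluster dimension
    attrs[0].id = cudaLaunchAttributeClusterDimension;
    attrs[0].val.clusterDim = {2, 1, 1};
    // preferred cluster dimension: an integer multiple of the minimum
    attrs[1].id = cudaLaunchAttributePreferredClusterDimension;
    attrs[1].val.preferredClusterDim = {4, 1, 1};
    config.attrs = attrs;
    config.numAttrs = 2;

    cudaLaunchKernelEx(&config, cluster_kernel, data);
}
```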
3.1.2.2 Blocks as Clusters
When a kernel is defined with the __cluster_dims__ annotation, the number of clusters in the grid is implicit and can be calculated from the size of the grid divided by the specified cluster size.

__cluster_dims__((2, 2, 2)) __global__ void foo();

// 8x8x8 clusters each with 2x2x2 thread blocks.
foo<<<dim3(16, 16, 16), dim3(1024, 1, 1)>>>();

In the above example, the kernel is launched as a grid of 16x16x16 thread blocks, which means a grid of 8x8x8 clusters is used.
A kernel can alternatively use the __block_size__ annotation, which specifies both the required block size and cluster size at the time the kernel is defined. When this annotation is used, the triple chevron launch takes the grid dimension in terms of clusters rather than thread blocks, as shown below.

// Implementation detail: how many threads per block and blocks per cluster
// is handled as an attribute of the kernel.
__block_size__((1024, 1, 1), (2, 2, 2)) __global__ void foo();

// 8x8x8 clusters.
foo<<<dim3(8, 8, 8)>>>();

__block_size__ requires two fields, each being a tuple of 3 elements. The first tuple denotes the block dimension and the second the cluster size. The second tuple is assumed to be (1, 1, 1) if it is not passed. To specify the stream, one must pass 1 and 0 as the second and third arguments within <<<>>>, and lastly the stream. Passing other values would lead to undefined behavior.
Note that it is illegal for the second tuple of __block_size__ and __cluster_dims__ to be specified at the same time. It is also illegal to use __block_size__ with an empty __cluster_dims__. When the second tuple of __block_size__ is specified, it implies that "Blocks as Clusters" is enabled, and the compiler recognizes the first argument inside <<<>>> as the number of clusters instead of thread blocks.
3.1.3. More on Streams and Events
CUDA Streams introduced the basics of CUDA streams. By default, operations submitted on a given CUDA stream are serialized: one cannot start executing until the previous one has completed. The only exception is the recently added Programmatic Dependent Launch and Synchronization feature. Having multiple CUDA streams is a way to enable concurrent execution; another way is using CUDA Graphs. The two approaches can also be combined.
Work submitted on different CUDA streams may execute concurrently under specific circumstances, e.g., if there are no event dependencies, if there is no implicit synchronization, if there are sufficient resources, etc.
Independent operations from different CUDA streams cannot run concurrently if any CUDA operation on the NULL stream is submitted in between them, unless the streams are non-blocking CUDA streams. These are streams created with the cudaStreamCreateWithFlags() runtime API with the cudaStreamNonBlocking flag. To improve the potential for concurrent GPU work execution, it is recommended that the user creates non-blocking CUDA streams.
It is also recommended that the user selects the least general synchronization option that is sufficient for their problem. For example, if the requirement is for the CPU to wait (block) for all work on a specific CUDA stream to complete, using cudaStreamSynchronize() for that stream would be preferable to cudaDeviceSynchronize(), as the latter would unnecessarily wait for GPU work on all CUDA streams of the device to complete. And if the requirement is for the CPU to wait without blocking, then using cudaStreamQuery() and checking its return value, in a polling loop, may be preferable.
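A non-blocking wait can be sketched with cudaStreamQuery() in a polling loop (do_cpu_work is a hypothetical placeholder for useful host-side work):

```cpp
void do_cpu_work();  // hypothetical host-side work to overlap with the wait

// Poll the stream while doing other CPU work, without blocking the host.
void wait_without_blocking(cudaStream_t stream) {
    while (cudaStreamQuery(stream) == cudaErrorNotReady) {
        do_cpu_work();
    }
    // At this point the stream has completed, or an error occurred;
    // check the final return value of cudaStreamQuery() to distinguish.
}
```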
A similar synchronization effect can also be achieved with CUDA events (CUDA Events), e.g., by recording an event on that stream and calling cudaEventSynchronize() to wait, in a blocking manner, for the work captured in that event to complete. Again, this would be preferable and more focused than using cudaDeviceSynchronize(). Calling cudaEventQuery() and checking its return value, e.g., in a polling loop, would be a non-blocking alternative.
The choice of the explicit synchronization method is particularly important if this operation happens in the application's critical path. Table 4 provides a high-level summary of various synchronization options with the host.
Table 4: Summary of explicit synchronization options with the host

|                                          | Wait for specific stream | Wait for specific event | Wait for everything on the device |
| ---------------------------------------- | ------------------------ | ----------------------- | --------------------------------- |
| Non-blocking (would need a polling loop) | cudaStreamQuery()        | cudaEventQuery()        | N/A                               |
| Blocking                                 | cudaStreamSynchronize()  | cudaEventSynchronize()  | cudaDeviceSynchronize()           |
For synchronization, i.e., to express dependencies, between CUDA streams, use of non-timing CUDA events is recommended, as described in CUDA Events. A user can call cudaStreamWaitEvent() to force future submitted operations on a specific stream to wait for the completion of a previously recorded event (e.g., on another stream). Note that for any CUDA API waiting on or querying an event, it is the responsibility of the user to ensure the cudaEventRecord API has already been called, as a non-recorded event will always return success.
CUDA events carry, by default, timing information, as they can be used in cudaEventElapsedTime() API calls. However, a CUDA event that is solely used to express dependencies across streams does not need timing information. For such cases, it is recommended to create events with timing information disabled for improved performance. This is possible using the cudaEventCreateWithFlags() API with the cudaEventDisableTiming flag.
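Putting the above together, a minimal sketch of a cross-stream dependency using a non-timing event (the producer and consumer kernels and launch geometry are hypothetical):

```cpp
__global__ void producer(float* data) { /* ... */ }
__global__ void consumer(float* data) { /* ... */ }

void run(float* data, cudaStream_t s1, cudaStream_t s2) {
    // event used only for ordering: disable timing for lower overhead
    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

    producer<<<128, 256, 0, s1>>>(data);
    cudaEventRecord(done, s1);            // capture the work submitted on s1
    cudaStreamWaitEvent(s2, done, 0);     // s2 waits for the event
    consumer<<<128, 256, 0, s2>>>(data);  // runs only after producer completes

    cudaEventDestroy(done);
}
```

Neither call blocks the host; the dependency is enforced entirely on the device.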
3.1.3.1 Stream Priorities
The relative priorities of streams can be specified at creation time using cudaStreamCreateWithPriority(). The range of allowable priorities, ordered as [greatest priority, least priority], can be obtained using the cudaDeviceGetStreamPriorityRange() function. At runtime, the GPU scheduler utilizes stream priorities to determine task execution order, but these priorities serve as hints rather than guarantees. When selecting work to launch, pending tasks in higher-priority streams take precedence over those in lower-priority streams. Higher-priority tasks do not preempt already running lower-priority tasks. The GPU does not reassess work queues during task execution, and increasing a stream's priority will not interrupt ongoing work. Stream priorities thus influence task execution without enforcing strict ordering, and users should not rely on them for ordering guarantees.
The following code sample obtains the allowable range of priorities for the current device, and creates two non-blocking CUDA streams with the highest and lowest available priorities.

// get the range of stream priorities for this device
int leastPriority, greatestPriority;
cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);
// create streams with highest and lowest available priorities
cudaStream_t st_high, st_low;
cudaStreamCreateWithPriority(&st_high, cudaStreamNonBlocking, greatestPriority);
cudaStreamCreateWithPriority(&st_low, cudaStreamNonBlocking, leastPriority);
3.1.3.2 Explicit Synchronization

As previously outlined, there are a number of ways that streams can synchronize with other streams. The following are common methods at different levels of granularity:
▶ cudaDeviceSynchronize() waits until all preceding commands in all streams of all host threads have completed.
▶ cudaStreamSynchronize() takes a stream as a parameter and waits until all preceding commands in the given stream have completed. It can be used to synchronize the host with a specific stream, allowing other streams to continue executing on the device.
▶ cudaStreamWaitEvent() takes a stream and an event as parameters (see CUDA Events for a description of events) and makes all the commands added to the given stream after the call to cudaStreamWaitEvent() delay their execution until the given event has completed.
▶ cudaStreamQuery() provides applications with a way to know if all preceding commands in a stream have completed.
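As an illustration of the event-based method, the sketch below (using hypothetical streams producer and consumer and hypothetical kernels) makes commands in one stream wait on work recorded in another, without blocking the host:

```cpp
// Hypothetical sketch: make stream `consumer` wait for work in stream `producer`.
cudaEvent_t done;
cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

producer_kernel<<<grid, block, 0, producer>>>();   // work the consumer depends on
cudaEventRecord(done, producer);                   // mark the dependency point

// Commands added to `consumer` after this call wait until `done` completes;
// the host thread does not block, and other streams keep running.
cudaStreamWaitEvent(consumer, done, 0);
consumer_kernel<<<grid, block, 0, consumer>>>();   // runs only after producer_kernel

cudaEventDestroy(done);   // destruction is deferred until the event completes
```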
3.1.3.3 Implicit Synchronization

Two commands from different streams cannot run concurrently if any one of the following operations is issued in-between them by the host thread:
▶ a page-locked host memory allocation
▶ a device memory allocation
▶ a device memory set
▶ a memory copy between two addresses to the same device memory
▶ any CUDA command to the NULL stream
▶ a switch between the L1/shared memory configurations
3.1. Advanced CUDA APIs and Features
CUDA Programming Guide, Release 13.1
Operations that require a dependency check include any other commands within the same stream as the launch being checked and any call to cudaStreamQuery() on that stream. Therefore, applications should follow these guidelines to improve their potential for concurrent kernel execution:
▶ All independent operations should be issued before dependent operations.
▶ Synchronization of any kind should be delayed as long as possible.
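As a sketch of these guidelines (with hypothetical buffers, streams, and a hypothetical kernel), work is issued breadth-first across the streams, and the only synchronization comes at the very end:

```cpp
// Issue all independent operations first (breadth-first across streams),
// then the dependent ones, and synchronize only once at the end.
for (int i = 0; i < n; i++)
    cudaMemcpyAsync(d_in[i], h_in[i], bytes, cudaMemcpyHostToDevice, stream[i]);
for (int i = 0; i < n; i++)
    kernel<<<grid, block, 0, stream[i]>>>(d_in[i], d_out[i]);
for (int i = 0; i < n; i++)
    cudaMemcpyAsync(h_out[i], d_out[i], bytes, cudaMemcpyDeviceToHost, stream[i]);
// Delayed as long as possible: one synchronization, not one per iteration.
cudaDeviceSynchronize();
```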
3.1.4. Programmatic Dependent Kernel Launch

As we have discussed earlier, the semantics of CUDA streams are such that kernels execute in order. This is so that if we have two successive kernels, where the second kernel depends on results from the first one, the programmer can be safe in the knowledge that by the time the second kernel starts executing the dependent data will be available. However, it may be the case that the first kernel has already written the data on which a subsequent kernel depends to global memory while it still has more work to do. Likewise, the dependent second kernel may have some independent work to perform before it needs the data from the first kernel. In such a situation it is possible to partially overlap the execution of the two kernels (assuming that hardware resources are available). The overlap can also hide the launch overheads of the second kernel. Other than the availability of hardware resources, the degree of overlap which can be achieved depends on the specific structure of the kernels, such as:
▶ When in its execution does the first kernel finish the work on which the second kernel depends?
▶ When in its execution does the second kernel start working on the data from the first kernel?
Since this is very much dependent on the specific kernels in question, it is difficult to automate completely, and hence CUDA provides a mechanism to allow the application developer to specify the synchronization point between the two kernels. This is done via a technique known as Programmatic Dependent Kernel Launch (PDL). The situation is depicted in the figure below.
PDL has three main components:
i) The first kernel (the so-called primary kernel) needs to call a special function to indicate that it is done with everything that the subsequent dependent kernels (also called secondary kernels) will need. This is done by calling the function cudaTriggerProgrammaticLaunchCompletion().
ii) In turn, the dependent secondary kernel needs to indicate that it has reached the portion of its work which is independent of the primary kernel and that it is now waiting on the primary kernel to finish the work on which it depends. This is done with the function cudaGridDependencySynchronize().
iii) The secondary kernel needs to be launched with a special attribute cudaLaunchAttributeProgrammaticStreamSerialization with its programmaticStreamSerializationAllowed field set to '1'.
The following code snippet shows an example of how this can be done.
Listing 3: Example of Programmatic Dependent Kernel Launch with two Kernels

__global__ void primary_kernel()
{
    // Initial work that should finish before starting secondary kernel

    // Trigger the secondary kernel
    cudaTriggerProgrammaticLaunchCompletion();

    // Work that can coincide with the secondary kernel
}

__global__ void secondary_kernel()
{
    // Initialization, independent work, etc.

    // Will block until all primary kernels the secondary kernel is dependent on
    // have completed and flushed results to global memory
    cudaGridDependencySynchronize();

    // Dependent work
}

// Launch the secondary kernel with the special attribute
// Set up the attribute
cudaLaunchAttribute attribute[1];
attribute[0].id = cudaLaunchAttributeProgrammaticStreamSerialization;
attribute[0].val.programmaticStreamSerializationAllowed = 1;

// Set the attribute in a kernel launch configuration
cudaLaunchConfig_t config = {0};
// Base launch configuration
config.gridDim = grid_dim;
config.blockDim = block_dim;
config.dynamicSmemBytes = 0;
config.stream = stream;

// Add special attribute for PDL
config.attrs = attribute;
config.numAttrs = 1;

// Launch primary kernel
primary_kernel<<<grid_dim, block_dim, 0, stream>>>();

// Launch secondary (dependent) kernel using the configuration with
// the attribute
cudaLaunchKernelEx(&config, secondary_kernel);
3.1.5. Batched Memory Transfers

A common pattern in CUDA development is to use a technique of batching. By batching we loosely mean that we have several (typically small) tasks grouped together into a single (typically bigger) operation. The components of the batch do not necessarily all have to be identical, although they often are. An example of this idea is the batch matrix multiplication operation provided by cuBLAS.
Generally, as with CUDA Graphs and PDL, the purpose of batching is to reduce the overheads associated with dispatching the individual batch tasks separately. In terms of memory transfers, launching a memory transfer can incur some CPU and driver overheads. Further, the regular cudaMemcpyAsync() function in its current form does not necessarily provide enough information for the driver to optimize the transfer, for example, in terms of hints about the source and destination. On Tegra platforms one has the choice of using SMs or Copy Engines (CEs) to perform transfers. The choice of which is currently specified by a heuristic in the driver. This can be important because using the SMs may result in a faster transfer; however, it ties down some of the available compute power. On the other hand, using the CEs may result in a slower transfer but overall higher application performance, since it leaves the SMs free to perform other work.
These considerations motivated the design of the cudaMemcpyBatchAsync() function (and its relative cudaMemcpyBatch3DAsync()). These functions allow batched memory transfers to be optimized. Apart from the lists of source and destination pointers, the API uses memory copy attributes to specify expectations of orderings, with hints for source and destination locations, as well as for whether one prefers to overlap the transfer with compute (something that is currently only supported on Tegra platforms with CEs).
Let us first consider the simplest case of a simple batch transfer of data from pinned host memory to pinned device memory.
Listing 4: Example of Homogeneous Batched Memory Transfer from Pinned Host Memory to Pinned Device Memory

std::vector<void *> srcs(batch_size);
std::vector<void *> dsts(batch_size);
std::vector<size_t> sizes(batch_size);
// Allocate the source and destination buffers and
// initialize the sources in stream order
for (size_t i = 0; i < batch_size; i++) {
    cudaMallocHost(&srcs[i], sizes[i]);
    cudaMalloc(&dsts[i], sizes[i]);
    cudaMemsetAsync(srcs[i], 0, sizes[i], stream);
}

// Setup attributes for this batch of copies
cudaMemcpyAttributes attrs = {};
attrs.srcAccessOrder = cudaMemcpySrcAccessOrderStream;
// All copies in the batch have the same copy attributes.
size_t attrsIdxs = 0; // Index of the attributes
// Launch the batched memory transfer
cudaMemcpyBatchAsync(&dsts[0], &srcs[0], &sizes[0], batch_size,
    &attrs, &attrsIdxs, 1 /*numAttrs*/, nullptr /*failIdx*/, stream);
The first few parameters to the cudaMemcpyBatchAsync() function are immediately sensible: they are the arrays containing the source and destination pointers, as well as the transfer sizes.
Each array has to have batch_size elements. The new information comes from the attributes. The function needs a pointer to an array of attributes, and a corresponding array of attribute indices. In principle it is also possible to pass an array of size_t in which the indices of any failed transfers can be recorded; however, it is safe to pass a nullptr here, in which case the indices of failures will simply not be recorded.
Turning to the attributes, in this instance the transfers are homogeneous, so we use only one attribute, which will apply to all the transfers. This is controlled by the attribute index parameter. In principle this can be an array: element i of the array contains the index of the first transfer to which the i-th element of the attribute array applies. In this case, attrsIdxs is treated as a single-element array, with the value '0' meaning that attribute[0] will apply to all transfers with index 0 and up, in other words all the transfers.
Finally, we note that we have set the srcAccessOrder attribute to cudaMemcpySrcAccessOrderStream. This means that the source data will be accessed in regular stream order. In other words, the memcpy will block until previous kernels dealing with the data from any of these source and destination pointers have completed.
In the next example we will consider a more complex case of a heterogeneous batch transfer.
Listing 5: Example of Heterogeneous Batched Memory Transfer using some Ephemeral Host Memory to Pinned Device Memory

std::vector<void *> srcs(batch_size);
std::vector<void *> dsts(batch_size);
std::vector<size_t> sizes(batch_size);
// Allocate the src and dst buffers
for (size_t i = 0; i < batch_size - 10; i++) {
    cudaMallocHost(&srcs[i], sizes[i]);
    cudaMalloc(&dsts[i], sizes[i]);
}
int buffer[10];
for (size_t i = batch_size - 10; i < batch_size; i++) {
    srcs[i] = &buffer[10 - (batch_size - i)];
    cudaMalloc(&dsts[i], sizes[i]);
}

// Setup attributes for this batch of copies
cudaMemcpyAttributes attrs[2] = {};
attrs[0].srcAccessOrder = cudaMemcpySrcAccessOrderStream;
attrs[1].srcAccessOrder = cudaMemcpySrcAccessOrderDuringApiCall;
size_t attrsIdxs[2];
attrsIdxs[0] = 0;
attrsIdxs[1] = batch_size - 10;
// Launch the batched memory transfer
cudaMemcpyBatchAsync(&dsts[0], &srcs[0], &sizes[0], batch_size,
    attrs, attrsIdxs, 2 /*numAttrs*/, nullptr /*failIdx*/, stream);
Here we have two kinds of transfers: batch_size - 10 transfers from pinned host memory to pinned device memory, and 10 transfers from a host array to pinned device memory. Further, the buffer array is not only on the host but exists only in the current scope; its address is what is known as an ephemeral pointer. This pointer may not be valid after the API call completes (it is asynchronous). To perform the copies with such ephemeral pointers, the srcAccessOrder in the attribute must be set to cudaMemcpySrcAccessOrderDuringApiCall.
We now have two attributes: the first one applies to all transfers with indices starting at 0 and less than batch_size - 10; the second one applies to all transfers with indices starting at batch_size - 10 and less than batch_size.
If instead of allocating the buffer array from the stack we had allocated it from the heap using malloc, the data would not be ephemeral anymore: it would be valid until the pointer was explicitly freed. In such a case the best option for how to stage the copies would depend on whether the system has hardware-managed memory or coherent GPU access to host memory via address translation, in which case it would be best to use stream ordering, or whether it does not, in which case staging the transfers immediately would make most sense. In this situation, one should use the value cudaMemcpySrcAccessOrderAny for the srcAccessOrder of the attribute.
The cudaMemcpyBatchAsync function also allows the programmer to provide hints about the source and destination locations. This is done by setting the srcLocHint and dstLocHint fields of the cudaMemcpyAttributes structure. These fields are both of type cudaMemLocation, which is a structure that contains the type of the location and the ID of the location. This is the same cudaMemLocation structure that can be used to give prefetching hints to the runtime when using cudaMemPrefetchAsync(). We illustrate how to set up the hints for a transfer from the device to a specific NUMA node of the host in the code example below:
Listing 6: Example of Setting Source and Destination Location Hints

// Allocate the source and destination buffers
std::vector<void *> srcs(batch_size);
std::vector<void *> dsts(batch_size);
std::vector<size_t> sizes(batch_size);

// cudaMemLocation structures we will use to provide location hints
// Device device_id
cudaMemLocation srcLoc = {cudaMemLocationTypeDevice, dev_id};
// Host with NUMA node numa_id
cudaMemLocation dstLoc = {cudaMemLocationTypeHostNuma, numa_id};

// Allocate the src and dst buffers
for (size_t i = 0; i < batch_size; i++) {
    cudaMallocManaged(&srcs[i], sizes[i]);
    cudaMallocManaged(&dsts[i], sizes[i]);
    cudaMemPrefetchAsync(srcs[i], sizes[i], srcLoc, 0, stream);
    cudaMemPrefetchAsync(dsts[i], sizes[i], dstLoc, 0, stream);
    cudaMemsetAsync(srcs[i], 0, sizes[i], stream);
}

// Setup attributes for this batch of copies
cudaMemcpyAttributes attrs = {};
// These are managed memory pointers so Stream Order is appropriate
attrs.srcAccessOrder = cudaMemcpySrcAccessOrderStream;
// Now we can specify the location hints here.
attrs.srcLocHint = srcLoc;
attrs.dstLocHint = dstLoc;
// All copies in the batch have the same copy attributes.
size_t attrsIdxs = 0;
// Launch the batched memory transfer
cudaMemcpyBatchAsync(&dsts[0], &srcs[0], &sizes[0], batch_size,
    &attrs, &attrsIdxs, 1 /*numAttrs*/, nullptr /*failIdx*/, stream);
The last thing to cover is the flag for hinting whether we want to use SMs or CEs for the transfers. The field for this is cudaMemcpyAttributes::flags and the possible values are:
▶ cudaMemcpyFlagDefault – default behavior
▶ cudaMemcpyFlagPreferOverlapWithCompute – this hints that the system should prefer to use CEs for the transfers, overlapping the transfer with computations. This flag is ignored on non-Tegra platforms.
In summary, the main points regarding cudaMemcpyBatchAsync are as follows:
▶ The cudaMemcpyBatchAsync function (and its 3D variant) allows the programmer to specify a batch of memory transfers, allowing the amortization of transfer setup overheads.
▶ Other than the source and destination pointers and the transfer sizes, the function can take one or more memory copy attributes providing information about the kind of memory being transferred and the corresponding stream-ordering behavior of the source pointers, hints about the source and destination locations, and hints as to whether to prefer to overlap the transfer with compute (if possible) or whether to use SMs for the transfer.
▶ Given the above information, the runtime can attempt to optimize the transfer to the maximum degree possible.
3.1.6. Environment Variables

CUDA provides various environment variables (see Section 5.2), which can affect execution and performance. If they are not explicitly set, CUDA uses reasonable default values for them, but special handling may be required on a per-case basis, e.g., for debugging purposes or to get improved performance.
For example, increasing the value of the CUDA_DEVICE_MAX_CONNECTIONS environment variable may be necessary to reduce the possibility that independent work from different CUDA streams gets serialized due to false dependencies. Such false dependencies may be introduced when the same underlying resource(s) are used. It is recommended to start by using the default value and only explore the impact of this environment variable in case of performance issues (e.g., unexpected serialization of independent work across CUDA streams that cannot be attributed to other factors like lack of available SM resources). It is worth noting that this environment variable has a different (lower) default value in the case of MPS.
Similarly, setting the CUDA_MODULE_LOADING environment variable to EAGER may be preferable for latency-sensitive applications, in order to move all overhead due to module loading to the application's initialization phase and outside its critical phase. The current default mode is lazy module loading. In this default mode, a similar effect to eager module loading could be achieved by adding "warm-up" calls of the various kernels during the application's initialization phase, to force module loading to happen sooner.
Please refer to CUDA Environment Variables for more details about the various CUDA environment variables. It is recommended that you set the environment variables to new values before you launch the application; attempting to set them within your application may have no effect.
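For instance, the variables can be set in the launching shell before the process starts (the application name and the connection count below are placeholders for illustration):

```shell
# Set CUDA environment variables in the shell, before the process starts.
export CUDA_DEVICE_MAX_CONNECTIONS=32   # value shown for illustration only
export CUDA_MODULE_LOADING=EAGER        # front-load module loading into app init
# ./my_app                              # hypothetical application binary
echo "$CUDA_MODULE_LOADING"             # prints EAGER
```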
3.2. Advanced Kernel Programming

This chapter will first take a deeper dive into the hardware model of NVIDIA GPUs, and then introduce some of the more advanced features available in CUDA kernel code aimed at improving kernel performance. This chapter will introduce some concepts related to thread scopes, asynchronous execution, and the associated synchronization primitives. These conceptual discussions provide a necessary foundation for some of the advanced performance features available within kernel code.
Detailed descriptions for some of these features are contained in chapters dedicated to the features in the next part of this programming guide.
▶ Advanced synchronization primitives introduced in this chapter are covered completely in Section 4.9 and Section 4.10.
▶ Asynchronous data copies, including the tensor memory accelerator (TMA), are introduced in this chapter and covered completely in Section 4.11.
3.2.1. Using PTX

Parallel Thread Execution (PTX), the virtual machine instruction set architecture (ISA) that CUDA uses to abstract hardware ISAs, was introduced in Section 1.3.3. Writing code in PTX directly is a highly advanced optimization technique that is not necessary for most developers and should be considered a tool of last resort. Nevertheless, there are situations where the fine-grained control enabled by writing PTX directly enables performance improvements in specific applications. These situations are typically in very performance-sensitive portions of an application where every fraction of a percent of performance improvement has significant benefits. All of the available PTX instructions are in the PTX ISA document.

cuda::ptx namespace

One way to use PTX directly in your code is to use the cuda::ptx namespace from libcu++. This namespace provides C++ functions that map directly to PTX instructions, simplifying their use within a C++ application. For more information, please refer to the cuda::ptx namespace documentation.
Inline PTX

Another way to include PTX in your code is to use inline PTX. This method is described in detail in the corresponding documentation. This is very similar to writing assembly code on a CPU.
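As a minimal sketch, the hypothetical device function below uses an inline PTX add.s32 instruction in place of the + operator:

```cpp
// Minimal inline-PTX sketch: add two 32-bit integers with add.s32.
// "=r" binds r as a 32-bit register output; "r" binds a and b as inputs.
__device__ int ptx_add(int a, int b) {
    int r;
    asm("add.s32 %0, %1, %2;" : "=r"(r) : "r"(a), "r"(b));
    return r;
}
```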
3.2.2. Hardware Implementation

A streaming multiprocessor or SM (see GPU Hardware Model) is designed to execute hundreds of threads concurrently. To manage such a large number of threads, it employs a unique parallel computing model called Single-Instruction, Multiple-Thread, or SIMT, that is described in SIMT Execution Model. The instructions are pipelined, leveraging instruction-level parallelism within a single thread, as well as extensive thread-level parallelism through simultaneous hardware multithreading as detailed in Hardware Multithreading. Unlike CPU cores, SMs issue instructions in order and do not perform branch prediction or speculative execution.
Sections SIMT Execution Model and Hardware Multithreading describe the architectural features of the SM that are common to all devices. Section Compute Capabilities provides the specifics for devices of different compute capabilities.
The NVIDIA GPU architecture uses a little-endian representation.
3.2.2.1 SIMT Execution Model

Each SM creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps. Individual threads composing a warp start together at the same program address, but they have their own instruction address counter and register state and are therefore free to branch and execute independently. The term warp originates from weaving, the first parallel thread technology. A half-warp is either the first or second half of a warp. A quarter-warp is either the first, second, third, or fourth quarter of a warp.
A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp executes each branch path taken, disabling threads that are not on that path. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths.
The SIMT architecture is akin to SIMD (Single Instruction, Multiple Data) vector organizations in that a single instruction controls multiple processing elements. A key difference is that SIMD vector organizations expose the SIMD width to the software, whereas SIMT instructions specify the execution and branching behavior of a single thread. In contrast with SIMD vector machines, SIMT enables programmers to write thread-level parallel code for independent, scalar threads, as well as data-parallel code for coordinated threads. For the purposes of correctness, the programmer can essentially ignore the SIMT behavior; however, substantial performance improvements can be realized by taking care that the code seldom requires threads in a warp to diverge. In practice, this is analogous to the role of cache lines: the cache line size can be safely ignored when designing for correctness but must be considered in the code structure when designing for peak performance. Vector architectures, on the other hand, require the software to coalesce loads into vectors and manage divergence manually.
3.2.2.1.1 Independent Thread Scheduling

On GPUs with compute capability lower than 7.0, warps used a single program counter shared amongst all 32 threads in the warp together with an active mask specifying the active threads of the warp. As a result, threads from the same warp in divergent regions or different states of execution cannot signal each other or exchange data, and algorithms requiring fine-grained sharing of data guarded by locks or mutexes can lead to deadlock, depending on which warp the contending threads come from.
On GPUs of compute capability 7.0 and later, independent thread scheduling allows full concurrency between threads, regardless of warp. With independent thread scheduling, the GPU maintains execution state per thread, including a program counter and call stack, and can yield execution at a per-thread granularity, either to make better use of execution resources or to allow one thread to wait for data to be produced by another. A schedule optimizer determines how to group active threads from the same warp together into SIMT units. This retains the high throughput of SIMT execution as in prior NVIDIA GPUs, but with much more flexibility: threads can now diverge and reconverge at sub-warp granularity.
Independent thread scheduling can break code that relies on implicit warp-synchronous behavior from previous GPU architectures. Warp-synchronous code assumes that threads in the same warp execute in lockstep at every instruction, but the ability for threads to diverge and reconverge at sub-warp granularity makes such assumptions invalid. This can lead to a different set of threads participating in the executed code than intended. Any warp-synchronous code developed for GPUs prior to CC 7.0 (such as synchronization-free intra-warp reductions) should be revisited to ensure compatibility. Developers should explicitly synchronize such code using __syncwarp() to ensure correct behavior across all GPU generations.
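As an illustration (a sketch, not code from this guide), an intra-warp reduction written for the independent-thread-scheduling era uses explicitly synchronized shuffle intrinsics rather than assuming lockstep execution:

```cpp
// Sketch: sum a value across the 32 threads of a full warp.
// __shfl_down_sync takes an explicit participation mask, so no
// lockstep-execution assumption is made; 0xffffffff means all 32 lanes.
__device__ int warp_reduce_sum(int val) {
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;   // lane 0 holds the warp-wide sum
}
```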
Note
The threads of a warp that are participating in the current instruction are called the active threads, whereas threads not on the current instruction are inactive (disabled). Threads can be inactive for a variety of reasons including having exited earlier than other threads of their warp, having taken a different branch path than the branch path currently executed by the warp, or being the last threads of a block whose number of threads is not a multiple of the warp size.
If a non-atomic instruction executed by a warp writes to the same location in global or shared memory from more than one of the threads of the warp, the number of serialized writes that occur to that location may vary depending on the compute capability of the device. However, for all compute capabilities, which thread performs the final write is undefined.
If an atomic instruction executed by a warp reads, modifies, and writes to the same location in global memory for more than one of the threads of the warp, each read/modify/write to that location occurs and they are all serialized, but the order in which they occur is undefined.
3.2.2.2 Hardware Multithreading

When an SM is given one or more thread blocks to execute, it partitions them into warps and each warp gets scheduled for execution by a warp scheduler. The way a block is partitioned into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0. Thread Hierarchy describes how thread IDs relate to thread indices in the block.
The total number of warps in a block is defined as follows:

    ceil(T / Wsize, 1)

▶ T is the number of threads per block,
▶ Wsize is the warp size, which is equal to 32,
▶ ceil(x, y) is equal to x rounded up to the nearest multiple of y.
Figure 19: A thread block is partitioned into warps of 32 threads.
The execution context (program counters, registers, etc.) for each warp processed by an SM is maintained on-chip throughout the warp's lifetime. Therefore, switching between warps incurs no cost. At each instruction issue cycle, a warp scheduler selects a warp with threads ready to execute its next instruction (the active threads of the warp) and issues the instruction to those threads.
Each SM has a set of 32-bit registers that are partitioned among the warps, and a shared memory that is partitioned among the thread blocks. The number of blocks and warps that can reside and be processed concurrently on the SM for a given kernel depends on the amount of registers and shared memory used by the kernel, as well as the amount of registers and shared memory available on the SM. There are also a maximum number of resident blocks and warps per SM. These limits, as well as the amount of registers and shared memory available on the SM, depend on the compute capability of the device and are specified in Compute Capabilities. If there are not enough resources available per SM to process at least one block, the kernel will fail to launch. The total number of registers and shared memory allocated for a block can be determined in several ways documented in the Occupancy section.
3.2.2.3 Asynchronous Execution Features

Recent NVIDIA GPU generations have included asynchronous execution capabilities to allow more overlap of data movement, computation, and synchronization within the GPU. These capabilities enable certain operations invoked from GPU code to execute asynchronously to other GPU code in the same thread block. This asynchronous execution should not be confused with the asynchronous CUDA APIs discussed in Section 2.3, which enable GPU kernel launches or memory operations to operate asynchronously to each other or to the CPU.

Compute capability 8.0 (the NVIDIA Ampere GPU architecture) introduced hardware-accelerated asynchronous data copies from global to shared memory and asynchronous barriers (see NVIDIA A100 Tensor Core GPU Architecture).

Compute capability 9.0 (the NVIDIA Hopper GPU architecture) extended the asynchronous execution features with the Tensor Memory Accelerator (TMA) unit, which can transfer large blocks of data and multidimensional tensors from global memory to shared memory and vice versa, asynchronous transaction barriers, and asynchronous matrix multiply-accumulate operations (see the Hopper Architecture in Depth blog post for details).
CUDA provides APIs which can be called by threads from device code to use these features. The asynchronous programming model defines the behavior of asynchronous operations with respect to CUDA threads.

An asynchronous operation is an operation initiated by a CUDA thread, but executed asynchronously as if by another thread, which we will refer to as an async thread. In a well-formed program, one or more CUDA threads synchronize with the asynchronous operation. The CUDA thread that initiated the asynchronous operation is not required to be among the synchronizing threads. The async thread is always associated with the CUDA thread that initiated the operation.

An asynchronous operation uses a synchronization object to signal its completion, which could be a barrier or a pipeline. These synchronization objects are explained in detail in Advanced Synchronization Primitives, and their role in performing asynchronous memory operations is demonstrated in Asynchronous Data Copies.
3.2.2.3.1 Async Thread and Async Proxy

Asynchronous operations may access memory differently than regular operations. To distinguish between these different memory access methods, CUDA introduces the concepts of an async thread, a generic proxy, and an async proxy. Normal operations (loads and stores) go through the generic proxy. Some asynchronous instructions, such as LDGSTS and STAS/REDAS, are modeled using an async thread operating in the generic proxy. Other asynchronous instructions, such as bulk-asynchronous copies with TMA and some tensor core operations (tcgen05.*, wgmma.mma_async.*), are modeled using an async thread operating in the async proxy.

Async thread operating in the generic proxy. When an asynchronous operation is initiated, it is associated with an async thread, which is different from the CUDA thread that initiated the operation. Preceding generic-proxy (normal) loads and stores to the same address are guaranteed to be ordered before the asynchronous operation. However, subsequent normal loads and stores to the same address are not guaranteed to maintain their ordering, potentially incurring a race condition until the async thread completes.

Async thread operating in the async proxy. When an asynchronous operation is initiated, it is associated with an async thread, which is different from the CUDA thread that initiated the operation. Prior and subsequent normal loads and stores to the same address are not guaranteed to maintain their ordering. A proxy fence is required to synchronize them across the different proxies to ensure proper memory ordering. Section Using the Tensor Memory Accelerator (TMA) demonstrates use of proxy fences to ensure correctness when performing asynchronous copies with TMA.

For more details on these concepts, see the PTX ISA documentation.
3.2.3. Thread Scopes

CUDA threads form a Thread Hierarchy, and using this hierarchy is essential for writing both correct and performant CUDA kernels. Within this hierarchy, the visibility and synchronization scope of memory operations can vary. To account for this non-uniformity, the CUDA programming model introduces the concept of thread scopes. A thread scope defines which threads can observe a thread's loads and stores and specifies which threads can synchronize with each other using synchronization primitives such as atomic operations and barriers. Each scope has an associated point of coherency in the memory hierarchy.

Thread scopes are exposed in CUDA PTX and are also available as extensions in the libcu++ library. The following table defines the thread scopes available:
| CUDA C++ Thread Scope | CUDA PTX Scope | Description | Point of Coherency in Memory Hierarchy |
| --------------------- | -------------- | ----------- | -------------------------------------- |
| cuda::thread_scope_thread | | Memory operations are visible only to the local thread. | – |
| cuda::thread_scope_block | .cta | Memory operations are visible to other threads in the same thread block. | L1 |
| cuda::thread_scope_cluster | .cluster | Memory operations are visible to other threads in the same thread block cluster. | L2 |
| cuda::thread_scope_device | .gpu | Memory operations are visible to other threads in the same GPU device. | L2 |
| cuda::thread_scope_system | .sys | Memory operations are visible to other threads in the same system (CPU, other GPUs). | L2 + connected caches |
Sections Advanced Synchronization Primitives and Asynchronous Data Copies demonstrate use of thread scopes.
3.2.4. Advanced Synchronization Primitives
This section introduces three families of synchronization primitives:

▶ Scoped Atomics, which pair C++ memory ordering with CUDA thread scopes to safely communicate across threads at block, cluster, device, or system scope (see Thread Scopes).
▶ Asynchronous Barriers, which split synchronization into arrival and wait phases, and can be used to track the progress of asynchronous operations.
▶ Pipelines, which stage work and coordinate multi-buffer producer–consumer patterns, commonly used to overlap compute with asynchronous data copies.
3.2.4.1 Scoped Atomics

Section 5.4.5 gives an overview of atomic functions available in CUDA. In this section, we will focus on scoped atomics that support C++ standard atomic memory semantics, available through the libcu++ library or through compiler built-in functions. Scoped atomics provide the tools for efficient synchronization at the appropriate level of the CUDA thread hierarchy, enabling both correctness and performance in complex parallel algorithms.
3.2.4.1.1 Thread Scope and Memory Ordering

Scoped atomics combine two key concepts:

▶ Thread Scope: defines which threads can observe the effect of the atomic operation (see Thread Scopes).
▶ Memory Ordering: defines the ordering constraints relative to other memory operations (see C++ standard atomic memory semantics).
CUDA C++ cuda::atomic

```cpp
#include <cuda/atomic>

__global__ void block_scoped_counter() {
    // Shared atomic counter visible only within this block
    __shared__ cuda::atomic<int, cuda::thread_scope_block> counter;

    // Initialize counter (only one thread should do this)
    if (threadIdx.x == 0) {
        counter.store(0, cuda::memory_order_relaxed);
    }
    __syncthreads();

    // All threads in block atomically increment
    int old_value = counter.fetch_add(1, cuda::memory_order_relaxed);

    // Use old_value...
}
```
Built-in Atomic Functions

```cpp
__global__ void block_scoped_counter() {
    // Shared counter visible only within this block
    __shared__ int counter;

    // Initialize counter (only one thread should do this)
    if (threadIdx.x == 0) {
        __nv_atomic_store_n(&counter, 0,
                            __NV_ATOMIC_RELAXED,
                            __NV_THREAD_SCOPE_BLOCK);
    }
    __syncthreads();

    // All threads in block atomically increment
    int old_value = __nv_atomic_fetch_add(&counter, 1,
                                          __NV_ATOMIC_RELAXED,
                                          __NV_THREAD_SCOPE_BLOCK);

    // Use old_value...
}
```
This example implements a block-scoped atomic counter that demonstrates the fundamental concepts of scoped atomics:

▶ Shared Variable: a single counter is shared among all threads in the block using __shared__ memory.
▶ Atomic Type Declaration: cuda::atomic<int, cuda::thread_scope_block> creates an atomic integer with block-level visibility.
▶ Single Initialization: only thread 0 initializes the counter to prevent race conditions during setup.
▶ Block Synchronization: __syncthreads() ensures all threads see the initialized counter before proceeding.
▶ Atomic Increment: each thread atomically increments the counter and receives the previous value.

cuda::memory_order_relaxed is chosen here because we only need atomicity (indivisible read-modify-write) without ordering constraints between different memory locations. Since this is a straightforward counting operation, the order of increments doesn't matter for correctness.
For producer-consumer patterns, acquire-release semantics ensure proper ordering:
| CUDAC++cuda::atomic | |
| | __global__ | | void | producer_consumer() | | { | | | | | | |
| | ---------- | --- | ---- | ------------------- | --- | --- | --- | --- | --- | --- | | |
| | __shared__ | | | int data; | | | | | | | | |
| __shared__ cuda::atomic<bool, cuda::thread_scope_block> ready; | |
| | if | (threadIdx.x | | == 0) | { | | | | | | | |
| | --- | ------------ | --------- | ----- | --------- | ------ | ----- | --- | --- | --- | | |
| | | ∕∕ | Producer: | write | data then | signal | ready | | | | | |
| | | data | = | 42; | | | | | | | | |
| ready.store(true, cuda::memory_order_release); ∕∕ Release ensures | |
| | ,→data | write | is | visible | | | | | | | | |
| | ------ | ----- | --- | ------- | --- | --- | --- | --- | --- | --- | | |
| (continuesonnextpage) | |
| | 106 | | | | | | | | Chapter3. | AdvancedCUDA | | |
| | --- | --- | --- | --- | --- | --- | --- | --- | --------- | ------------ | | |
| CUDAProgrammingGuide,Release13.1 | |
| (continuedfrompreviouspage) | |
| | } | else { | | | | | | | |
| | --- | ------------ | ---- | --------- | ---------------- | ---- | --- | | |
| | | ∕∕ Consumer: | wait | for ready | signal then read | data | | | |
| while (!ready.load(cuda::memory_order_acquire)) { ∕∕ Acquire ensures | |
| | ,→data | read sees | the write | | | | | | |
| | ------ | --------- | --------- | --- | --- | --- | --- | | |
| | | ∕∕ spin | wait | | | | | | |
| } | |
| | | int value | = data; | | | | | | |
| | --- | ---------- | -------- | --- | --- | --- | --- | | |
| | | ∕∕ Process | value... | | | | | | |
| } | |
| } | |
Built-in Atomic Functions

```cpp
__global__ void producer_consumer() {
    __shared__ int data;
    __shared__ bool ready; // Only ready flag needs atomic operations

    if (threadIdx.x == 0) {
        // Producer: write data then signal ready
        data = 42;
        __nv_atomic_store_n(&ready, true,
                            __NV_ATOMIC_RELEASE,
                            __NV_THREAD_SCOPE_BLOCK); // Release ensures data write is visible
    } else {
        // Consumer: wait for ready signal then read data
        while (!__nv_atomic_load_n(&ready,
                                   __NV_ATOMIC_ACQUIRE,
                                   __NV_THREAD_SCOPE_BLOCK)) { // Acquire ensures data read sees the write
            // spin wait
        }
        int value = data;
        // Process value...
    }
}
```
3.2.4.1.2 Performance Considerations

▶ Use the narrowest scope possible: block-scoped atomics are much faster than system-scoped atomics.
▶ Prefer weaker orderings: use stronger orderings only when necessary for correctness.
▶ Consider memory location: shared memory atomics are faster than global memory atomics.
3.2.4.2 Asynchronous Barriers

An asynchronous barrier differs from a typical single-stage barrier (__syncthreads()) in that the notification by a thread that it has reached the barrier (the "arrival") is separated from the operation of waiting for other threads to arrive at the barrier (the "wait"). This separation increases execution efficiency by allowing a thread to perform additional operations unrelated to the barrier, making more efficient use of the wait time. Asynchronous barriers can be used to implement producer-consumer patterns with CUDA threads or enable asynchronous data copies within the memory hierarchy by having the copy operation signal ("arrive on") a barrier upon completion.

Asynchronous barriers are available on devices of compute capability 7.0 or higher. Devices of compute capability 8.0 or higher provide hardware acceleration for asynchronous barriers in shared memory and a significant advancement in synchronization granularity, by allowing hardware-accelerated synchronization of any subset of CUDA threads within the block. Previous architectures only accelerate synchronization at a whole-warp (__syncwarp()) or whole-block (__syncthreads()) level.

The CUDA programming model provides asynchronous barriers via cuda::std::barrier, an ISO C++-conforming barrier available in the libcu++ library. In addition to implementing std::barrier, the library offers CUDA-specific extensions to select a barrier's thread scope to improve performance and exposes a lower-level cuda::ptx API. A cuda::barrier can interoperate with cuda::ptx by using the friend function cuda::device::barrier_native_handle() to retrieve the barrier's native handle and pass it to cuda::ptx functions. CUDA also provides a primitives API for asynchronous barriers in shared memory at thread-block scope.
The following table gives an overview of asynchronous barriers available for synchronizing at different thread scopes.

| Thread Scope | Memory Location | Arrive on Barrier | Wait on Barrier | Hardware-accelerated | CUDA APIs |
| ------------ | --------------- | ----------------- | --------------- | -------------------- | --------- |
| block | local shared memory | allowed | allowed | yes (8.0+) | cuda::barrier, cuda::ptx, primitives |
| cluster | local shared memory | allowed | allowed | yes (9.0+) | cuda::barrier, cuda::ptx |
| cluster | remote shared memory | allowed | not allowed | yes (9.0+) | cuda::barrier, cuda::ptx |
| device | global memory | allowed | allowed | no | cuda::barrier |
| system | global/unified memory | allowed | allowed | no | cuda::barrier |
Temporal Splitting of Synchronization

Without the asynchronous arrive-wait barriers, synchronization within a thread block is achieved using __syncthreads() or block.sync() when using Cooperative Groups.
```cpp
#include <cooperative_groups.h>

__global__ void simple_sync(int iteration_count) {
    auto block = cooperative_groups::this_thread_block();
    for (int i = 0; i < iteration_count; ++i) {
        /* code before arrive */
        // Wait for all threads to arrive here.
        block.sync();
        /* code after wait */
    }
}
```
Threads are blocked at the synchronization point (block.sync()) until all threads have reached the synchronization point. In addition, memory updates that happened before the synchronization point are guaranteed to be visible to all threads in the block after the synchronization point.

This pattern has three stages:

▶ Code before the sync performs memory updates that will be read after the sync.
▶ Synchronization point.
▶ Code after the sync, with visibility of memory updates that happened before the sync.

Using asynchronous barriers instead, the temporally-split synchronization pattern is as follows.
CUDA C++ cuda::barrier

```cpp
#include <cuda/barrier>
#include <cooperative_groups.h>

__device__ void compute(float *data, int iteration);

__global__ void split_arrive_wait(int iteration_count, float *data)
{
    using barrier_t = cuda::barrier<cuda::thread_scope_block>;
    __shared__ barrier_t bar;
    auto block = cooperative_groups::this_thread_block();

    if (block.thread_rank() == 0)
    {
        // Initialize barrier with expected arrival count.
        init(&bar, block.size());
    }
    block.sync();

    for (int i = 0; i < iteration_count; ++i)
    {
        /* code before arrive */
        // This thread arrives. Arrival does not block a thread.
        barrier_t::arrival_token token = bar.arrive();
        compute(data, i);
        // Wait for all threads participating in the barrier to complete bar.arrive().
        bar.wait(std::move(token));
        /* code after wait */
    }
}
```
| | 110 | | | | | Chapter3. | AdvancedCUDA | | |
| | --- | --- | --- | --- | --- | --------- | ------------ | | |
| CUDAProgrammingGuide,Release13.1 | |
CUDA C++ cuda::ptx

```cpp
#include <cuda/ptx>
#include <cooperative_groups.h>

__device__ void compute(float *data, int iteration);

__global__ void split_arrive_wait(int iteration_count, float *data)
{
    __shared__ uint64_t bar;
    auto block = cooperative_groups::this_thread_block();

    if (block.thread_rank() == 0)
    {
        // Initialize barrier with expected arrival count.
        cuda::ptx::mbarrier_init(&bar, block.size());
    }
    block.sync();

    for (int i = 0; i < iteration_count; ++i)
    {
        /* code before arrive */
        // This thread arrives. Arrival does not block a thread.
        uint64_t token = cuda::ptx::mbarrier_arrive(&bar);
        compute(data, i);
        // Wait for all threads participating in the barrier to complete mbarrier_arrive().
        while (!cuda::ptx::mbarrier_try_wait(&bar, token)) {}
        /* code after wait */
    }
}
```
CUDA C primitives

```cpp
#include <cuda_awbarrier_primitives.h>
#include <cooperative_groups.h>

__device__ void compute(float *data, int iteration);

__global__ void split_arrive_wait(int iteration_count, float *data)
{
    __shared__ __mbarrier_t bar;
    auto block = cooperative_groups::this_thread_block();

    if (block.thread_rank() == 0)
    {
        // Initialize barrier with expected arrival count.
        __mbarrier_init(&bar, block.size());
    }
    block.sync();

    for (int i = 0; i < iteration_count; ++i)
    {
        /* code before arrive */
        // This thread arrives. Arrival does not block a thread.
        __mbarrier_token_t token = __mbarrier_arrive(&bar);
        compute(data, i);
        // Wait for all threads participating in the barrier to complete __mbarrier_arrive().
        while (!__mbarrier_try_wait(&bar, token, 1000)) {}
        /* code after wait */
    }
}
```
In this pattern, the synchronization point is split into an arrive point (bar.arrive()) and a wait point (bar.wait(std::move(token))). A thread begins participating in a cuda::barrier with its first call to bar.arrive(). When a thread calls bar.wait(std::move(token)) it will be blocked until participating threads have completed bar.arrive() the expected number of times, which is the expected arrival count argument passed to init(). Memory updates that happen before participating threads' call to bar.arrive() are guaranteed to be visible to participating threads after their call to bar.wait(std::move(token)). Note that the call to bar.arrive() does not block a thread; it can proceed with other work that does not depend upon memory updates that happen before other participating threads' call to bar.arrive().

The arrive and wait pattern has five stages:

▶ Code before the arrive performs memory updates that will be read after the wait.
▶ Arrive point with implicit memory fence (i.e., equivalent to cuda::atomic_thread_fence(cuda::memory_order_seq_cst, cuda::thread_scope_block)).
▶ Code between the arrive and the wait.
▶ Wait point.
▶ Code after the wait, with visibility of updates that were performed before the arrive.
For a comprehensive guide on how to use asynchronous barriers, see Asynchronous Barriers.
3.2.4.3 Pipelines

The CUDA programming model provides the pipeline synchronization object as a coordination mechanism to sequence asynchronous memory copies into multiple stages, facilitating the implementation of double- or multi-buffering producer-consumer patterns. A pipeline is a double-ended queue with a head and a tail that processes work in a first-in first-out (FIFO) order. Producer threads commit work to the pipeline's head, while consumer threads pull work from the pipeline's tail.

Pipelines are exposed through the cuda::pipeline API in the libcu++ library, as well as through a primitives API. The following tables describe the main functionality of the two APIs.

| cuda::pipeline API | Description |
| ------------------ | ----------- |
| producer_acquire | Acquires an available stage in the pipeline's internal queue. |
| producer_commit | Commits the asynchronous operations issued after the producer_acquire call on the currently acquired stage of the pipeline. |
| consumer_wait | Waits for completion of asynchronous operations in the oldest stage of the pipeline. |
| consumer_release | Releases the oldest stage of the pipeline to the pipeline object for reuse. The released stage can then be acquired by a producer. |

| Primitives API | Description |
| -------------- | ----------- |
| __pipeline_memcpy_async | Requests a memory copy from global to shared memory to be submitted for asynchronous evaluation. |
| __pipeline_commit | Commits the asynchronous operations issued before the call on the current stage of the pipeline. |
| __pipeline_wait_prior(N) | Waits for completion of asynchronous operations in all but the last N commits to the pipeline. |

The cuda::pipeline API has a richer interface with fewer restrictions, while the primitives API only supports tracking asynchronous copies from global memory to shared memory with specific size and alignment requirements. The primitives API provides equivalent functionality to a cuda::pipeline object with cuda::thread_scope_thread.

For detailed usage patterns and examples, see Pipelines.
3.2.5. Asynchronous Data Copies

Efficient data movement within the memory hierarchy is fundamental to achieving high performance in GPU computing. Traditional synchronous memory operations force threads to wait idle during data transfers. GPUs inherently hide memory latency through parallelism. That is, the SM switches to execute another warp while memory operations complete. Even with this latency hiding through parallelism, it is still possible for memory latency to be a bottleneck on both memory bandwidth utilization and compute resource efficiency. To address these bottlenecks, modern GPU architectures provide hardware-accelerated asynchronous data copy mechanisms that allow memory transfers to proceed independently while threads continue executing other work.

Asynchronous data copies enable overlapping of computation with data movement, by decoupling the initiation of a memory transfer from waiting for its completion. This way, threads can perform useful work during memory latency periods, leading to improved overall throughput and resource utilization.

Note

While the concepts and principles underlying this section are similar to those discussed in the earlier chapter on Asynchronous Execution, that chapter covered asynchronous execution of kernels and memory transfers such as those invoked by cudaMemcpyAsync. That can be considered asynchrony of different components of the application.

The asynchrony described in this section refers to enabling transfer of data between the GPU's DRAM, i.e. global memory, and on-SM memory such as shared memory or tensor memory without blocking the GPU threads. This is asynchrony within the execution of a single kernel launch.
To understand how asynchronous copies can improve performance, it is helpful to examine a common GPU computing pattern. CUDA applications often employ a copy and compute pattern that:

▶ fetches data from global memory,
▶ stores data to shared memory, and
▶ performs computations on shared memory data, and potentially writes results back to global memory.

The copy phase of this pattern is typically expressed as shared[local_idx] = global[global_idx]. This global to shared memory copy is expanded by the compiler to a read from global memory into a register followed by a write to shared memory from the register.

When this pattern occurs within an iterative algorithm, each thread block needs to synchronize after the shared[local_idx] = global[global_idx] assignment, to ensure all writes to shared memory have completed before the compute phase can begin. The thread block also needs to synchronize again after the compute phase, to prevent overwriting shared memory before all threads have completed their computations. This pattern is illustrated in the following code snippet.
```cpp
#include <cooperative_groups.h>

__device__ void compute(int* global_out, int const* shared_in) {
    // Computes using all values of current batch from shared memory.
    // Stores this thread's result back to global memory.
}

__global__ void without_async_copy(int* global_out, int const* global_in, size_t size, size_t batch_sz) {
    auto grid = cooperative_groups::this_grid();
    auto block = cooperative_groups::this_thread_block();
    assert(size == batch_sz * grid.size()); // Exposition: input size fits batch_sz * grid_size

    extern __shared__ int shared[]; // block.size() * sizeof(int) bytes

    size_t local_idx = block.thread_rank();

    for (size_t batch = 0; batch < batch_sz; ++batch) {
        // Compute the index of the current batch for this block in global memory.
        size_t block_batch_idx = block.group_index().x * block.size() + grid.size() * batch;
        size_t global_idx = block_batch_idx + threadIdx.x;
        shared[local_idx] = global_in[global_idx];

        // Wait for all copies to complete.
        block.sync();

        // Compute and write result to global memory.
        compute(global_out + block_batch_idx, shared);

        // Wait for compute using shared memory to finish.
        block.sync();
    }
}
```
With asynchronous data copies, data movement from global memory to shared memory can be done asynchronously to enable more efficient use of the SM while waiting for data to arrive.
```cpp
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

__device__ void compute(int* global_out, int const* shared_in) {
    // Computes using all values of current batch from shared memory.
    // Stores this thread's result back to global memory.
}

__global__ void with_async_copy(int* global_out, int const* global_in,
                                size_t size, size_t batch_sz) {
    auto grid  = cooperative_groups::this_grid();
    auto block = cooperative_groups::this_thread_block();
    assert(size == batch_sz * grid.size()); // Exposition: input size fits batch_sz * grid_size

    extern __shared__ int shared[]; // block.size() * sizeof(int) bytes

    size_t local_idx = block.thread_rank();

    for (size_t batch = 0; batch < batch_sz; ++batch) {
        // Compute the index of the current batch for this block in global memory.
        size_t block_batch_idx = block.group_index().x * block.size() + grid.size() * batch;

        // Whole thread-group cooperatively copies whole batch to shared memory.
        cooperative_groups::memcpy_async(block, shared, global_in + block_batch_idx, block.size());

        // Compute on different data while waiting.

        // Wait for all copies to complete.
        cooperative_groups::wait(block);

        // Compute and write result to global memory.
        compute(global_out + block_batch_idx, shared);

        // Wait for compute using shared memory to finish.
        block.sync();
    }
}
```
The cooperative_groups::memcpy_async function copies block.size() elements from global memory to the shared data. This operation happens as-if performed by another thread, which synchronizes with the current thread's call to cooperative_groups::wait after the copy has completed. Until the copy operation completes, modifying the global data or reading or writing the shared data introduces a data race.

This example illustrates the fundamental concept behind all asynchronous copy operations: they decouple memory transfer initiation from completion, allowing threads to perform other work while data moves in the background. The CUDA programming model provides several APIs to access these capabilities, including memcpy_async functions available in Cooperative Groups and the libcu++ library, as well as lower-level cuda::ptx and primitives APIs. These APIs share similar semantics: they copy objects from source to destination as-if performed by another thread which, on completion of the copy, can be synchronized using different completion mechanisms.
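The as-if-by-another-thread semantics can be sketched in plain host C++ (an analogy, not CUDA code): initiation returns immediately, and a separate completion mechanism must be observed before the destination may be touched.

```cpp
#include <algorithm>
#include <future>
#include <vector>

// Analogy for memcpy_async/wait: the copy runs as-if on another thread;
// reading `dst` before waiting on the future would be a data race.
std::vector<int> copy_async_then_wait(const std::vector<int>& src) {
    std::vector<int> dst(src.size());
    std::future<void> copy_done = std::async(std::launch::async, [&] {
        std::copy(src.begin(), src.end(), dst.begin());
    });
    // ... independent work may overlap with the copy here ...
    copy_done.wait();  // completion mechanism; `dst` is safe to read afterwards
    return dst;
}
```

The future plays the role of the completion mechanism; in CUDA the analogous roles are filled by groups, barriers, or pipelines rather than std::future.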
Modern GPU architectures provide multiple hardware mechanisms for asynchronous data movement.

▶ LDGSTS (compute capability 8.0+) allows for efficient small-scale asynchronous transfers from global to shared memory.
▶ The tensor memory accelerator (TMA, compute capability 9.0+) extends these capabilities, providing bulk-asynchronous copy operations optimized for large multi-dimensional data transfers.
▶ STAS instructions (compute capability 9.0+) enable small-scale asynchronous transfers from registers to distributed shared memory within a cluster.

These mechanisms support different data paths, transfer sizes, and alignment requirements, allowing developers to choose the most appropriate approach for their specific data access patterns. The following table gives an overview of the supported data paths for asynchronous copies within the GPU.
Table 5: Asynchronous copies with possible source and destination memory spaces. An empty cell indicates that a source-destination pair is not supported.

| Source          | Destination     | Asynchronous Copy        | Bulk-Asynchronous Copy |
| --------------- | --------------- | ------------------------ | ---------------------- |
| global          | global          |                          |                        |
| shared::cta     | global          |                          | supported (TMA, 9.0+)  |
| global          | shared::cta     | supported (LDGSTS, 8.0+) | supported (TMA, 9.0+)  |
| global          | shared::cluster |                          | supported (TMA, 9.0+)  |
| shared::cluster | shared::cta     |                          | supported (TMA, 9.0+)  |
| shared::cta     | shared::cta     |                          |                        |
| registers       | shared::cluster | supported (STAS, 9.0+)   |                        |
Sections Using LDGSTS, Using the Tensor Memory Accelerator (TMA), and Using STAS go into more detail about each mechanism.
3.2.6. Configuring L1/Shared Memory Balance

As mentioned in L1 data cache, the L1 cache and shared memory on an SM use the same physical resource, known as the unified data cache. On most architectures, if a kernel uses little or no shared memory, the unified data cache can be configured to provide the maximal amount of L1 cache allowed by the architecture.

The unified data cache reserved for shared memory is configurable on a per-kernel basis. An application can set the carveout, or preferred shared memory capacity, with the cudaFuncSetAttribute function called before the kernel is launched.
```cpp
cudaFuncSetAttribute(kernel_name, cudaFuncAttributePreferredSharedMemoryCarveout, carveout);
```
The application can set the carveout as an integer percentage of the maximum supported shared memory capacity of that architecture. In addition to an integer percentage, three convenience enums are provided as carveout values.

▶ cudaSharedmemCarveoutDefault
▶ cudaSharedmemCarveoutMaxL1
▶ cudaSharedmemCarveoutMaxShared

The maximum supported shared memory and the supported carveout sizes vary by architecture; see Shared Memory Capacity per Compute Capability for details.

Where a chosen integer percentage carveout does not map exactly to a supported shared memory capacity, the next larger capacity is used. For example, for devices of compute capability 12.0, which have a maximum shared memory capacity of 100 KB, setting the carveout to 50% will result in 64 KB of shared memory, not 50 KB, because devices of compute capability 12.0 support shared memory sizes of 0, 8, 16, 32, 64, and 100 KB.
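As an illustration of this rounding rule, the helper below (hypothetical, not a CUDA API) maps a requested carveout percentage to the capacity a compute capability 12.0 device would actually configure:

```cpp
#include <array>

// Supported shared memory capacities (KB) for compute capability 12.0,
// as listed above; a percentage request is rounded up to the next one.
int carveout_to_capacity_kb(int carveout_percent) {
    constexpr std::array<int, 6> supported{0, 8, 16, 32, 64, 100};
    constexpr int max_kb = 100;
    const int requested_kb = (carveout_percent * max_kb + 99) / 100;  // ceil
    for (int cap_kb : supported)
        if (cap_kb >= requested_kb) return cap_kb;
    return max_kb;
}
```

A 50% request asks for 50 KB, which rounds up to the 64 KB configuration, matching the example above.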
The function passed to cudaFuncSetAttribute must be declared with the __global__ specifier. cudaFuncSetAttribute is interpreted by the driver as a hint, and the driver may choose a different carveout size if required to execute the kernel.

Note

Another CUDA API, cudaFuncSetCacheConfig, also allows an application to adjust the balance between L1 and shared memory for a kernel. However, this API sets a hard requirement on the shared/L1 balance for kernel launch. As a result, interleaving kernels with different shared memory configurations would needlessly serialize launches behind shared memory reconfigurations. cudaFuncSetAttribute is preferred because the driver may choose a different configuration if required to execute the function or to avoid thrashing.
Kernels relying on shared memory allocations over 48 KB per block are architecture-specific. As such, they must use dynamic shared memory rather than statically-sized arrays and require an explicit opt-in using cudaFuncSetAttribute as follows.
```cpp
// Device code
__global__ void MyKernel(...)
{
    extern __shared__ float buffer[];
    ...
}

// Host code
int maxbytes = 98304; // 96 KB
cudaFuncSetAttribute(MyKernel, cudaFuncAttributeMaxDynamicSharedMemorySize, maxbytes);
MyKernel<<<gridDim, blockDim, maxbytes>>>(...);
```
3.3. The CUDA Driver API

Previous sections of this guide have covered the CUDA runtime. As mentioned in CUDA Runtime API and CUDA Driver API, the CUDA runtime is written on top of the lower-level CUDA driver API. This section covers some of the differences between the CUDA runtime and the driver APIs, as well as how to intermix them. Most applications can operate at full performance without ever needing to interact with the CUDA driver API. However, new interfaces are sometimes available in the driver API earlier than in the runtime API, and some advanced interfaces, such as Virtual Memory Management, are only exposed in the driver API.

The driver API is implemented in the cuda dynamic library (cuda.dll or cuda.so), which is copied onto the system during the installation of the device driver. All its entry points are prefixed with cu.

It is a handle-based, imperative API: most objects are referenced by opaque handles that may be specified to functions to manipulate the objects.

The objects available in the driver API are summarized in Table 6.
Table 6: Objects Available in the CUDA Driver API

| Object            | Handle      | Description                                                                                                            |
| ----------------- | ----------- | ---------------------------------------------------------------------------------------------------------------------- |
| Device            | CUdevice    | CUDA-enabled device                                                                                                    |
| Context           | CUcontext   | Roughly equivalent to a CPU process                                                                                    |
| Module            | CUmodule    | Roughly equivalent to a dynamic library                                                                                |
| Function          | CUfunction  | Kernel                                                                                                                 |
| Heap memory       | CUdeviceptr | Pointer to device memory                                                                                               |
| CUDA array        | CUarray     | Opaque container for one-dimensional or two-dimensional data on the device, readable via texture or surface references |
| Texture object    | CUtexref    | Object that describes how to interpret texture memory data                                                             |
| Surface reference | CUsurfref   | Object that describes how to read or write CUDA arrays                                                                 |
| Stream            | CUstream    | Object that describes a CUDA stream                                                                                    |
| Event             | CUevent     | Object that describes a CUDA event                                                                                     |
The driver API must be initialized with cuInit() before any function from the driver API is called. A CUDA context must then be created that is attached to a specific device and made current to the calling host thread, as detailed in Context.

Within a CUDA context, kernels are explicitly loaded as PTX or binary objects by the host code, as described in Module. Kernels written in C++ must therefore be compiled separately into PTX or binary objects. Kernels are launched using API entry points, as described in Kernel Execution.

Any application that wants to run on future device architectures must load PTX, not binary code. This is because binary code is architecture-specific and therefore incompatible with future architectures, whereas PTX code is compiled to binary code at load time by the device driver.
Here is the host code of the sample from Kernels written using the driver API:
```cpp
int main()
{
    int N = ...;
    size_t size = N * sizeof(float);

    // Allocate input vectors h_A and h_B in host memory
    float* h_A = (float*)malloc(size);
    float* h_B = (float*)malloc(size);

    // Initialize input vectors
    ...

    // Initialize
    cuInit(0);

    // Get number of devices supporting CUDA
    int deviceCount = 0;
    cuDeviceGetCount(&deviceCount);
    if (deviceCount == 0) {
        printf("There is no device supporting CUDA.\n");
        exit(0);
    }

    // Get handle for device 0
    CUdevice cuDevice;
    cuDeviceGet(&cuDevice, 0);

    // Create context
    CUcontext cuContext;
    cuCtxCreate(&cuContext, 0, cuDevice);

    // Create module from binary file
    CUmodule cuModule;
    cuModuleLoad(&cuModule, "VecAdd.ptx");

    // Allocate vectors in device memory
    CUdeviceptr d_A;
    cuMemAlloc(&d_A, size);
    CUdeviceptr d_B;
    cuMemAlloc(&d_B, size);
    CUdeviceptr d_C;
    cuMemAlloc(&d_C, size);

    // Copy vectors from host memory to device memory
    cuMemcpyHtoD(d_A, h_A, size);
    cuMemcpyHtoD(d_B, h_B, size);

    // Get function handle from module
    CUfunction vecAdd;
    cuModuleGetFunction(&vecAdd, cuModule, "VecAdd");

    // Invoke kernel
    int threadsPerBlock = 256;
    int blocksPerGrid =
            (N + threadsPerBlock - 1) / threadsPerBlock;
    void* args[] = { &d_A, &d_B, &d_C, &N };
    cuLaunchKernel(vecAdd,
                   blocksPerGrid, 1, 1, threadsPerBlock, 1, 1,
                   0, 0, args, 0);

    ...
}
```
Full code can be found in the vectorAddDrv CUDA sample.
3.3.1. Context

A CUDA context is analogous to a CPU process. All resources and actions performed within the driver API are encapsulated inside a CUDA context, and the system automatically cleans up these resources when the context is destroyed. Besides objects such as modules and texture or surface references, each context has its own distinct address space. As a result, CUdeviceptr values from different contexts reference different memory locations.
A host thread may have only one device context current at a time. When a context is created with cuCtxCreate(), it is made current to the calling host thread. CUDA functions that operate in a context (most functions that do not involve device enumeration or context management) will return CUDA_ERROR_INVALID_CONTEXT if a valid context is not current to the thread.

Each host thread has a stack of current contexts. cuCtxCreate() pushes the new context onto the top of the stack. cuCtxPopCurrent() may be called to detach the context from the host thread. The context is then "floating" and may be pushed as the current context for any host thread. cuCtxPopCurrent() also restores the previous current context, if any.
A usage count is also maintained for each context. cuCtxCreate() creates a context with a usage count of 1. cuCtxAttach() increments the usage count and cuCtxDetach() decrements it. A context is destroyed when the usage count goes to 0 when calling cuCtxDetach() or cuCtxDestroy().
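The stack behavior described above can be modeled with a small host-side sketch (a toy model of the semantics, not the real driver):

```cpp
#include <vector>

// Toy model of one host thread's context stack:
// create() behaves like cuCtxCreate (new context becomes current),
// push()/pop() like cuCtxPushCurrent/cuCtxPopCurrent.
struct ContextStackModel {
    std::vector<int> stack;  // context handles; back() is the current context

    void create(int ctx) { stack.push_back(ctx); }  // new context is pushed
    void push(int ctx) { stack.push_back(ctx); }    // attach a floating context
    int pop() {                                     // detach the current context...
        int ctx = stack.back();
        stack.pop_back();  // ...the previous context (if any) is current again
        return ctx;
    }
    int current() const { return stack.empty() ? -1 : stack.back(); }
};
```

The popped context is "floating" and could be push()ed on any other thread's stack, which is exactly the hand-off the library pattern below relies on.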
The driver API is interoperable with the runtime, and it is possible to access the primary context (see Runtime Initialization) managed by the runtime from the driver API via cuDevicePrimaryCtxRetain().

The usage count facilitates interoperability between third-party authored code operating in the same context. For example, if three libraries are loaded to use the same context, each library would call cuCtxAttach() to increment the usage count and cuCtxDetach() to decrement the usage count when the library is done using the context. For most libraries, it is expected that the application will have created a context before loading or initializing the library; that way, the application can create the context using its own heuristics, and the library simply operates on the context handed to it. Libraries that wish to create their own contexts - unbeknownst to their API clients who may or may not have created contexts of their own - would use cuCtxPushCurrent() and cuCtxPopCurrent() as illustrated in the following figure.
Figure 20: Library Context Management
| | 3.3.2. Module | | | | | | | |
| | ------------- | --- | --- | --- | --- | --- | | |
| Modules are dynamically loadable packages of device code and data, akin to DLLs in Windows, that | |
| areoutputbynvcc(seeCompilationwithNVCC).Thenamesforallsymbols,includingfunctions,global | |
| variables,andtextureorsurfacereferences,aremaintainedatmodulescopesothatmoduleswritten | |
| byindependentthirdpartiesmayinteroperateinthesameCUDAcontext. | |
This code sample loads a module and retrieves a handle to some kernel:

```cpp
CUmodule cuModule;
cuModuleLoad(&cuModule, "myModule.ptx");
CUfunction myKernel;
cuModuleGetFunction(&myKernel, cuModule, "MyKernel");
```
This code sample compiles and loads a new module from PTX code and parses compilation errors:

```cpp
#define BUFFER_SIZE 8192
CUmodule cuModule;
CUjit_option options[3];
void* values[3];
char* PTXCode = "some PTX code";
char error_log[BUFFER_SIZE];
int err;
options[0] = CU_JIT_ERROR_LOG_BUFFER;
values[0] = (void*)error_log;
options[1] = CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES;
values[1] = (void*)BUFFER_SIZE;
options[2] = CU_JIT_TARGET_FROM_CUCONTEXT;
values[2] = 0;
err = cuModuleLoadDataEx(&cuModule, PTXCode, 3, options, values);
if (err != CUDA_SUCCESS)
    printf("Link error:\n%s\n", error_log);
```
This code sample compiles, links, and loads a new module from multiple PTX codes and parses link and compilation errors:

```cpp
#define BUFFER_SIZE 8192
CUmodule cuModule;
CUjit_option options[6];
void* values[6];
float walltime;
char error_log[BUFFER_SIZE], info_log[BUFFER_SIZE];
char* PTXCode0 = "some PTX code";
char* PTXCode1 = "some other PTX code";
CUlinkState linkState;
int err;
void* cubin;
size_t cubinSize;
options[0] = CU_JIT_WALL_TIME;
values[0] = (void*)&walltime;
options[1] = CU_JIT_INFO_LOG_BUFFER;
values[1] = (void*)info_log;
options[2] = CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES;
values[2] = (void*)BUFFER_SIZE;
options[3] = CU_JIT_ERROR_LOG_BUFFER;
values[3] = (void*)error_log;
options[4] = CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES;
values[4] = (void*)BUFFER_SIZE;
options[5] = CU_JIT_LOG_VERBOSE;
values[5] = (void*)1;
cuLinkCreate(6, options, values, &linkState);
err = cuLinkAddData(linkState, CU_JIT_INPUT_PTX,
                    (void*)PTXCode0, strlen(PTXCode0) + 1, 0, 0, 0, 0);
if (err != CUDA_SUCCESS)
    printf("Link error:\n%s\n", error_log);
err = cuLinkAddData(linkState, CU_JIT_INPUT_PTX,
                    (void*)PTXCode1, strlen(PTXCode1) + 1, 0, 0, 0, 0);
if (err != CUDA_SUCCESS)
    printf("Link error:\n%s\n", error_log);
cuLinkComplete(linkState, &cubin, &cubinSize);
printf("Link completed in %fms. Linker Output:\n%s\n", walltime, info_log);
cuModuleLoadData(&cuModule, cubin);
cuLinkDestroy(linkState);
```
It's possible to accelerate some parts of the module linking/loading process by using multiple threads, including when loading a cubin. This code sample uses CU_JIT_BINARY_LOADER_THREAD_COUNT to speed up module loading.

```cpp
#define BUFFER_SIZE 8192
CUmodule cuModule;
CUjit_option options[3];
void* values[3];
char* cubinCode = "some cubin code";
char error_log[BUFFER_SIZE];
int err;
options[0] = CU_JIT_ERROR_LOG_BUFFER;
values[0] = (void*)error_log;
options[1] = CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES;
values[1] = (void*)BUFFER_SIZE;
options[2] = CU_JIT_BINARY_LOADER_THREAD_COUNT;
values[2] = 0; // Use as many threads as CPUs on the machine
err = cuModuleLoadDataEx(&cuModule, cubinCode, 3, options, values);
if (err != CUDA_SUCCESS)
    printf("Link error:\n%s\n", error_log);
```
Full code can be found in the ptxjit CUDA sample.
3.3.3. Kernel Execution

cuLaunchKernel() launches a kernel with a given execution configuration.

Parameters are passed either as an array of pointers (next-to-last parameter of cuLaunchKernel()), where the nth pointer corresponds to the nth parameter and points to a region of memory from which the parameter is copied, or as one of the extra options (last parameter of cuLaunchKernel()).

When parameters are passed as an extra option (the CU_LAUNCH_PARAM_BUFFER_POINTER option), they are passed as a pointer to a single buffer where parameters are assumed to be properly offset with respect to each other by matching the alignment requirement for each parameter type in device code.
Alignment requirements in device code for the built-in vector types are listed in Table 42. For all other basic types, the alignment requirement in device code matches the alignment requirement in host code and can therefore be obtained using __alignof(). The only exception is when the host compiler aligns double and long long (and long on a 64-bit system) on a one-word boundary instead of a two-word boundary (for example, using gcc's compilation flag -mno-align-double), since in device code these types are always aligned on a two-word boundary.

CUdeviceptr is an integer, but represents a pointer, so its alignment requirement is __alignof(void*).
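These host-side alignment facts can be checked with standard C++. The CUdeviceptr stand-in below is an assumption for illustration (on 64-bit platforms the real typedef is an unsigned 64-bit integer); alignof is the standard spelling of __alignof.

```cpp
#include <cstdint>

// Stand-in for CUdeviceptr on a 64-bit platform (assumption for illustration).
using DevicePtrModel = unsigned long long;

// On a typical 64-bit host ABI (no -mno-align-double), double and long long
// already have the two-word (8-byte) alignment that device code requires.
static_assert(alignof(double) == 8, "double is 8-byte aligned on this host");
static_assert(alignof(long long) == 8, "long long is 8-byte aligned on this host");

// A CUdeviceptr-sized integer aligns like a host pointer.
static_assert(alignof(DevicePtrModel) == alignof(void*),
              "device pointer values use pointer alignment");
```

When these static assertions hold, __alignof() values measured on the host can be used directly when packing the parameter buffer.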
The following code sample uses a macro (ALIGN_UP()) to adjust the offset of each parameter to meet its alignment requirement, and another macro (ADD_TO_PARAM_BUFFER()) to add each parameter to the parameter buffer passed to the CU_LAUNCH_PARAM_BUFFER_POINTER option.
```cpp
#define ALIGN_UP(offset, alignment) \
    (offset) = ((offset) + (alignment) - 1) & ~((alignment) - 1)

char paramBuffer[1024];
size_t paramBufferSize = 0;

#define ADD_TO_PARAM_BUFFER(value, alignment)                   \
    do {                                                        \
        paramBufferSize = ALIGN_UP(paramBufferSize, alignment); \
        memcpy(paramBuffer + paramBufferSize,                   \
               &(value), sizeof(value));                        \
        paramBufferSize += sizeof(value);                       \
    } while (0)

int i;
ADD_TO_PARAM_BUFFER(i, __alignof(i));
float4 f4;
ADD_TO_PARAM_BUFFER(f4, 16); // float4's alignment is 16
char c;
ADD_TO_PARAM_BUFFER(c, __alignof(c));
float f;
ADD_TO_PARAM_BUFFER(f, __alignof(f));
CUdeviceptr devPtr;
ADD_TO_PARAM_BUFFER(devPtr, __alignof(devPtr));
float2 f2;
ADD_TO_PARAM_BUFFER(f2, 8); // float2's alignment is 8

void* extra[] = {
    CU_LAUNCH_PARAM_BUFFER_POINTER, paramBuffer,
    CU_LAUNCH_PARAM_BUFFER_SIZE,    &paramBufferSize,
    CU_LAUNCH_PARAM_END
};
cuLaunchKernel(cuFunction,
               gridWidth, gridHeight, gridDepth,
               blockWidth, blockHeight, blockDepth,
               0, 0, 0, extra);
```
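To see what this packing produces, the plain C++ sketch below (illustrative only; the sizes assumed for float4, float2, and CUdeviceptr are for a 64-bit platform) reproduces the offset bookkeeping of the snippet above:

```cpp
#include <cstddef>

// Same rounding as the ALIGN_UP macro above.
size_t align_up(size_t offset, size_t alignment) {
    return (offset + alignment - 1) & ~(alignment - 1);
}

// Records where a parameter of the given size/alignment lands, then advances.
size_t place(size_t& offset, size_t size, size_t alignment) {
    offset = align_up(offset, alignment);
    size_t at = offset;
    offset += size;
    return at;
}

struct Offsets { size_t i, f4, c, f, devPtr, f2, total; };

// Mirrors the parameter sequence above:
// int, float4, char, float, CUdeviceptr, float2.
Offsets compute_offsets() {
    size_t off = 0;
    Offsets o{};
    o.i      = place(off, 4, 4);    // int
    o.f4     = place(off, 16, 16);  // float4: 16-byte aligned
    o.c      = place(off, 1, 1);    // char
    o.f      = place(off, 4, 4);    // float
    o.devPtr = place(off, 8, 8);    // CUdeviceptr: __alignof(void*)
    o.f2     = place(off, 8, 8);    // float2: 8-byte aligned
    o.total  = off;
    return o;
}
```

The char at offset 32 forces 3 bytes of padding before the float at 36, and the buffer ends up 56 bytes long.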
The alignment requirement of a structure is equal to the maximum of the alignment requirements of its fields. The alignment requirement of a structure that contains built-in vector types, CUdeviceptr, or non-aligned double and long long might therefore differ between device code and host code. Such a structure might also be padded differently. The following structure, for example, is not padded at all in host code, but it is padded in device code with 12 bytes after field f, since the alignment requirement for field f4 is 16.
```cpp
typedef struct {
    float f;
    float4 f4;
} myStruct;
```
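The 12-byte padding can be reproduced on the host by forcing the device alignment onto a float4 stand-in (Float4Model below is a hypothetical host-side model, not the CUDA type):

```cpp
#include <cstddef>

// Host-side model of the device float4: 16 bytes, 16-byte aligned.
struct alignas(16) Float4Model { float x, y, z, w; };

// With device alignment applied, the layout matches what device code uses:
// 4 bytes for f, 12 bytes of padding, then f4 at offset 16.
struct MyStructDeviceLayout {
    float f;
    Float4Model f4;
};

static_assert(offsetof(MyStructDeviceLayout, f4) == 16,
              "12 bytes of padding inserted after f");
static_assert(sizeof(MyStructDeviceLayout) == 32,
              "total size rounded up to the 16-byte alignment");
```

Without the alignas, a host compiler treating float4 as four plain floats would place f4 at offset 4 and make the struct 20 bytes, which is the host/device mismatch the text warns about.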
3.3.4. Interoperability between Runtime and Driver APIs

An application can mix runtime API code with driver API code.

If a context is created and made current via the driver API, subsequent runtime calls will use this context instead of creating a new one.

If the runtime is initialized, cuCtxGetCurrent() can be used to retrieve the context created during initialization. This context can be used by subsequent driver API calls.

The implicitly created context from the runtime is called the primary context (see Runtime Initialization). It can be managed from the driver API with the Primary Context Management functions.
Device memory can be allocated and freed using either API. CUdeviceptr can be cast to regular pointers and vice-versa:

```cpp
CUdeviceptr devPtr;
float* d_data;

// Allocation using driver API
cuMemAlloc(&devPtr, size);
d_data = (float*)devPtr;

// Allocation using runtime API
cudaMalloc(&d_data, size);
devPtr = (CUdeviceptr)d_data;
```
In particular, this means that applications written using the driver API can invoke libraries written using the runtime API (such as cuFFT, cuBLAS, …).
All functions from the device and version management sections of the reference manual can be used interchangeably.
| | 3.4. | Programming | Systems | with | Multiple | GPUs | | |
| | ---- | ----------- | ------- | ---- | -------- | ---- | | |
Multi-GPU programming allows an application to address problem sizes and achieve performance levels beyond what is possible with a single GPU by exploiting the larger aggregate arithmetic performance, memory capacity, and memory bandwidth provided by multi-GPU systems.
CUDA enables multi-GPU programming through host APIs, driver infrastructure, and supporting GPU hardware technologies:
▶ Host thread CUDA context management
▶ Unified memory addressing for all processors in the system
▶ Peer-to-peer bulk memory transfers between GPUs
▶ Fine-grained peer-to-peer GPU load/store memory access
▶ Higher level abstractions and supporting system software such as CUDA interprocess communication, parallel reductions using NCCL, and communication using NVLink and/or GPU-Direct RDMA with APIs such as NVSHMEM and MPI
At the most basic level, multi-GPU programming requires the application to manage multiple active CUDA contexts concurrently, distribute data to the GPUs, launch kernels on the GPUs to complete their work, and to communicate or collect the results so that they can be acted upon by the application. The details of how this is done differ depending on the most effective mapping of an application's algorithms, available parallelism, and existing code structure to a suitable multi-GPU programming approach. Some of the most common multi-GPU programming approaches include:
▶ A single host thread driving multiple GPUs
▶ Multiple host threads, each driving their own GPU
▶ Multiple single-threaded host processes, each driving their own GPU
▶ Multiple host processes containing multiple threads, each driving their own GPU
▶ Multi-node NVLink-connected clusters, with GPUs driven by threads and processes running within multiple operating system instances across the cluster nodes
GPUs can communicate with each other through memory transfers and peer accesses between device memories, covering each of the multi-device work distribution approaches listed above. High performance, low-latency GPU communications are supported by querying for and enabling the use of peer-to-peer GPU memory access, and leveraging NVLink to achieve high bandwidth transfers and finer-grained load/store operations between devices.
CUDA unified virtual addressing permits communication between multiple GPUs within the same host process with minimal additional steps to query and enable the use of high performance peer-to-peer memory access and transfers, e.g., via NVLink.
Communication between multiple GPUs managed by different host processes is supported through the use of interprocess communication (IPC) and Virtual Memory Management (VMM) APIs. An introduction to high level IPC concepts and intra-node CUDA IPC APIs is given in the Interprocess Communication section. Advanced Virtual Memory Management (VMM) APIs support both intra-node and multi-node IPC, are usable on both Linux and Windows operating systems, and allow per-allocation granularity control over IPC sharing of memory buffers as described in Virtual Memory Management.
CUDA itself provides the APIs needed to implement collective operations within a group of GPUs, potentially including the host, but it does not provide high level multi-GPU collective APIs itself. Multi-GPU collectives are provided by higher abstraction CUDA communication libraries such as NCCL and NVSHMEM.
| 3.4.1. Multi-Device Context and Execution Management | |
The first steps required for an application to use multiple GPUs are to enumerate the available GPU devices, select among the available devices as appropriate based on their hardware properties, CPU affinity, and connectivity to peers, and to create CUDA contexts for each device that the application will use.
3.4.1.1 Device Enumeration
The following code sample shows how to query the number of CUDA-enabled devices, enumerate each of the devices, and query their properties.

int deviceCount;
cudaGetDeviceCount(&deviceCount);
int device;
for (device = 0; device < deviceCount; ++device) {
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, device);
    printf("Device %d has compute capability %d.%d.\n",
           device, deviceProp.major, deviceProp.minor);
}
3.4.1.2 Device Selection
A host thread can set the device it is currently operating on at any time by calling cudaSetDevice(). Device memory allocations and kernel launches are made on the current device; streams and events are created in association with the currently set device. Until a call to cudaSetDevice() is made by the host thread, the current device defaults to device 0.
The following code sample illustrates how setting the current device affects subsequent memory allocation and kernel execution operations.
size_t size = 1024 * sizeof(float);
cudaSetDevice(0);            // Set device 0 as current
float* p0;
cudaMalloc(&p0, size);       // Allocate memory on device 0
MyKernel<<<1000, 128>>>(p0); // Launch kernel on device 0
cudaSetDevice(1);            // Set device 1 as current
float* p1;
cudaMalloc(&p1, size);       // Allocate memory on device 1
MyKernel<<<1000, 128>>>(p1); // Launch kernel on device 1
3.4.1.3 Multi-Device Stream, Event, and Memory Copy Behavior
A kernel launch will fail if it is issued to a stream that is not associated to the current device as illustrated in the following code sample.

cudaSetDevice(0);               // Set device 0 as current
cudaStream_t s0;
cudaStreamCreate(&s0);          // Create stream s0 on device 0
MyKernel<<<100, 64, 0, s0>>>(); // Launch kernel on device 0 in s0
cudaSetDevice(1);               // Set device 1 as current
cudaStream_t s1;
cudaStreamCreate(&s1);          // Create stream s1 on device 1
MyKernel<<<100, 64, 0, s1>>>(); // Launch kernel on device 1 in s1

// This kernel launch will fail, since stream s0 is not associated to device 1:
MyKernel<<<100, 64, 0, s0>>>(); // Launch kernel on device 1 in s0
A memory copy will succeed even if it is issued to a stream that is not associated to the current device.
cudaEventRecord() will fail if the input event and input stream are associated to different devices.
cudaEventElapsedTime() will fail if the two input events are associated to different devices.
cudaEventSynchronize() and cudaEventQuery() will succeed even if the input event is associated to a device that is different from the current device.
cudaStreamWaitEvent() will succeed even if the input stream and input event are associated to different devices. cudaStreamWaitEvent() can therefore be used to synchronize multiple devices with each other.
Each device has its own default stream, so commands issued to the default stream of a device may execute out of order or concurrently with respect to commands issued to the default stream of any other device.
| | 3.4.2. | Multi-Device | | Peer-to-Peer | | Transfers | | and | Memory | | |
| | ------ | ------------ | --- | ------------ | --- | --------- | --- | --- | ------ | | |
| Access | |
3.4.2.1 Peer-to-Peer Memory Transfers
CUDA can perform memory transfers between devices and will take advantage of dedicated copy engines and NVLink hardware to maximize performance when peer-to-peer memory access is possible. cudaMemcpy can be used with the copy type cudaMemcpyDeviceToDevice or cudaMemcpyDefault. Otherwise, copies must be performed using cudaMemcpyPeer(), cudaMemcpyPeerAsync(), cudaMemcpy3DPeer(), or cudaMemcpy3DPeerAsync() as illustrated in the following code sample.
cudaSetDevice(0);                   // Set device 0 as current
float* p0;
size_t size = 1024 * sizeof(float);
cudaMalloc(&p0, size);              // Allocate memory on device 0
cudaSetDevice(1);                   // Set device 1 as current
float* p1;
cudaMalloc(&p1, size);              // Allocate memory on device 1
cudaSetDevice(0);                   // Set device 0 as current
MyKernel<<<1000, 128>>>(p0);        // Launch kernel on device 0
cudaSetDevice(1);                   // Set device 1 as current
cudaMemcpyPeer(p1, 1, p0, 0, size); // Copy p0 to p1
MyKernel<<<1000, 128>>>(p1);        // Launch kernel on device 1
A copy (in the implicit NULL stream) between the memories of two different devices:
▶ does not start until all commands previously issued to either device have completed and
▶ runs to completion before any commands (see Asynchronous Execution) issued after the copy to either device can start.
Consistent with the normal behavior of streams, an asynchronous copy between the memories of two devices may overlap with copies or kernels in another stream.
If peer-to-peer access is enabled between two devices, e.g., as described in Peer-to-Peer Memory Access, peer-to-peer memory copies between these two devices no longer need to be staged through the host and are therefore faster.
| | 128 | | | | | | | Chapter3. | AdvancedCUDA | | |
| | --- | --- | --- | --- | --- | --- | --- | --------- | ------------ | | |
| CUDAProgrammingGuide,Release13.1 | |
3.4.2.2 Peer-to-Peer Memory Access
Depending on the system properties, specifically the PCIe and/or NVLink topology, devices are able to address each other's memory (i.e., a kernel executing on one device can dereference a pointer to the memory of the other device). Peer-to-peer memory access is supported between two devices if cudaDeviceCanAccessPeer() returns true for the specified devices.
Peer-to-peer memory access must be enabled between two devices by calling cudaDeviceEnablePeerAccess() as illustrated in the following code sample. On non-NVSwitch enabled systems, each device can support a system-wide maximum of eight peer connections.
A unified virtual address space is used for both devices (see Unified Virtual Address Space), so the same pointer can be used to address memory from both devices as shown in the code sample below.
cudaSetDevice(0);                 // Set device 0 as current
float* p0;
size_t size = 1024 * sizeof(float);
cudaMalloc(&p0, size);            // Allocate memory on device 0
MyKernel<<<1000, 128>>>(p0);      // Launch kernel on device 0
cudaSetDevice(1);                 // Set device 1 as current
cudaDeviceEnablePeerAccess(0, 0); // Enable peer-to-peer access
                                  // with device 0

// Launch kernel on device 1
// This kernel launch can access memory on device 0 at address p0
MyKernel<<<1000, 128>>>(p0);
Note
The use of cudaDeviceEnablePeerAccess() to enable peer memory access operates globally on all previous and subsequent GPU memory allocations on the peer device. Enabling peer access to a device via cudaDeviceEnablePeerAccess() adds runtime cost to device memory allocation operations on that peer due to the need to make the allocations immediately accessible to the current device and any other peers that also have access, adding multiplicative overhead that scales with the number of peer devices.
A more scalable alternative to enabling peer memory access for all device memory allocations is to make use of CUDA Virtual Memory Management APIs to explicitly allocate peer-accessible memory regions only as-needed, at allocation time. By requesting peer-accessibility explicitly during memory allocation, the runtime cost of memory allocations is unaffected for allocations not accessible to peers, and peer-accessible data structures are correctly scoped for improved software debugging and reliability (see Virtual Memory Management).
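A minimal sketch of that VMM approach, using driver API calls to create one allocation backed by device 0 that both devices 0 and 1 may access (error checking omitted; sizes and device IDs are illustrative):

```cpp
// Sketch: per-allocation peer accessibility via Virtual Memory Management.
CUmemAllocationProp prop = {};
prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
prop.location.id = 0;                      // physical memory lives on device 0

size_t granularity;
cuMemGetAllocationGranularity(&granularity, &prop,
                              CU_MEM_ALLOC_GRANULARITY_MINIMUM);
size_t size = granularity;                 // one granule, for illustration

CUmemGenericAllocationHandle handle;
cuMemCreate(&handle, size, &prop, 0);      // physical allocation

CUdeviceptr ptr;
cuMemAddressReserve(&ptr, size, 0, 0, 0);  // reserve a virtual range
cuMemMap(ptr, size, 0, handle, 0);         // map physical into virtual

// Grant access to exactly the devices that need it - nothing global.
CUmemAccessDesc access[2] = {};
for (int dev = 0; dev < 2; ++dev) {
    access[dev].location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    access[dev].location.id = dev;
    access[dev].flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
}
cuMemSetAccess(ptr, size, access, 2);
```

Other allocations in the process remain unaffected, which is the scaling advantage described above.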
3.4.2.3 Peer-to-Peer Memory Consistency
Synchronization operations must be used to enforce the ordering and correctness of memory accesses by concurrently executing threads in grids distributed across multiple devices. Threads synchronizing across devices operate at the thread_scope_system synchronization scope. Similarly, memory operations fall within the thread_scope_system memory synchronization domain.
CUDA atomic functions (see Atomic Functions) can perform read-modify-write operations on an object in peer device memory when only a single GPU is accessing that object. The requirements and limitations for peer atomicity are described in the CUDA memory model atomicity requirements discussion.
3.4.2.4 Multi-Device Managed Memory
Managed memory can be used on multi-GPU systems with peer-to-peer support. The detailed requirements for concurrent multi-device managed memory access and APIs for GPU-exclusive access to managed memory are described in Multi-GPU.
3.4.2.5 Host IOMMU Hardware, PCI Access Control Services, and VMs
On Linux specifically, CUDA and the display driver do not support IOMMU-enabled bare-metal PCIe peer-to-peer memory transfer. However, CUDA and the display driver do support IOMMU via virtual machine pass through. The IOMMU must be disabled when running Linux on a bare metal system to prevent silent device memory corruption. Conversely, the IOMMU should be enabled and the VFIO driver be used for PCIe pass through for virtual machines.
On Windows the IOMMU limitation above does not exist.
See also Allocating DMA Buffers on 64-bit Platforms.
Additionally, PCI Access Control Services (ACS) can be enabled on systems that support IOMMU. The PCI ACS feature redirects all PCI point-to-point traffic through the CPU root complex, which can cause significant performance loss due to the reduction in overall bisection bandwidth.
| 3.5. A Tour of CUDA Features | |
Sections 1-3 of this programming guide have introduced CUDA and GPU programming, covering foundational topics both conceptually and in simple code examples. The sections describing specific CUDA features in part 4 of this guide assume knowledge of the concepts covered in sections 1-3 of this guide.
CUDA has many features which apply to different problems. Not all of them will be applicable to every use case. This chapter serves to introduce each of these features and describe its intended use and the problems it may help solve. Features are coarsely sorted into categories by the type of problem they are intended to solve. Some features, such as CUDA graphs, could fit into more than one category.
Section 4 covers these CUDA features in more complete detail.
| 3.5.1. Improving Kernel Performance | |
The features outlined in this section are all intended to aid kernel developers in maximizing the performance of their kernels.
3.5.1.1 Asynchronous Barriers
Asynchronous barriers were introduced in Section 3.2.4.2 and allow for more nuanced control over synchronization between threads. Asynchronous barriers separate the arrival and the wait of a barrier. This allows applications to perform work that does not depend on the barrier while waiting for other threads to arrive. Asynchronous barriers can be specified for different thread scopes. Full details of asynchronous barriers are found in Section 4.9.
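A minimal sketch of the split arrive/wait pattern, using the cuda::barrier type from libcu++ (assuming compute capability 7.0 or later; the work between arrive and wait is left as comments):

```cpp
#include <cuda/barrier>

// Sketch: split arrive/wait within one thread block.
__global__ void split_sync_kernel() {
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;
    if (threadIdx.x == 0) init(&bar, blockDim.x); // one thread initializes
    __syncthreads();

    // ... produce data that other threads will consume ...
    auto token = bar.arrive();     // signal arrival, but do not block yet

    // ... do independent work that does not depend on the barrier ...

    bar.wait(std::move(token));    // now block until all threads have arrived
    // ... safely consume data produced before the arrivals ...
}
```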
3.5.1.2 Asynchronous Data Copies and the Tensor Memory Accelerator (TMA)
Asynchronous data copies in the context of CUDA kernel code refer to the ability to move data between shared memory and GPU DRAM while still carrying out computations. This should not be confused with asynchronous memory copies between the CPU and GPU. This feature makes use of asynchronous barriers. Section 4.11 covers the use of asynchronous copies in detail.
| 3.5.1.3 Pipelines | |
Pipelines are a mechanism for staging work and coordinating multi-buffer producer-consumer patterns, commonly used to overlap compute with asynchronous data copies. Section 4.10 has details and examples of using pipelines in CUDA.
3.5.1.4 Work Stealing with Cluster Launch Control
Work stealing is a technique for maintaining utilization in uneven workloads where workers that have completed their work can 'steal' tasks from other workers. Cluster launch control, a feature introduced in compute capability 10.0 (Blackwell), gives kernels direct control over in-flight block scheduling so they can adapt to uneven workloads in real time. A thread block can cancel the launch of another thread block or cluster that has not yet started, claim its index, and immediately begin executing the stolen work. This work-stealing flow keeps SMs busy and cuts idle time under irregular data or runtime variation, delivering finer-grained load balancing without relying on the hardware scheduler alone.
Section 4.12 provides details on how to use this feature.
| 3.5.2. Improving Latencies | |
The features outlined in this section share a common theme of aiming to reduce some type of latency, though the type of latency being addressed differs between the different features. By and large they are focused on latencies at the kernel launch level or higher. GPU memory access latency within a kernel is not one of the latencies considered here.
3.5.2.1 Green Contexts
Green contexts, also called execution contexts, are a CUDA feature which enables a program to create CUDA contexts which will execute work only on a subset of the SMs of a GPU. By default, the thread blocks of a kernel launch are dispatched to any SM within the GPU which can fulfill the resource requirements of the kernel. There are a large number of factors which can affect which SMs can execute a thread block, including but not necessarily limited to: shared memory use, register use, use of clusters, and total number of threads in the thread block.
Execution contexts allow a kernel to be launched into a specially created context which further limits the number of SMs available to execute the kernel. Importantly, when a program creates a green context which uses some set of SMs, other contexts on the GPU will not schedule thread blocks onto the SMs allocated to the green context. This includes the primary context, which is the default context used by the CUDA runtime. This allows these SMs to be reserved for workloads which are high priority or latency-sensitive.
Section 4.6 gives full details on the use of green contexts. Green contexts are available in the CUDA runtime in CUDA 13.1 and later.
3.5.2.2 Stream-Ordered Memory Allocation
The stream-ordered memory allocator allows programs to sequence allocation and freeing of GPU memory into a CUDA stream. Unlike cudaMalloc and cudaFree which execute immediately, cudaMallocAsync and cudaFreeAsync insert a memory allocation or free operation into a CUDA stream. Section 4.3 covers all the details of these APIs.
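As a sketch, assuming a stream s, a kernel MyKernel, and an element count N already exist:

```cpp
// Sketch: allocation and free sequenced into stream s.
float* d_buf;
cudaMallocAsync((void**)&d_buf, N * sizeof(float), s); // allocated in stream order
MyKernel<<<1000, 128, 0, s>>>(d_buf);                  // same stream: safe to use d_buf
cudaFreeAsync(d_buf, s);                               // freed once prior work in s completes
cudaStreamSynchronize(s);                              // host waits for the whole sequence
```

The host thread never blocks on the allocation or free themselves; they complete in stream order like any other stream operation.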
3.5.2.3 CUDA Graphs
CUDA graphs enable an application to specify a sequence of CUDA operations such as kernel launches or memory copies and the dependencies between these operations so that they can be executed efficiently on the GPU. Similar behavior can be attained by using CUDA streams, and indeed one of the mechanisms for creating a graph is called stream capture, which enables the operations on a stream to be recorded into a CUDA graph. Graphs can also be created using the CUDA graphs API.
Once a graph has been created, it can be instantiated and executed many times. This is useful for specifying workloads that will be repeated. Graphs offer some performance benefits in reducing CPU launch costs associated with invoking CUDA operations as well as enabling optimizations only available when the whole workload is specified in advance.
Section 4.2 describes and demonstrates how to use CUDA graphs.
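A minimal stream-capture sketch, assuming a stream s and kernels KernelA and KernelB already exist:

```cpp
// Sketch: record two kernel launches into a graph, then replay it.
cudaGraph_t graph;
cudaGraphExec_t graphExec;

cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
KernelA<<<1000, 128, 0, s>>>();     // captured, not executed
KernelB<<<1000, 128, 0, s>>>();     // captured, not executed
cudaStreamEndCapture(s, &graph);

cudaGraphInstantiate(&graphExec, graph, 0);
for (int i = 0; i < 100; ++i)
    cudaGraphLaunch(graphExec, s);  // replay with low per-launch CPU overhead

cudaGraphExecDestroy(graphExec);
cudaGraphDestroy(graph);
```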
3.5.2.4 Programmatic Dependent Launch
Programmatic dependent launch is a CUDA feature which allows a dependent kernel, i.e. a kernel which depends on the output of a prior kernel, to begin execution before the primary kernel on which it depends has completed. The dependent kernel can execute setup code and unrelated work up until it requires data from the primary kernel and block there. The primary kernel can signal when the data required by the dependent kernel is ready, which will release the dependent kernel to continue executing. This enables some overlap between the kernels which can help keep GPU utilization high while minimizing the latency of the critical data path. Section 4.5 covers programmatic dependent launch.
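A sketch of this flow, assuming compute capability 9.0 or later and a stream s; the kernels and their contents are placeholders:

```cpp
// Sketch: primary signals, dependent overlaps then waits.
__global__ void primary_kernel() {
    // ... produce the data the dependent kernel needs ...
    cudaTriggerProgrammaticLaunchCompletion(); // signal: dependent may proceed
    // ... unrelated trailing work ...
}

__global__ void dependent_kernel() {
    // ... setup work that does not need the primary kernel's output ...
    cudaGridDependencySynchronize();           // block until primary has signaled
    // ... consume the primary kernel's output ...
}

// Host side: launch primary, then opt the dependent launch into overlap.
primary_kernel<<<1000, 128, 0, s>>>();

cudaLaunchAttribute attr = {};
attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
attr.val.programmaticStreamSerializationAllowed = 1;

cudaLaunchConfig_t cfg = {};
cfg.gridDim = 1000; cfg.blockDim = 128; cfg.stream = s;
cfg.attrs = &attr; cfg.numAttrs = 1;
cudaLaunchKernelEx(&cfg, dependent_kernel);
```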
3.5.2.5 Lazy Loading
Lazy loading is a feature which allows control over how the JIT compiler operates at application startup. Applications which have many kernels which need to be JIT compiled from PTX to cubin may experience long startup times if all kernels are JIT compiled as part of application startup. The default behavior is that modules are not compiled until they are needed. This can be changed by the use of environment variables, as detailed in Section 4.7.
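For instance, the CUDA_MODULE_LOADING environment variable selects the loading mode; `my_app` below is a placeholder for the application binary:

```shell
# Defer module loading and JIT compilation until a kernel is first used:
CUDA_MODULE_LOADING=LAZY ./my_app

# Load and compile everything eagerly at initialization:
CUDA_MODULE_LOADING=EAGER ./my_app
```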
| 3.5.3. Functionality Features | |
The features described here share a common trait that they are meant to enable additional capabilities or functionality.
3.5.3.1 Extended GPU Memory
Extended GPU memory (EGM) is a feature available in NVLink-C2C connected systems that enables efficient access to all memory within the system from within a GPU. EGM is covered in detail in Section 4.17.
3.5.3.2 Dynamic Parallelism
CUDA applications most commonly launch kernels from code running on the CPU. It is also possible to create new kernel invocations from a kernel running on the GPU. This feature is referred to as CUDA dynamic parallelism. Section 4.18 covers the details of creating new GPU kernel launches from code running on the GPU.
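A minimal sketch of a parent kernel launching a child grid from the device; both kernels are placeholders, and compilation requires relocatable device code (nvcc -rdc=true) with the device runtime:

```cpp
// Hypothetical child kernel: doubles one element per thread.
__global__ void child_kernel(int* data) {
    data[threadIdx.x] *= 2;
}

// Parent kernel: one thread performs a device-side launch.
__global__ void parent_kernel(int* data) {
    if (threadIdx.x == 0) {
        // Device-side launch into the fire-and-forget stream.
        child_kernel<<<1, 32, 0, cudaStreamFireAndForget>>>(data);
    }
}
```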
| 3.5.4. CUDA Interoperability | |
3.5.4.1 CUDA Interoperability with other APIs
There are other mechanisms than CUDA for running code on GPUs. Computer graphics, the application GPUs were originally built to accelerate, uses its own set of APIs such as Direct3D and Vulkan. Applications may wish to use one of the graphics APIs for 3D rendering while performing computations in CUDA. CUDA provides mechanisms for exchanging data stored on the GPU between the CUDA contexts and the GPU contexts used by the 3D APIs. For example, an application may perform a simulation using CUDA, and then use a 3D API to create visualizations of the results. This is achieved by making some buffers readable and/or writeable from both CUDA and the graphics API.
The same mechanisms which allow sharing of buffers with graphics APIs are also used to share buffers with communications mechanisms which can enable rapid, direct GPU-to-GPU communication within multi-node environments.
Section 4.19 describes how CUDA interoperates with other GPU APIs and how to share data between CUDA and other APIs, providing specific examples for a number of different APIs.
3.5.4.2 Interprocess Communication
For very large computations, it is common to use multiple GPUs together to make use of more memory and more compute resources working together on a problem. Within a single system, or node in cluster computing terminology, multiple GPUs can be used in a single host process. This is described in Section 3.4.
It is also common to use separate host processes spanning either a single computer or multiple computers. When multiple processes are working together, communication between them is known as interprocess communication. CUDA interprocess communication (CUDA IPC) provides mechanisms to share GPU buffers between different processes. Section 4.15 explains and demonstrates how CUDA IPC can be used to coordinate and communicate between different host processes.
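A sketch of the basic CUDA IPC handshake between two processes; how the handle travels between them (pipe, socket, shared file) is up to the application, and error checking is omitted:

```cpp
// Process A: export a device allocation.
float* d_buf;
cudaMalloc(&d_buf, size);
cudaIpcMemHandle_t handle;
cudaIpcGetMemHandle(&handle, d_buf);
// ... send `handle` to process B by any IPC transport ...

// Process B: open the same allocation.
// ... receive `handle` ...
float* d_peer;
cudaIpcOpenMemHandle((void**)&d_peer, handle,
                     cudaIpcMemLazyEnablePeerAccess);
// d_peer now refers to process A's device allocation.
// ... use d_peer ...
cudaIpcCloseMemHandle(d_peer);
```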
| 3.5.5. Fine-Grained Control | |
3.5.5.1 Virtual Memory Management
As mentioned in Section 2.4.1, all GPUs in a system, along with the CPU memory, share a single unified virtual address space. Most applications can use the default memory management provided by CUDA without the need to change its behavior. However, the CUDA driver API provides advanced and detailed controls over the layout of this virtual memory space for those that need it. This is mostly applicable for controlling the behavior of buffers when sharing between GPUs both within and across multiple systems.
Section 4.16 covers the controls offered by the CUDA driver API, how they work and when a developer may find them advantageous.
3.5.5.2 Driver Entry Point Access
Driver entry point access refers to the ability, starting in CUDA 11.3, to retrieve function pointers to the CUDA Driver and CUDA Runtime APIs. It also allows developers to retrieve function pointers for specific variants of driver functions, and to access driver functions from drivers newer than those available in the CUDA toolkit. Section 4.20 covers driver entry point access.
3.5.5.3 Error Log Management
Error log management provides utilities for handling and logging errors from CUDA APIs. Setting a single environment variable CUDA_LOG_FILE enables capturing CUDA errors directly to stderr, stdout, or a file. Error log management also enables applications to register a callback which is triggered when CUDA encounters an error. Section 4.8 provides more details on error log management.
| Chapter 4. CUDA Features | |
| 4.1. Unified Memory | |
This section explains the detailed behavior and use of each of the different paradigms of unified memory available. The earlier section on unified memory showed how to determine which unified memory paradigm applies and briefly introduced each.
As discussed previously there are four paradigms of unified memory programming:
▶ Full support for explicit managed memory allocations
▶ Full support for all allocations with software coherence
▶ Full support for all allocations with hardware coherence
▶ Limited unified memory support
The first three paradigms involving full unified memory support have very similar behavior and programming model and are covered in Unified Memory on Devices with Full CUDA Unified Memory Support with any differences highlighted.
The last paradigm, where unified memory support is limited, is discussed in detail in Unified Memory on Windows, WSL, and Tegra.
4.1.1. Unified Memory on Devices with Full CUDA Unified Memory Support
These systems include hardware-coherent memory systems, such as NVIDIA Grace Hopper, and modern Linux systems with Heterogeneous Memory Management (HMM) enabled. HMM is a software-based memory management system, providing the same programming model as hardware-coherent memory systems.
Linux HMM requires Linux kernel version 6.1.24+, 6.2.11+ or 6.3+, devices with compute capability 7.5 or higher and a CUDA driver version 535+ installed with Open Kernel Modules.
Note
We refer to systems with a combined page table for both CPUs and GPUs as hardware-coherent systems. Systems with separate page tables for CPUs and GPUs are referred to as software-coherent.
| Hardware-coherent systems such as NVIDIA Grace Hopper offer a logically combined page table for | |
| bothCPUsandGPUs,seeCPUandGPUPageTables:HardwareCoherencyvs.SoftwareCoherency. The | |
| 135 | |
| CUDAProgrammingGuide,Release13.1 | |
| followingsectiononlyappliestohardware-coherentsystems: | |
| ▶ AccessCounterMigration | |
4.1.1.1 Unified Memory: In-Depth Examples
Systems with full CUDA unified memory support, see table Overview of Unified Memory Paradigms, allow the device to access any memory owned by the host process interacting with the device.
This section shows a few advanced use-cases, using a kernel that simply prints the first 8 characters of an input character array to the standard output stream:
__global__ void kernel(const char* type, const char* data) {
    static const int n_char = 8;
    printf("%s - first %d characters: '", type, n_char);
    for (int i = 0; i < n_char; ++i) printf("%c", data[i]);
    printf("'\n");
}
The following tabs show various ways of how this kernel may be called with system-allocated memory:
Malloc
void test_malloc() {
    const char test_string[] = "Hello World";
    char* heap_data = (char*)malloc(sizeof(test_string));
    strncpy(heap_data, test_string, sizeof(test_string));
    kernel<<<1, 1>>>("malloc", heap_data);
    ASSERT(cudaDeviceSynchronize() == cudaSuccess,
           "CUDA failed with '%s'", cudaGetErrorString(cudaGetLastError()));
    free(heap_data);
}
Managed

void test_managed() {
    const char test_string[] = "Hello World";
    char* data;
    cudaMallocManaged(&data, sizeof(test_string));
    strncpy(data, test_string, sizeof(test_string));
    kernel<<<1, 1>>>("managed", data);
    ASSERT(cudaDeviceSynchronize() == cudaSuccess,
           "CUDA failed with '%s'", cudaGetErrorString(cudaGetLastError()));
    cudaFree(data);
}
Stack variable

void test_stack() {
    const char test_string[] = "Hello World";
    kernel<<<1, 1>>>("stack", test_string);
    ASSERT(cudaDeviceSynchronize() == cudaSuccess,
           "CUDA failed with '%s'", cudaGetErrorString(cudaGetLastError()));
}
File-scope static variable
void test_static() {
  static const char test_string[] = "Hello World";
  kernel<<<1, 1>>>("static", test_string);
  ASSERT(cudaDeviceSynchronize() == cudaSuccess,
         "CUDA failed with '%s'", cudaGetErrorString(cudaGetLastError()));
}
Global-scope variable
const char global_string[] = "Hello World";

void test_global() {
  kernel<<<1, 1>>>("global", global_string);
  ASSERT(cudaDeviceSynchronize() == cudaSuccess,
         "CUDA failed with '%s'", cudaGetErrorString(cudaGetLastError()));
}
Global-scope extern variable
// declared in separate file, see below
extern char* ext_data;

void test_extern() {
  kernel<<<1, 1>>>("extern", ext_data);
  ASSERT(cudaDeviceSynchronize() == cudaSuccess,
         "CUDA failed with '%s'", cudaGetErrorString(cudaGetLastError()));
}
/** This may be a non-CUDA file */
char* ext_data;

static const char global_string[] = "Hello World";

void __attribute__((constructor)) setup(void) {
  ext_data = (char*)malloc(sizeof(global_string));
  strncpy(ext_data, global_string, sizeof(global_string));
}

void __attribute__((destructor)) tear_down(void) {
  free(ext_data);
}
Note that the extern variable could be declared and its memory owned and managed by a third-party library, which does not interact with CUDA at all.
Also note that stack variables as well as file-scope and global-scope variables can only be accessed through a pointer by the GPU. In this specific example, this is convenient because the character array is already declared as a pointer: const char*. However, consider the following example with a global-scope integer:
// this variable is declared at global scope
int global_variable;

__global__ void kernel_uncompilable() {
  // this causes a compilation error: global (__host__) variables must not
  // be accessed from __device__ / __global__ code
  printf("%d\n", global_variable);
}

// On systems with pageableMemoryAccess set to 1, we can access the address
// of a global variable. The below kernel takes that address as an argument
__global__ void kernel(int* global_variable_addr) {
  printf("%d\n", *global_variable_addr);
}

int main() {
  kernel<<<1, 1>>>(&global_variable);
  ...
  return 0;
}
In the example above, we need to pass a pointer to the global variable to the kernel instead of directly accessing the global variable in the kernel. This is because global variables without the __managed__ specifier are declared as __host__-only by default, so most compilers currently do not allow using these variables directly in device code.
4.1.1.1.1 File-backed Unified Memory
Since systems with full CUDA unified memory support allow the device to access any memory owned by the host process, they can directly access file-backed memory.
Here, we show a modified version of the initial example from the previous section that uses file-backed memory to print a string from the GPU, read directly from an input file. In the following example, the memory is backed by a physical file, but the example applies to memory-backed files too.
__global__ void kernel(const char* type, const char* data) {
  static const int n_char = 8;
  printf("%s - first %d characters: '", type, n_char);
  for (int i = 0; i < n_char; ++i) printf("%c", data[i]);
  printf("'\n");
}

void test_file_backed() {
  int fd = open(INPUT_FILE_NAME, O_RDONLY);
  ASSERT(fd >= 0, "Invalid file handle");
  struct stat file_stat;
  int status = fstat(fd, &file_stat);
  ASSERT(status >= 0, "Invalid file stats");
  char* mapped = (char*)mmap(0, file_stat.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
  ASSERT(mapped != MAP_FAILED, "Cannot map file into memory");
  kernel<<<1, 1>>>("file-backed", mapped);
  ASSERT(cudaDeviceSynchronize() == cudaSuccess,
         "CUDA failed with '%s'", cudaGetErrorString(cudaGetLastError()));
  ASSERT(munmap(mapped, file_stat.st_size) == 0, "Cannot unmap file");
  ASSERT(close(fd) == 0, "Cannot close file");
}
Note that on systems without the hostNativeAtomicSupported property (see Host Native Atomics), including systems with Linux HMM enabled, atomic accesses to file-backed memory are not supported.
4.1.1.1.2 Inter-Process Communication (IPC) with Unified Memory
Note
As of now, using IPC with unified memory can have significant performance implications.
Many applications prefer to manage one GPU per process, but still need to use unified memory, for example for over-subscription, and access it from multiple GPUs.
CUDA IPC (see Interprocess Communication) does not support managed memory: handles to this type of memory may not be shared through any of the mechanisms discussed in this section. On systems with full CUDA unified memory support, system-allocated memory is IPC capable. Once access to system-allocated memory has been shared with other processes, the same programming model applies, similar to File-backed Unified Memory.
See the following references for more information on various ways of creating IPC-capable system-allocated memory under Linux:
▶ mmap with MAP_SHARED
▶ POSIX IPC APIs
▶ Linux memfd_create.
Note that it is not possible to share memory between different hosts and their devices using this technique.
4.1.1.2 Performance Tuning
In order to achieve good performance with unified memory, it is important to:
▶ understand how paging works on your system, and how to avoid unnecessary page faults
▶ understand the various mechanisms allowing you to keep data local to the accessing processor
▶ consider tuning your application for the granularity of memory transfers of your system.
As general advice, performance hints (see Performance Hints) might provide improved performance, but using them incorrectly might degrade performance compared to the default behavior. Also note that any hint has a performance cost associated with it on the host, thus useful hints must at the very least improve performance enough to overcome this cost.
4.1.1.2.1 Memory Paging and Page Sizes
To better understand the performance implications of unified memory, it is important to understand virtual addressing, memory pages and page sizes. This sub-section defines all necessary terms and explains why paging matters for performance.
All currently supported systems for unified memory use a virtual address space: this means that memory addresses used by an application represent a virtual location which might be mapped to a physical location where the memory actually resides.
All currently supported processors, including both CPUs and GPUs, additionally use memory paging. Because all systems use a virtual address space, there are two types of memory pages:
▶ Virtual pages: a fixed-size contiguous chunk of virtual memory per process, tracked by the operating system, which can be mapped into physical memory. Note that the virtual page is linked to the mapping: for example, a single virtual address might be mapped into physical memory using different page sizes.
▶ Physical pages: a fixed-size contiguous chunk of memory the processor's main Memory Management Unit (MMU) supports and into which a virtual page can be mapped.
Currently, all x86_64 CPUs use a default physical page size of 4 KiB. Arm CPUs support multiple physical page sizes - 4 KiB, 16 KiB, 32 KiB and 64 KiB - depending on the exact CPU. Finally, NVIDIA GPUs support multiple physical page sizes, but prefer 2 MiB physical pages or larger. Note that these sizes are subject to change in future hardware.
The default page size of virtual pages usually corresponds to the physical page size, but an application may use different page sizes as long as they are supported by the operating system and the hardware. Typically, supported virtual page sizes must be powers of 2 and multiples of the physical page size.
The logical entity tracking the mapping of virtual pages into physical pages will be referred to as a page table, and each mapping of a given virtual page with a given virtual size to physical pages is called a Page Table Entry (PTE). All supported processors provide specific caches for the page table to speed up the translation of virtual addresses to physical addresses. These caches are called Translation Lookaside Buffers (TLBs).
There are two important aspects for performance tuning of applications:
▶ the choice of virtual page size,
▶ whether the system offers a combined page table used by both CPUs and GPUs, or separate page tables for each CPU and GPU individually.
4.1.1.2.1.1 Choosing the Right Page Size
In general, small page sizes lead to less (virtual) memory fragmentation but more TLB misses, whereas larger page sizes lead to more memory fragmentation but fewer TLB misses. Additionally, memory migration is generally more expensive with larger page sizes than with smaller ones, because full memory pages are typically migrated. This can cause larger latency spikes in an application using large page sizes. See also the next section for more details on page faults.
One important aspect for performance tuning is that TLB misses are generally significantly more expensive on the GPU than on the CPU. This means that if a GPU thread frequently accesses random locations of unified memory mapped using a small enough page size, it might be significantly slower compared to the same accesses to unified memory mapped using a large enough page size. While a similar effect might occur for a CPU thread randomly accessing a large area of memory mapped using a small page size, the slowdown is less pronounced, meaning that the application might want to trade off this slowdown against less memory fragmentation.
Note that in general, applications should not tune their performance to the physical page size of a given processor, since physical page sizes are subject to change depending on the hardware. The advice above only applies to virtual page sizes.
4.1.1.2.1.2 CPU and GPU Page Tables: Hardware Coherency vs. Software Coherency
Hardware-coherent systems such as NVIDIA Grace Hopper offer a logically combined page table for both CPUs and GPUs. This is important because in order to access system-allocated memory from the GPU, the GPU uses whichever page table entry was created by the CPU for the requested memory. If that page table entry uses the default CPU page size of 4 KiB or 64 KiB, accesses to large virtual memory areas will cause significant TLB misses, and thus significant slowdowns.
On the other hand, on software-coherent systems where the CPUs and GPUs each have their own logical page table, different performance tuning aspects should be considered: in order to guarantee coherency, these systems usually use page faults when a processor accesses a memory address mapped into the physical memory of a different processor. Such a page fault means that:
▶ It needs to be ensured that the currently owning processor (where the physical page currently resides) cannot access this page anymore, either by deleting the page table entry or updating it.
▶ It needs to be ensured that the processor requesting access can access this page, either by creating a new page table entry or updating an existing entry such that it becomes valid/active.
▶ The physical page backing this virtual page must be moved/migrated to the processor requesting access: this can be an expensive operation, and the amount of work is proportional to the page size.
Overall, hardware-coherent systems provide significant performance benefits compared to software-coherent systems in cases where frequent concurrent accesses to the same memory page are made by both CPU and GPU threads:
▶ fewer page faults: these systems do not need to use page faults for emulating coherency or migrating memory,
▶ less contention: these systems are coherent at cache-line granularity instead of page-size granularity; that is, when there is contention from multiple processors within a cache line, only the cache line is exchanged, which is much smaller than the smallest page size, and when the different processors access different cache lines within a page, there is no contention.
This impacts the performance of the following scenarios:
▶ atomic updates to the same address concurrently from both CPUs and GPUs
▶ signaling a GPU thread from a CPU thread or vice-versa.
4.1.1.2.2 Direct Unified Memory Access from the Host
Some devices have hardware support for coherent reads, stores and atomic accesses from the host to GPU-resident unified memory. These devices have the attribute cudaDevAttrDirectManagedMemAccessFromHost set to 1. Note that all hardware-coherent systems have this attribute set for NVLink-connected devices. On these systems, the host has direct access to GPU-resident memory without page faults and data migration. Note that with CUDA managed memory, the cudaMemAdviseSetAccessedBy hint with location type cudaMemLocationTypeHost is necessary to enable this direct access without page faults; see the example below.
System Allocator
__global__ void write(int *ret, int a, int b) {
  ret[threadIdx.x] = a + b + threadIdx.x;
}

__global__ void append(int *ret, int a, int b) {
  ret[threadIdx.x] += a + b + threadIdx.x;
}

void test_malloc() {
  int *ret = (int*)malloc(1000 * sizeof(int));
  // for shared page table systems, the following hint is not necessary
  cudaMemLocation location = {.type = cudaMemLocationTypeHost};
  cudaMemAdvise(ret, 1000 * sizeof(int), cudaMemAdviseSetAccessedBy, location);
  write<<< 1, 1000 >>>(ret, 10, 100);  // pages populated in GPU memory
  cudaDeviceSynchronize();
  for (int i = 0; i < 1000; i++)
    printf("%d: A+B = %d\n", i, ret[i]);
    // directManagedMemAccessFromHost=1: CPU accesses GPU memory directly without migrations
    // directManagedMemAccessFromHost=0: CPU faults and triggers device-to-host migrations
  append<<< 1, 1000 >>>(ret, 10, 100);
    // directManagedMemAccessFromHost=1: GPU accesses GPU memory without migrations
  cudaDeviceSynchronize();
    // directManagedMemAccessFromHost=0: GPU faults and triggers host-to-device migrations
  free(ret);
}
| Managed | |
__global__ void write(int *ret, int a, int b) {
  ret[threadIdx.x] = a + b + threadIdx.x;
}

__global__ void append(int *ret, int a, int b) {
  ret[threadIdx.x] += a + b + threadIdx.x;
}

void test_managed() {
  int *ret;
  cudaMallocManaged(&ret, 1000 * sizeof(int));
  cudaMemLocation location = {.type = cudaMemLocationTypeHost};
  cudaMemAdvise(ret, 1000 * sizeof(int), cudaMemAdviseSetAccessedBy,
                location);  // set direct access hint
  write<<< 1, 1000 >>>(ret, 10, 100);  // pages populated in GPU memory
  cudaDeviceSynchronize();
  for (int i = 0; i < 1000; i++)
    printf("%d: A+B = %d\n", i, ret[i]);
    // directManagedMemAccessFromHost=1: CPU accesses GPU memory directly without migrations
    // directManagedMemAccessFromHost=0: CPU faults and triggers device-to-host migrations
  append<<< 1, 1000 >>>(ret, 10, 100);
    // directManagedMemAccessFromHost=1: GPU accesses GPU memory without migrations
  cudaDeviceSynchronize();
    // directManagedMemAccessFromHost=0: GPU faults and triggers host-to-device migrations
  cudaFree(ret);
}
After the write kernel has completed, ret will be created and initialized in GPU memory. Next, the CPU will access ret, followed by the append kernel using the same ret memory again. This code will show different behavior depending on the system architecture and support for hardware coherency:
▶ on systems with directManagedMemAccessFromHost=1: CPU accesses to the managed buffer will not trigger any migrations; the data will remain resident in GPU memory and any subsequent GPU kernels can continue to access it directly without inflicting faults or migrations
▶ on systems with directManagedMemAccessFromHost=0: CPU accesses to the managed buffer will page fault and initiate data migration; any GPU kernel trying to access the same data for the first time will page fault and migrate the pages back to GPU memory.
4.1.1.2.3 Host Native Atomics
Some devices, including NVLink-connected devices of hardware-coherent systems, support hardware-accelerated atomic accesses to CPU-resident memory. This implies that atomic accesses to host memory do not have to be emulated with a page fault. For these devices, the attribute cudaDevAttrHostNativeAtomicSupported is set to 1.
4.1.1.2.4 Atomic Accesses and Synchronization Primitives
CUDA unified memory supports all atomic operations available to host and device threads, enabling all threads to cooperate by concurrently accessing the same shared memory location. The libcu++ library provides many heterogeneous synchronization primitives tuned for concurrent use between host and device threads, including cuda::atomic, cuda::atomic_ref, cuda::barrier, cuda::semaphore, among many others.
On software-coherent systems, atomic accesses from the device to file-backed host memory are not supported. The following example code is valid on hardware-coherent systems but exhibits undefined behavior on other systems:
#include <cuda/atomic>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>

#define ERR(msg, ...) { fprintf(stderr, msg, ##__VA_ARGS__); return EXIT_FAILURE; }

__global__ void kernel(int* ptr) {
  cuda::atomic_ref{*ptr}.store(2);
}

int main() {
  // this will be closed/deleted by default on exit
  FILE* tmp_file = tmpfile64();
  // need to allocate space in the file, we do this with posix_fallocate here
  int status = posix_fallocate(fileno(tmp_file), 0, 4096);
  if (status != 0) ERR("Failed to allocate space in temp file\n");
  int* ptr = (int*)mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fileno(tmp_file), 0);
  if (ptr == MAP_FAILED) ERR("Failed to map temp file\n");
  // initialize the value in our file-backed memory
  *ptr = 1;
  printf("Atom value: %d\n", *ptr);
  // device and host thread access ptr concurrently, using cuda::atomic_ref
  kernel<<<1, 1>>>(ptr);
  while (cuda::atomic_ref{*ptr}.load() != 2);
  // this will always be 2
  printf("Atom value: %d\n", *ptr);
  return EXIT_SUCCESS;
}
On software-coherent systems, atomic accesses to unified memory may incur page faults, which can lead to significant latencies. Note that this is not the case for all GPU atomics to CPU memory on these systems: operations listed by nvidia-smi -q | grep "Atomic Caps Outbound" may avoid page faults.
On hardware-coherent systems, atomics between host and device do not require page faults, but may still fault for other reasons that can cause any memory access to fault.
| | 4.1.1.2.5 Memcpy()/Memset()BehaviorWithUnifiedMemory | | | | | | | | |
| | ---------------------------------------------------- | --- | --- | --- | --- | --- | --- | | |
| cudaMemcpy*()andcudaMemset*()acceptanyunifiedmemorypointerasarguments. | |
| For cudaMemcpy*(), the direction specified as cudaMemcpyKind is a performance hint, which can | |
| haveahigherperformanceimpactifanyoftheargumentsisaunifiedmemorypointer. | |
| Thus,itisrecommendedtofollowthefollowingperformanceadvice: | |
| ▶ | |
| Whenthephysicallocationofunifiedmemoryisknown,useanaccuratecudaMemcpyKindhint. | |
| ▶ PrefercudaMemcpyDefaultoveraninaccuratecudaMemcpyKindhint. | |
| ▶ Alwaysusepopulated(initialized)buffers: avoidusingtheseAPIstoinitializememory. | |
| ▶ | |
| AvoidusingcudaMemcpy*()ifbothpointerspointtosystem-allocatedmemory: launchakernel | |
| oruseaCPUmemorycopyalgorithmsuchasstd::memcpyinstead. | |
4.1.1.2.6 Overview of Memory Allocators for Unified Memory
For systems with full CUDA unified memory support, various different allocators may be used to allocate unified memory. The following table shows an overview of a selection of allocators with their respective features. Note that all information in this section is subject to change in future CUDA versions.
Table 7: Overview of unified memory support of different allocators

| API | Placement Policy | Accessible From | Migrate Based On Access² | Page Sizes⁴ ⁵ |
| --- | --- | --- | --- | --- |
| malloc, new, mmap | First touch/hint¹ | CPU, GPU | Yes³ | System or huge page size⁶ |
| cudaMallocManaged | First touch/hint | CPU, GPU | Yes | CPU resident: system page size; GPU resident: 2MB |
| cudaMalloc | GPU | GPU | No | GPU page size: 2MB |
| cudaMallocHost, cudaHostAlloc, cudaHostRegister | CPU | CPU, GPU | No | Mapped by CPU: system page size; mapped by GPU: 2MB |
| Memory pools, location type host: cuMemCreate, cudaMemPoolCreate | CPU | CPU, GPU | No | Mapped by CPU: system page size; mapped by GPU: 2MB |
| Memory pools, location type device: cuMemCreate, cudaMemPoolCreate, cudaMallocAsync | GPU | GPU | No | 2MB |
The table Overview of unified memory support of different allocators shows the difference in semantics of several allocators that may be considered to allocate data accessible from multiple processors at a time, including host and device. For additional details about cudaMemPoolCreate, see the Memory Pools section; for additional details about cuMemCreate, see the Virtual Memory Management section.
On hardware-coherent systems where device memory is exposed as a NUMA domain to the system, special allocators such as numa_alloc_on_node may be used to pin memory to the given NUMA node, either host or device. This memory is accessible from both host and device and does not migrate. Similarly, mbind can be used to pin memory to the given NUMA node(s), and can cause file-backed memory to be placed on the given NUMA node(s) before it is first accessed.
The following applies to allocators of memory that is shared:
¹ For mmap, file-backed memory is placed on the CPU by default, unless specified otherwise through cudaMemAdviseSetPreferredLocation (or mbind, see bullet points below).
² This feature can be overridden with cudaMemAdvise. Even if access-based migrations are disabled, if the backing memory space is full, memory might migrate.
³ File-backed memory will not migrate based on access.
⁴ The default system page size is 4 KiB or 64 KiB on most systems, unless a huge page size was explicitly specified (for example, with mmap MAP_HUGETLB/MAP_HUGE_SHIFT). In this case, any huge page size configured on the system is supported.
⁵ Page sizes for GPU-resident memory may evolve in future CUDA versions.
⁶ Currently, huge page sizes may not be kept when migrating memory to the GPU or placing it through first-touch on the GPU.
▶ System allocators such as mmap allow sharing the memory between processes using the MAP_SHARED flag. This is supported in CUDA and can be used to share memory between different devices connected to the same host. However, this is currently not supported for sharing memory between multiple hosts and their devices. See Inter-Process Communication (IPC) with Unified Memory for details.
▶ For access to unified memory or other CUDA memory through a network on multiple hosts, consult the documentation of the communication library used, for example NCCL, NVSHMEM, Open-MPI, UCX, etc.
4.1.1.2.7 Access Counter Migration
On hardware-coherent systems, the access counters feature keeps track of the frequency of accesses that a GPU makes to memory located on other processors. This is needed to ensure memory pages are moved to the physical memory of the processor that is accessing the pages most frequently. It can guide migrations between CPU and GPU, as well as between peer GPUs, a process called access counter migration.
Starting with CUDA 12.4, access counters are supported for system-allocated memory. Note that file-backed memory does not migrate based on access. For system-allocated memory, access counter migration can be switched on by using the cudaMemAdviseSetAccessedBy hint to a device with the corresponding device id. If access counters are on, one can use cudaMemAdviseSetPreferredLocation set to host to prevent migrations. By default, cudaMallocManaged migrates based on a fault-and-migrate mechanism.⁷
The driver may also use access counters for more efficient thrashing mitigation or memory oversubscription scenarios.
4.1.1.2.8 Avoid Frequent Writes to GPU-Resident Memory from the CPU
If the host accesses unified memory, cache misses may introduce more traffic than expected between host and device. Many CPU architectures require all memory operations to go through the cache hierarchy, including writes. If system memory is resident on the GPU, this means that frequent writes by the CPU to this memory can cause cache misses, thus transferring the data first from the GPU to the CPU before writing the actual value into the requested memory range. On software-coherent systems, this may introduce additional page faults, while on hardware-coherent systems, it may cause higher latencies between CPU operations. Thus, in order to share data produced by the host with the device, consider writing to CPU-resident memory and reading the values directly from the device. The code below shows how to achieve this with unified memory.
System Allocator

size_t data_size = sizeof(int);
int* data = (int*)malloc(data_size);
// ensure that data stays local to the host and avoid faults
cudaMemLocation location = {.type = cudaMemLocationTypeHost};
cudaMemAdvise(data, data_size, cudaMemAdviseSetPreferredLocation, location);
cudaMemAdvise(data, data_size, cudaMemAdviseSetAccessedBy, location);
// frequent exchanges of small data: if the CPU writes to CPU-resident memory,
// and the GPU directly accesses that data, we can avoid the CPU caches
// re-loading data if it was evicted in between writes
for (int i = 0; i < 10; ++i) {
    *data = 42 + i;
    kernel<<<1, 1>>>(data);
    cudaDeviceSynchronize();
    // CPU cache potentially evicted data here
}
free(data);

7 Current systems allow the use of access-counter migration with managed memory when the accessed-by device hint is set. This is an implementation detail and should not be relied on for future compatibility.

146 Chapter 4. CUDA Features
CUDA Programming Guide, Release 13.1
Managed

int* data;
size_t data_size = sizeof(int);
cudaMallocManaged(&data, data_size);
// ensure that data stays local to the host and avoid faults
cudaMemLocation location = {.type = cudaMemLocationTypeHost};
cudaMemAdvise(data, data_size, cudaMemAdviseSetPreferredLocation, location);
cudaMemAdvise(data, data_size, cudaMemAdviseSetAccessedBy, location);
// frequent exchanges of small data: if the CPU writes to CPU-resident memory,
// and the GPU directly accesses that data, we can avoid the CPU caches
// re-loading data if it was evicted in between writes
for (int i = 0; i < 10; ++i) {
    *data = 42 + i;
    kernel<<<1, 1>>>(data);
    cudaDeviceSynchronize();
    // CPU cache potentially evicted data here
}
cudaFree(data);
4.1.1.2.9 Exploiting Asynchronous Access to System Memory
If an application needs to share results from work on the device with the host, there are several possible options:
1. The device writes its result to GPU-resident memory, the result is transferred using cudaMemcpy*, and the host reads the transferred data.
2. The device directly writes its result to CPU-resident memory, and the host reads that data.
3. The device writes to GPU-resident memory, and the host directly accesses that data.
If independent work can be scheduled on the device while the result is transferred/accessed by the host, options 1 or 3 are preferred. If the device is starved until the host has accessed the result, option 2 might be preferred. This is because the device can generally write at a higher bandwidth than the host can read, unless many host threads are used to read the data.
4.1. Unified Memory 147
1. Explicit Copy

void exchange_explicit_copy(cudaStream_t stream) {
    int *data, *host_data;
    size_t n_bytes = sizeof(int) * 16;
    // allocate receiving buffer
    host_data = (int*)malloc(n_bytes);
    // allocate; since we touch on the device first, it will be GPU-resident
    cudaMallocManaged(&data, n_bytes);
    kernel<<<1, 16, 0, stream>>>(data);
    // launch independent work on the device
    // other_kernel<<<1024, 256, 0, stream>>>(other_data, ...);
    // transfer to host
    cudaMemcpyAsync(host_data, data, n_bytes, cudaMemcpyDeviceToHost, stream);
    // sync stream to ensure data has been transferred
    cudaStreamSynchronize(stream);
    // read transferred data
    printf("Got values %d - %d from GPU\n", host_data[0], host_data[15]);
    cudaFree(data);
    free(host_data);
}
2. Device Direct Write

void exchange_device_direct_write(cudaStream_t stream) {
    int* data;
    size_t n_bytes = sizeof(int) * 16;
    // allocate receiving buffer
    cudaMallocManaged(&data, n_bytes);
    // ensure that data is mapped and resident on the host
    cudaMemLocation location = {.type = cudaMemLocationTypeHost};
    cudaMemAdvise(data, n_bytes, cudaMemAdviseSetPreferredLocation, location);
    cudaMemAdvise(data, n_bytes, cudaMemAdviseSetAccessedBy, location);
    kernel<<<1, 16, 0, stream>>>(data);
    // sync stream to ensure data has been transferred
    cudaStreamSynchronize(stream);
    // read transferred data
    printf("Got values %d - %d from GPU\n", data[0], data[15]);
    cudaFree(data);
}
3. Host Direct Read

void exchange_host_direct_read(cudaStream_t stream) {
    int* data;
    size_t n_bytes = sizeof(int) * 16;
    // allocate receiving buffer
    cudaMallocManaged(&data, n_bytes);
    // ensure that data is mapped and resident on the device
    cudaMemLocation device_loc = {};
    cudaGetDevice(&device_loc.id);
    device_loc.type = cudaMemLocationTypeDevice;
    cudaMemAdvise(data, n_bytes, cudaMemAdviseSetPreferredLocation, device_loc);
    cudaMemAdvise(data, n_bytes, cudaMemAdviseSetAccessedBy, device_loc);
    kernel<<<1, 16, 0, stream>>>(data);
    // launch independent work on the GPU
    // other_kernel<<<1024, 256, 0, stream>>>(other_data, ...);
    // sync stream to ensure data may be accessed (has been written by device)
    cudaStreamSynchronize(stream);
    // read data directly from host
    printf("Got values %d - %d from GPU\n", data[0], data[15]);
    cudaFree(data);
}
Finally, in the Explicit Copy example above, instead of using cudaMemcpy* to transfer data, one could use a host or device kernel to perform this transfer explicitly. For contiguous data, using the CUDA copy-engines is preferred because operations performed by copy-engines can be overlapped with work on both the host and device. Copy-engines might be used in cudaMemcpy* and cudaMemPrefetchAsync APIs, but there is no guarantee that copy-engines are used with cudaMemcpy* API calls. For the same reason, explicit copy is preferred over direct host read for large enough data: if both host and device perform work that does not saturate their respective memory systems, the transfer can be performed by the copy-engines concurrently with the work performed by both host and device.
Copy-engines are generally used both for transfers between host and device and for transfers between peer devices within an NVLink-connected system. Due to the limited total number of copy-engines, some systems may have a lower bandwidth with cudaMemcpy* compared to using the device to explicitly perform the transfer. In such a case, if the transfer is in the critical path of the application, it may be preferred to use an explicit device-based transfer.
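An explicit device-based transfer as mentioned above might be sketched as a grid-stride copy kernel. This is a hedged illustration only: the kernel name `copy_kernel`, the element type, and the launch configuration are illustrative assumptions, not an API from this guide.

```cuda
#include <cuda_runtime.h>

// Hypothetical device-side copy: moves n ints from src (e.g., a
// GPU-resident buffer) to dst (e.g., a host-resident unified-memory
// buffer), as an alternative to cudaMemcpyAsync when copy-engine
// bandwidth is the bottleneck on a given system.
__global__ void copy_kernel(const int* __restrict__ src,
                            int* __restrict__ dst, size_t n) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += stride)
        dst[i] = src[i];
}

// usage sketch, in stream order with the producing kernel:
//   copy_kernel<<<num_blocks, 256, 0, stream>>>(gpu_data, host_data, n);
//   cudaStreamSynchronize(stream);  // before the host reads host_data
```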
4.1.2. Unified Memory on Devices with only CUDA Managed Memory Support
For devices with compute capability 6.x or higher but without pageable memory access (see table Overview of Unified Memory Paradigms), CUDA managed memory is fully supported and coherent, but the GPU cannot access system-allocated memory. The programming model and performance tuning of unified memory is largely similar to the model described in the section Unified Memory on Devices with Full CUDA Unified Memory Support, with the notable exception that system allocators cannot be used to allocate memory. Thus, the following sub-sections do not apply:
▶ Unified Memory: In-Depth Examples
▶ CPU and GPU Page Tables: Hardware Coherency vs. Software Coherency
▶ Atomic Accesses and Synchronization Primitives
▶ Access Counter Migration
▶ Avoid Frequent Writes to GPU-Resident Memory from the CPU
▶ Exploiting Asynchronous Access to System Memory
4.1.3. Unified Memory on Windows, WSL, and Tegra

Note
This section only covers devices with compute capability lower than 6.0 or devices on Windows platforms, that is, devices with the concurrentManagedAccess property set to 0.

Devices with compute capability lower than 6.0, or devices on Windows platforms, with the concurrentManagedAccess property set to 0 (see Overview of Unified Memory Paradigms) support CUDA managed memory with the following limitations:
▶ Data Migration and Coherency: Fine-grained movement of the managed data to the GPU on demand is not supported. Whenever a GPU kernel is launched, all managed memory generally has to be transferred to GPU memory to avoid faulting on memory access. Page faulting is only supported from the CPU side.
▶ GPU Memory Oversubscription: These devices cannot allocate more managed memory than the physical size of GPU memory.
▶ Coherency and Concurrency: Simultaneous access to managed memory is not possible, because coherence could not be guaranteed if the CPU accessed a unified memory allocation while a GPU kernel is active, given the missing GPU page-faulting mechanism.
4.1.3.1 Multi-GPU
On systems with devices of compute capabilities lower than 6.0, or on Windows platforms, managed allocations are automatically visible to all GPUs in a system via the peer-to-peer capabilities of the GPUs.
Managed memory allocations behave similarly to unmanaged memory allocated using cudaMalloc(): the current active device is the home for the physical allocation, but other GPUs in the system will access the memory at reduced bandwidth over the PCIe bus.
On Linux, managed memory is allocated in GPU memory as long as all GPUs that are actively being used by a program have peer-to-peer support. If at any time the application starts using a GPU that doesn't have peer-to-peer support with any of the other GPUs that have managed allocations on them, then the driver will migrate all managed allocations to system memory. In this case, all GPUs experience PCIe bandwidth restrictions.
On Windows, if peer mappings are not available (for example, between GPUs of different architectures), then the system will automatically fall back to using mapped memory, regardless of whether both GPUs are actually used by a program. If only one GPU is actually going to be used, it is necessary to set the CUDA_VISIBLE_DEVICES environment variable before launching the program. This constrains which GPUs are visible and allows managed memory to be allocated in GPU memory.
Alternatively, on Windows users can also set CUDA_MANAGED_FORCE_DEVICE_ALLOC to a non-zero value to force the driver to always use device memory for physical storage. When this environment variable is set to a non-zero value, all devices used in that process that support managed memory have to be peer-to-peer compatible with each other. The error cudaErrorInvalidDevice will be returned if a device that supports managed memory is used and it is not peer-to-peer compatible with any of the other managed-memory-supporting devices that were previously used in that process, even if cudaDeviceReset has been called on those devices. These environment variables are described in CUDA Environment Variables.
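For example, the environment variables above could be set as follows before launching an application; the application name `./my_app` is a placeholder for illustration.

```shell
# Restrict the process to a single GPU so managed memory can be
# allocated in device memory on systems without peer mappings.
export CUDA_VISIBLE_DEVICES=0
./my_app

# Or (Windows only) force device-memory backing for managed allocations;
# all visible managed-memory devices must then be peer-to-peer compatible.
export CUDA_MANAGED_FORCE_DEVICE_ALLOC=1
./my_app
```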
4.1.3.2 Coherency and Concurrency
To ensure coherency, the unified memory programming model puts constraints on data accesses while both the CPU and GPU are executing concurrently. In effect, while any kernel operation is executing, the GPU has exclusive access to all managed data and the CPU is not permitted to access it, regardless of whether the specific kernel is actively using the data. Concurrent CPU/GPU accesses, even to different managed memory allocations, will cause a segmentation fault because the page is considered inaccessible to the CPU.
For example, the following code runs successfully on devices of compute capability 6.x due to the GPU page-faulting capability, which lifts all restrictions on simultaneous access, but fails on pre-6.x architectures and Windows platforms because the GPU kernel is still active when the CPU touches y:
__device__ __managed__ int x, y = 2;
__global__ void kernel() {
    x = 10;
}

int main() {
    kernel<<< 1, 1 >>>();
    y = 20;  // Error on GPUs not supporting concurrent access
    cudaDeviceSynchronize();
    return 0;
}
The program must explicitly synchronize with the GPU before accessing y (regardless of whether the GPU kernel actually touches y, or any managed data at all):
__device__ __managed__ int x, y = 2;
__global__ void kernel() {
    x = 10;
}

int main() {
    kernel<<< 1, 1 >>>();
    cudaDeviceSynchronize();
    y = 20;  // Success on GPUs not supporting concurrent access
    return 0;
}
| Note that any function call that logically guarantees the GPU completes its work is valid to ensure | |
| logicallythattheGPUworkiscompleted,seeExplicitSynchronization. | |
| NotethatifmemoryisdynamicallyallocatedwithcudaMallocManaged()orcuMemAllocManaged() | |
| whiletheGPUisactive,thebehaviorofthememoryisunspecifieduntiladditionalworkislaunchedor | |
| theGPUissynchronized. AttemptingtoaccessthememoryontheCPUduringthistimemayormay | |
| notcauseasegmentationfault. ThisdoesnotapplytomemoryallocatedusingtheflagcudaMemAt- | |
| tachHostorCU_MEM_ATTACH_HOST. | |
4.1.3.3 Stream Associated Unified Memory
The CUDA programming model provides streams as a mechanism for programs to indicate dependence and independence among kernel launches. Kernels launched into the same stream are guaranteed to execute consecutively, while kernels launched into different streams are permitted to execute concurrently. See section CUDA Streams.
4.1.3.3.1 Stream Callbacks
It is legal for the CPU to access managed data from within a stream callback, provided no other stream that could potentially be accessing managed data is active on the GPU. In addition, a callback that is not followed by any device work can be used for synchronization: for example, by signaling a condition variable from inside the callback; otherwise, CPU access is valid only for the duration of the callback(s).
There are several important points of note:
1. It is always permitted for the CPU to access non-managed mapped memory data while the GPU is active.
2. The GPU is considered active when it is running any kernel, even if that kernel does not make use of managed data. If a kernel might use data, then access is forbidden.
3. There are no constraints on concurrent inter-GPU access of managed memory, other than those that apply to multi-GPU access of non-managed memory.
4. There are no constraints on concurrent GPU kernels accessing managed data.
Note how the last point allows for races between GPU kernels, as is currently the case for non-managed GPU memory. From the perspective of the GPU, managed memory functions identically to non-managed memory. The following code example illustrates these points:
int main() {
    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    int *non_managed, *managed, *also_managed;
    cudaMallocHost(&non_managed, 4);  // Non-managed, CPU-accessible memory
    cudaMallocManaged(&managed, 4);
    cudaMallocManaged(&also_managed, 4);
    // Point 1: CPU can access non-managed data.
    kernel<<< 1, 1, 0, stream1 >>>(managed);
    *non_managed = 1;
    // Point 2: CPU cannot access any managed data while GPU is busy,
    //          unless concurrentManagedAccess = 1
    // Note we have not yet synchronized, so "kernel" is still active.
    *also_managed = 2;  // Will issue segmentation fault
    // Point 3: Concurrent GPU kernels can access the same data.
    kernel<<< 1, 1, 0, stream2 >>>(managed);
    // Point 4: Multi-GPU concurrent access is also permitted.
    cudaSetDevice(1);
    kernel<<< 1, 1 >>>(managed);
    return 0;
}
4.1.3.3.2 Managed memory associated with streams allows for finer-grained control
Unified memory builds upon the stream-independence model by allowing a CUDA program to explicitly associate managed allocations with a CUDA stream. In this way, the programmer indicates the use of data by kernels based on whether they are launched into a specified stream or not. This enables opportunities for concurrency based on program-specific data access patterns. The function to control this behavior is:

cudaError_t cudaStreamAttachMemAsync(cudaStream_t stream,
                                     void *ptr,
                                     size_t length=0,
                                     unsigned int flags=0);

The cudaStreamAttachMemAsync() function associates length bytes of memory starting from ptr with the specified stream. This allows CPU access to that memory region as long as all operations in stream have completed, regardless of whether other streams are active. In effect, this constrains exclusive ownership of the managed memory region by an active GPU to per-stream activity instead of whole-GPU activity. Most importantly, if an allocation is not associated with a specific stream, it is visible to all running kernels regardless of their stream. This is the default visibility for a cudaMallocManaged() allocation or a __managed__ variable; hence, the simple-case rule that the CPU may not touch the data while any kernel is running.

Note
By associating an allocation with a specific stream, the program makes a guarantee that only kernels launched into that stream will touch that data. No error checking is performed by the unified memory system.

Note
In addition to allowing greater concurrency, the use of cudaStreamAttachMemAsync() can enable data transfer optimizations within the unified memory system that may affect latencies and other overhead.
The following example shows how to explicitly associate y with host accessibility, thus enabling access at all times from the CPU. (Note the absence of cudaDeviceSynchronize() after the kernel call.) Accesses to y by the GPU running kernel will now produce undefined results.

__device__ __managed__ int x, y = 2;
__global__ void kernel() {
    x = 10;
}

int main() {
    cudaStream_t stream1;
    cudaStreamCreate(&stream1);
    cudaStreamAttachMemAsync(stream1, &y, 0, cudaMemAttachHost);
    cudaDeviceSynchronize();           // Wait for Host attachment to occur.
    kernel<<< 1, 1, 0, stream1 >>>();  // Note: Launches into stream1.
    y = 20;                            // Success - a kernel is running but "y"
                                       // has been associated with no stream.
    return 0;
}
4.1.3.3.3 A more elaborate example on multithreaded host programs
The primary use for cudaStreamAttachMemAsync() is to enable independent task parallelism using CPU threads. Typically in such a program, a CPU thread creates its own stream for all work that it generates because using CUDA's NULL stream would cause dependencies between threads. The default global visibility of managed data to any GPU stream can make it difficult to avoid interactions between CPU threads in a multi-threaded program. Function cudaStreamAttachMemAsync() is therefore used to associate a thread's managed allocations with that thread's own stream, and the association is typically not changed for the life of the thread. Such a program would simply add a single call to cudaStreamAttachMemAsync() to use unified memory for its data accesses:
// This function performs some task, in its own private stream,
// and can be run in parallel
void run_task(int *in, int *out, int length) {
    // Create a stream for us to use.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    // Allocate some managed data and associate with our stream.
    // Note the use of the host-attach flag to cudaMallocManaged();
    // we then associate the allocation with our stream so that
    // our GPU kernel launches can access it.
    int *data;
    cudaMallocManaged((void **)&data, length, cudaMemAttachHost);
    cudaStreamAttachMemAsync(stream, data);
    cudaStreamSynchronize(stream);
    // Iterate on the data in some way, using both Host & Device.
    for (int i = 0; i < N; i++) {
        transform<<< 100, 256, 0, stream >>>(in, data, length);
        cudaStreamSynchronize(stream);
        host_process(data, length);  // CPU uses managed data.
        convert<<< 100, 256, 0, stream >>>(out, data, length);
    }
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(data);
}
In this example, the allocation-stream association is established just once, and then data is used repeatedly by both the host and device. The result is much simpler code than occurs with explicitly copying data between host and device, although the result is the same.
The function cudaMallocManaged() specifies the cudaMemAttachHost flag, which creates an allocation that is initially invisible to device-side execution. (The default allocation would be visible to all GPU kernels on all streams.) This ensures that there is no accidental interaction with another thread's execution in the interval between the data allocation and when the data is acquired for a specific stream.
Without this flag, a new allocation would be considered in-use on the GPU if a kernel launched by another thread happens to be running. This might impact the thread's ability to access the newly allocated data from the CPU before it is able to explicitly attach it to a private stream. To enable safe independence between threads, therefore, allocations should be made specifying this flag.
An alternative would be to place a process-wide barrier across all threads after the allocation has been attached to the stream. This would ensure that all threads complete their data/stream associations before any kernels are launched, avoiding the hazard. A second barrier would be needed before the stream is destroyed because stream destruction causes allocations to revert to their default visibility. The cudaMemAttachHost flag exists both to simplify this process, and because it is not always possible to insert global barriers where required.
4.1.3.3.4 Data Movement of Stream Associated Unified Memory
Memcpy()/Memset() with stream-associated unified memory behaves differently on devices where concurrentManagedAccess is not set. On such devices, the following rules apply:
If cudaMemcpyHostTo* is specified and the source data is unified memory, then it will be accessed from the host if it is coherently accessible from the host in the copy stream (1); otherwise it will be accessed from the device. Similar rules apply to the destination when cudaMemcpy*ToHost is specified and the destination is unified memory.
If cudaMemcpyDeviceTo* is specified and the source data is unified memory, then it will be accessed from the device. The source must be coherently accessible from the device in the copy stream (2); otherwise, an error is returned. Similar rules apply to the destination when cudaMemcpy*ToDevice is specified and the destination is unified memory.
If cudaMemcpyDefault is specified, then unified memory will be accessed from the host either if it cannot be coherently accessed from the device in the copy stream (2) or if the preferred location for the data is cudaCpuDeviceId and it can be coherently accessed from the host in the copy stream (1); otherwise, it will be accessed from the device.
When using cudaMemset*() with unified memory, the data must be coherently accessible from the device in the stream being used for the cudaMemset() operation (2); otherwise, an error is returned.
When data is accessed from the device either by cudaMemcpy* or cudaMemset*, the stream of operation is considered to be active on the GPU. During this time, any CPU access of data that is associated with that stream, or of data that has global visibility, will result in a segmentation fault if the GPU has a zero value for the device attribute concurrentManagedAccess. The program must synchronize appropriately to ensure the operation has completed before accessing any associated data from the CPU.
1. Coherently accessible from the host in a given stream means that the memory neither has global visibility nor is it associated with the given stream.
2. Coherently accessible from the device in a given stream means that the memory either has global visibility or is associated with the given stream.
4.1.4. Performance Hints
Performance hints allow programmers to provide CUDA with more information about unified memory usage. CUDA uses performance hints to manage memory more efficiently and improve application performance. Performance hints never impact the correctness of an application; they only affect performance.

Note
Applications should only use unified memory performance hints if they improve performance.

Performance hints may be used on any unified memory allocation, including CUDA managed memory. On systems with full CUDA unified memory support, performance hints can be applied to all system-allocated memory.
4.1.4.1 Data Prefetching
The cudaMemPrefetchAsync API is an asynchronous stream-ordered API that may migrate data to
reside closer to the specified processor. The data may be accessed while it is being prefetched. The
migration does not begin until all prior operations in the stream have completed, and completes before
any subsequent operation in the stream.
    cudaError_t cudaMemPrefetchAsync(const void *devPtr,
                                     size_t count,
                                     struct cudaMemLocation location,
                                     unsigned int flags,
                                     cudaStream_t stream = 0);
A memory region containing [devPtr, devPtr + count) may be migrated to the destination de-
vice location.id if location.type is cudaMemLocationTypeDevice, or to the CPU if location.type
is cudaMemLocationTypeHost, when the prefetch task is executed in the given stream. For details
on flags, see the current CUDA Runtime API documentation.
Consider the simple code example below:
System Allocator
    void test_prefetch_sam(const cudaStream_t& s) {
        // initialize data on CPU
        char *data = (char*)malloc(dataSizeBytes);
        init_data(data, dataSizeBytes);

        cudaMemLocation location = {.type = cudaMemLocationTypeDevice, .id = myGpuId};

        // encourage data to move to GPU before use
        const unsigned int flags = 0;
        cudaMemPrefetchAsync(data, dataSizeBytes, location, flags, s);

        // use data on GPU
        const unsigned num_blocks = (dataSizeBytes + threadsPerBlock - 1) / threadsPerBlock;
        mykernel<<<num_blocks, threadsPerBlock, 0, s>>>(data, dataSizeBytes);

        // encourage data to move back to CPU
        location = {.type = cudaMemLocationTypeHost};
        cudaMemPrefetchAsync(data, dataSizeBytes, location, flags, s);
        cudaStreamSynchronize(s);

        // use data on CPU
        use_data(data, dataSizeBytes);
        free(data);
    }
Managed

    void test_prefetch_managed(const cudaStream_t& s) {
        // initialize data on CPU
        char *data;
        cudaMallocManaged(&data, dataSizeBytes);
        init_data(data, dataSizeBytes);

        cudaMemLocation location = {.type = cudaMemLocationTypeDevice, .id = myGpuId};

        // encourage data to move to GPU before use
        const unsigned int flags = 0;
        cudaMemPrefetchAsync(data, dataSizeBytes, location, flags, s);

        // use data on GPU
        const unsigned num_blocks = (dataSizeBytes + threadsPerBlock - 1) / threadsPerBlock;
        mykernel<<<num_blocks, threadsPerBlock, 0, s>>>(data, dataSizeBytes);

        // encourage data to move back to CPU
        location = {.type = cudaMemLocationTypeHost};
        cudaMemPrefetchAsync(data, dataSizeBytes, location, flags, s);
        cudaStreamSynchronize(s);

        // use data on CPU
        use_data(data, dataSizeBytes);
        cudaFree(data);
    }
4.1.4.2 Data Usage Hints
When multiple processors simultaneously access the same data, cudaMemAdvise may be used to hint
how the data at [devPtr, devPtr + count) will be accessed:

    cudaError_t cudaMemAdvise(const void *devPtr,
                              size_t count,
                              enum cudaMemoryAdvise advice,
                              struct cudaMemLocation location);
The examples in this section assume the following constants:

    static const int maxDevices = 1;
    static const int maxOuterLoopIter = 3;
    static const int maxInnerLoopIter = 4;
Where advice may take the following values:
▶ cudaMemAdviseSetReadMostly:
This implies that the data is mostly going to be read from and only occasionally written to.
In general, it allows trading off read bandwidth for write bandwidth on this region.
▶ cudaMemAdviseSetPreferredLocation:
This hint sets the preferred location for the data to be the specified device's physical mem-
ory. This hint encourages the system to keep the data at the preferred location, but does not
guarantee it. Passing in a value of cudaMemLocationTypeHost for location.type sets the
preferred location as CPU memory. Other hints, like cudaMemPrefetchAsync, may override
this hint and allow the memory to migrate away from its preferred location.
▶ cudaMemAdviseSetAccessedBy:
In some systems, it may be beneficial for performance to establish a mapping into memory
before accessing the data from a given processor. This hint tells the system that the data
will be frequently accessed by location.id when location.type is cudaMemLocation-
TypeDevice, enabling the system to assume that creating these mappings pays off. This
hint does not imply where the data should reside, but it can be combined with cudaMemAd-
viseSetPreferredLocation to specify that. On hardware-coherent systems, this hint
switches on access counter migration; see Access Counter Migration.
Each advice can also be unset by using one of the following values: cudaMemAdviseUnsetRead-
Mostly, cudaMemAdviseUnsetPreferredLocation and cudaMemAdviseUnsetAccessedBy.
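For example, a region that is read-mostly during one phase of an application can have the advice cleared before a write-heavy phase. This is a sketch; dataPtr, dataSize and loc are assumed to be set up as in the examples below:

```cpp
// Read-heavy phase: the read-mostly advice allows the system to
// create read-duplicated copies of the range near each reader.
cudaMemAdvise(dataPtr, dataSize, cudaMemAdviseSetReadMostly, loc);
// ... launch kernels that only read dataPtr ...

// Before a write-heavy phase, clear the advice so that writes do
// not pay the cost of collapsing duplicated copies on every update.
cudaMemAdvise(dataPtr, dataSize, cudaMemAdviseUnsetReadMostly, loc);
```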
The following example shows how to use cudaMemAdvise:
System Allocator
    void test_advise_sam(cudaStream_t stream) {
        char *dataPtr;
        size_t dataSize = 64 * threadsPerBlock;  // 16 KiB

        // Allocate memory using malloc or cudaMallocManaged
        dataPtr = (char*)malloc(dataSize);

        // Set the advice on the memory region
        cudaMemLocation loc = {.type = cudaMemLocationTypeDevice, .id = myGpuId};
        cudaMemAdvise(dataPtr, dataSize, cudaMemAdviseSetReadMostly, loc);

        int outerLoopIter = 0;
        while (outerLoopIter < maxOuterLoopIter) {
            // The data is written by the CPU each outer loop iteration
            init_data(dataPtr, dataSize);

            // The data is made available to all GPUs by prefetching.
            // Prefetching here causes read duplication of data instead
            // of data migration
            cudaMemLocation location;
            location.type = cudaMemLocationTypeDevice;
            for (int device = 0; device < maxDevices; device++) {
                location.id = device;
                const unsigned int flags = 0;
                cudaMemPrefetchAsync(dataPtr, dataSize, location, flags, stream);
            }

            // The kernel only reads this data in the inner loop
            int innerLoopIter = 0;
            while (innerLoopIter < maxInnerLoopIter) {
                mykernel<<<32, threadsPerBlock, 0, stream>>>((const char *)dataPtr, dataSize);
                innerLoopIter++;
            }
            outerLoopIter++;
        }
        free(dataPtr);
    }
Managed

    void test_advise_managed(cudaStream_t stream) {
        char *dataPtr;
        size_t dataSize = 64 * threadsPerBlock;  // 16 KiB

        // Allocate memory using cudaMallocManaged
        // (malloc may be used on systems with full CUDA unified memory support)
        cudaMallocManaged(&dataPtr, dataSize);

        // Set the advice on the memory region
        cudaMemLocation loc = {.type = cudaMemLocationTypeDevice, .id = myGpuId};
        cudaMemAdvise(dataPtr, dataSize, cudaMemAdviseSetReadMostly, loc);

        int outerLoopIter = 0;
        while (outerLoopIter < maxOuterLoopIter) {
            // The data is written by the CPU each outer loop iteration
            init_data(dataPtr, dataSize);

            // The data is made available to all GPUs by prefetching.
            // Prefetching here causes read duplication of data instead
            // of data migration
            cudaMemLocation location;
            location.type = cudaMemLocationTypeDevice;
            for (int device = 0; device < maxDevices; device++) {
                location.id = device;
                const unsigned int flags = 0;
                cudaMemPrefetchAsync(dataPtr, dataSize, location, flags, stream);
            }

            // The kernel only reads this data in the inner loop
            int innerLoopIter = 0;
            while (innerLoopIter < maxInnerLoopIter) {
                mykernel<<<32, threadsPerBlock, 0, stream>>>((const char *)dataPtr, dataSize);
                innerLoopIter++;
            }
            outerLoopIter++;
        }
        cudaFree(dataPtr);
    }
4.1.4.3 Querying Data Usage Attributes on Managed Memory
A program can query memory range attributes assigned through cudaMemAdvise or cudaMem-
PrefetchAsync on CUDA managed memory by using the following API:

    cudaError_t cudaMemRangeGetAttribute(void *data,
                                         size_t dataSize,
                                         enum cudaMemRangeAttribute attribute,
                                         const void *devPtr,
                                         size_t count);

This function queries an attribute of the memory range starting at devPtr with a size of count bytes.
The memory range must refer to managed memory allocated via cudaMallocManaged or declared via
__managed__ variables. It is possible to query the following attributes:
▶ cudaMemRangeAttributeReadMostly: returns 1 if the entire memory range has the cud-
aMemAdviseSetReadMostly attribute set, or 0 otherwise.
▶ cudaMemRangeAttributePreferredLocation: the result returned will be a GPU device id or
cudaCpuDeviceId if the entire memory range has the corresponding processor as preferred lo-
cation, otherwise cudaInvalidDeviceId will be returned. An application can use this query API
to make decisions about staging data through CPU or GPU depending on the preferred location
attribute of the managed pointer. Note that the actual location of the memory range at the time
of the query may be different from the preferred location.
▶ cudaMemRangeAttributeAccessedBy: will return the list of devices that have that advice set
for that memory range.
| | 160 | | | | | | | | | | Chapter4. | CUDAFeatures | | |
| | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --------- | ------------ | | |
| CUDAProgrammingGuide,Release13.1 | |
▶ cudaMemRangeAttributeLastPrefetchLocation: will return the last location to which the
memory range was prefetched explicitly using cudaMemPrefetchAsync. Note that this simply
returns the last location that the application requested to prefetch the memory range to. It gives
no indication as to whether the prefetch operation to that location has completed or even begun.
▶ cudaMemRangeAttributePreferredLocationType: returns the location type of the pre-
ferred location with the following values:
  ▶ cudaMemLocationTypeDevice: if all pages in the memory range have the same GPU as
  their preferred location,
  ▶ cudaMemLocationTypeHost: if all pages in the memory range have the CPU as their pre-
  ferred location,
  ▶ cudaMemLocationTypeHostNuma: if all the pages in the memory range have the same host
  NUMA node ID as their preferred location,
  ▶ cudaMemLocationTypeInvalid: if either all the pages don't have the same preferred lo-
  cation or some of the pages don't have a preferred location at all.
▶ cudaMemRangeAttributePreferredLocationId: returns the device ordinal if the cudaMem-
RangeAttributePreferredLocationType query for the same address range returns cud-
aMemLocationTypeDevice. If the preferred location type is a host NUMA node, it returns the
host NUMA node ID. Otherwise, the id should be ignored.
▶ cudaMemRangeAttributeLastPrefetchLocationType: returns the last location type to
which all pages in the memory range were prefetched explicitly via cudaMemPrefetchAsync.
The following values are returned:
  ▶ cudaMemLocationTypeDevice: if all pages in the memory range were prefetched to the
  same GPU,
  ▶ cudaMemLocationTypeHost: if all pages in the memory range were prefetched to the CPU,
  ▶ cudaMemLocationTypeHostNuma: if all the pages in the memory range were prefetched to
  the same host NUMA node ID,
  ▶ cudaMemLocationTypeInvalid: if either all the pages were not prefetched to the same
  location or some of the pages were never prefetched at all.
▶ cudaMemRangeAttributeLastPrefetchLocationId: if the cudaMemRangeAttribute-
LastPrefetchLocationType query for the same address range returns cudaMemLocation-
TypeDevice, it will be a valid device ordinal, or if it returns cudaMemLocationTypeHostNuma, it
will be a valid host NUMA node ID. Otherwise, the id should be ignored.
Additionally, multiple attributes can be queried by using the corresponding cudaMemRangeGetAt-
tributes function.
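For instance, a program could check whether a range still carries the read-mostly advice before deciding how to update it. This is a sketch; dataPtr and dataSize are assumed to refer to a managed allocation:

```cpp
// Query a single attribute of the managed range [dataPtr, dataPtr + dataSize).
int readMostly = 0;
cudaMemRangeGetAttribute(&readMostly, sizeof(readMostly),
                         cudaMemRangeAttributeReadMostly,
                         dataPtr, dataSize);
if (readMostly) {
    // The entire range has cudaMemAdviseSetReadMostly set; writes to it
    // may be more expensive than reads.
}
```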
4.1.4.4 GPU Memory Oversubscription
Unified memory enables applications to oversubscribe the memory of any individual processor: in other
words, they can allocate and share arrays larger than the memory capacity of any individual processor
in the system, enabling, among other things, out-of-core processing of datasets that do not fit within a single
GPU, without adding significant complexity to the programming model.
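As a sketch (not from the guide itself), a managed allocation can simply exceed the free memory of the current GPU; on systems that support oversubscription, pages are then migrated on demand as they are touched. Error checking is elided:

```cpp
// Query how much memory the current GPU has.
size_t freeBytes, totalBytes;
cudaMemGetInfo(&freeBytes, &totalBytes);

// Allocate more managed memory than the GPU physically has.
char *big;
cudaMallocManaged(&big, totalBytes + (1ull << 30));

// Kernels may access the whole range; pages not resident on the GPU
// are migrated in (and evicted back to host memory) as needed.
```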
4.2. CUDA Graphs
CUDA Graphs present another model for work submission in CUDA. A graph is a series of operations,
such as kernel launches, data movement, etc., connected by dependencies, which is defined sepa-
rately from its execution. This allows a graph to be defined once and then launched repeatedly. Sep-
arating out the definition of a graph from its execution enables a number of optimizations: first, CPU
launch costs are reduced compared to streams, because much of the setup is done in advance; sec-
ond, presenting the whole workflow to CUDA enables optimizations which might not be possible with
the piecewise work submission mechanism of streams.
To see the optimizations possible with graphs, consider what happens in a stream: when you place a
kernel into a stream, the host driver performs a sequence of operations in preparation for the execu-
tion of the kernel on the GPU. These operations, necessary for setting up and launching the kernel,
are an overhead cost which must be paid for each kernel that is issued. For a GPU kernel with a short
execution time, this overhead cost can be a significant fraction of the overall end-to-end execution
time. By creating a CUDA graph that encompasses a workflow that will be launched many times, these
overhead costs can be paid once for the entire graph during instantiation, and the graph itself can then
be launched repeatedly with very little overhead.
4.2.1. Graph Structure
An operation forms a node in a graph. The dependencies between the operations are the edges. These
dependencies constrain the execution sequence of the operations.
An operation may be scheduled at any time once the nodes on which it depends are complete. Schedul-
ing is left up to the CUDA system.
4.2.1.1 Node Types
A graph node can be one of:
▶ kernel
▶ CPU function call
▶ memory copy
▶ memset
▶ empty node
▶ waiting on a CUDA event
▶ recording a CUDA event
▶ signalling an external semaphore
▶ waiting on an external semaphore
▶ conditional node
▶ memory node
▶ child graph: To execute a separate nested graph, as shown in the following figure.
Figure 21: Child Graph Example
4.2.1.2 Edge Data
CUDA 12.3 introduced edge data on CUDA Graphs. At this time, the only use for non-default edge data
is enabling Programmatic Dependent Launch.
Generally speaking, edge data modifies a dependency specified by an edge and consists of three parts:
an outgoing port, an incoming port, and a type. An outgoing port specifies when an associated edge
is triggered. An incoming port specifies what portion of a node is dependent on an associated edge.
A type modifies the relation between the endpoints.
Port values are specific to node type and direction, and edge types may be restricted to specific node
types. In all cases, zero-initialized edge data represents default behavior. Outgoing port 0 waits on an
entire task, incoming port 0 blocks an entire task, and edge type 0 is associated with a full dependency
with memory synchronizing behavior.
Edge data is optionally specified in various graph APIs via a parallel array to the associated nodes. If
it is omitted as an input parameter, zero-initialized data is used. If it is omitted as an output (query)
parameter, the API accepts this if the edge data being ignored is all zero-initialized, and returns cud-
aErrorLossyQuery if the call would discard information.
Edge data is also available in some stream capture APIs: cudaStreamBeginCaptureToGraph(), cu-
daStreamGetCaptureInfo(), and cudaStreamUpdateCaptureDependencies(). In these cases,
there is not yet a downstream node. The data is associated with a dangling edge (half edge) which
will either be connected to a future captured node or discarded at termination of stream capture.
Note that some edge types do not wait on full completion of the upstream node. These edges are
ignored when considering if a stream capture has been fully rejoined to the origin stream, and cannot
be discarded at the end of capture. See Stream Capture.
No node types define additional incoming ports, and only kernel nodes define additional outgoing
ports. There is one non-default dependency type, cudaGraphDependencyTypeProgrammatic, which
is used to enable Programmatic Dependent Launch between two kernel nodes.
4.2.2. Building and Running Graphs
Work submission using graphs is separated into three distinct stages: definition, instantiation, and
execution.
▶ During the definition or creation phase, a program creates a description of the operations in the
graph along with the dependencies between them.
▶ Instantiation takes a snapshot of the graph template, validates it, and performs much of the
setup and initialization of work with the aim of minimizing what needs to be done at launch. The
resulting instance is known as an executable graph.
▶ An executable graph may be launched into a stream, similar to any other CUDA work. It may be
launched any number of times without repeating the instantiation.
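The three stages can be sketched as follows (a sketch, assuming graph already describes the work and stream is an existing CUDA stream; error checking is elided):

```cpp
// Definition: the graph template was built earlier, either with the
// explicit graph API or with stream capture.
// cudaGraph_t graph = ...;

// Instantiation: validate the template and do launch setup once.
cudaGraphExec_t graphExec;
cudaGraphInstantiate(&graphExec, graph, 0);

// Execution: launch the executable graph repeatedly at low cost,
// without repeating the instantiation.
for (int i = 0; i < 1000; ++i)
    cudaGraphLaunch(graphExec, stream);

cudaGraphExecDestroy(graphExec);
cudaGraphDestroy(graph);
```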
4.2.2.1 Graph Creation
Graphs can be created via two mechanisms: using the explicit Graph API and via stream capture.
4.2.2.1.1 Graph APIs
The following is an example (omitting declarations and other boilerplate code) of creating the below
graph. Note the use of cudaGraphCreate() to create the graph and cudaGraphAddNode() to add
the kernel nodes and their dependencies. The CUDA Runtime API documentation lists all the functions
available for adding nodes and dependencies.
Figure 22: Creating a Graph Using Graph APIs Example

    // Create the graph - it starts out empty
    cudaGraphCreate(&graph, 0);

    // Create the nodes and their dependencies
    cudaGraphNode_t nodes[4];
    cudaGraphNodeParams kParams = { cudaGraphNodeTypeKernel };
    kParams.kernel.func = (void *)kernelName;
    kParams.kernel.gridDim.x = kParams.kernel.gridDim.y = kParams.kernel.gridDim.z = 1;
    kParams.kernel.blockDim.x = kParams.kernel.blockDim.y = kParams.kernel.blockDim.z = 1;
    cudaGraphAddNode(&nodes[0], graph, NULL, NULL, 0, &kParams);
    cudaGraphAddNode(&nodes[1], graph, &nodes[0], NULL, 1, &kParams);
    cudaGraphAddNode(&nodes[2], graph, &nodes[0], NULL, 1, &kParams);
    cudaGraphAddNode(&nodes[3], graph, &nodes[1], NULL, 2, &kParams);
The example above shows four kernel nodes with dependencies between them to illustrate the creation
of a very simple graph. In a typical user application there would also need to be nodes added for
memory operations, such as cudaGraphAddMemcpyNode() and the like. For a full reference of all graph
API functions to add nodes, see the CUDA Runtime API documentation.
4.2.2.1.2 Stream Capture
Stream capture provides a mechanism to create a graph from existing stream-based APIs. A section
of code which launches work into streams, including existing code, can be bracketed with calls to
cudaStreamBeginCapture() and cudaStreamEndCapture(). See below.

    cudaGraph_t graph;

    cudaStreamBeginCapture(stream);

    kernel_A<<< ..., stream >>>(...);
    kernel_B<<< ..., stream >>>(...);
    libraryCall(stream);
    kernel_C<<< ..., stream >>>(...);

    cudaStreamEndCapture(stream, &graph);
A call to cudaStreamBeginCapture() places a stream in capture mode. When a stream is being
captured, work launched into the stream is not enqueued for execution. It is instead appended to
an internal graph that is progressively being built up. This graph is then returned by calling cudaS-
treamEndCapture(), which also ends capture mode for the stream. A graph which is actively being
constructed by stream capture is referred to as a capture graph.
Stream capture can be used on any CUDA stream except cudaStreamLegacy (the "NULL stream").
Note that it can be used on cudaStreamPerThread. If a program is using the legacy stream, it may
be possible to redefine stream 0 to be the per-thread stream with no functional change. See Blocking
and non-blocking streams and the default stream.
Whether a stream is being captured can be queried with cudaStreamIsCapturing().
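A sketch of such a query (the status values are part of the runtime API; stream is assumed to be an existing CUDA stream):

```cpp
cudaStreamCaptureStatus status;
cudaStreamIsCapturing(stream, &status);
if (status == cudaStreamCaptureStatusActive) {
    // The stream is currently being captured; work launched into it
    // is appended to the capture graph rather than enqueued for execution.
}
```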
Work can be captured to an existing graph using cudaStreamBeginCaptureToGraph(). Instead of
capturing to an internal graph, work is captured to a graph provided by the user.
4.2.2.1.2.1 Cross-stream Dependencies and Events
Stream capture can handle cross-stream dependencies expressed with cudaEventRecord() and cu-
daStreamWaitEvent(), provided the event being waited upon was recorded into the same capture
graph.
When an event is recorded in a stream that is in capture mode, it results in a captured event. A captured
event represents a set of nodes in a capture graph.
When a captured event is waited on by a stream, it places the stream in capture mode if it is not already,
and the next item in the stream will have additional dependencies on the nodes in the captured event.
The two streams are then being captured to the same capture graph.
When cross-stream dependencies are present in stream capture, cudaStreamEndCapture() must
still be called in the same stream where cudaStreamBeginCapture() was called; this is the origin
stream. Any other streams which are being captured to the same capture graph, due to event-based
dependencies, must also be joined back to the origin stream. This is illustrated below. All streams being
captured to the same capture graph are taken out of capture mode upon cudaStreamEndCapture().
Failure to rejoin to the origin stream will result in failure of the overall capture operation.
    // stream1 is the origin stream
    cudaStreamBeginCapture(stream1);

    kernel_A<<< ..., stream1 >>>(...);

    // Fork into stream2
    cudaEventRecord(event1, stream1);
    cudaStreamWaitEvent(stream2, event1);

    kernel_B<<< ..., stream1 >>>(...);
    kernel_C<<< ..., stream2 >>>(...);

    // Join stream2 back to origin stream (stream1)
    cudaEventRecord(event2, stream2);
    cudaStreamWaitEvent(stream1, event2);

    kernel_D<<< ..., stream1 >>>(...);

    // End capture in the origin stream
    cudaStreamEndCapture(stream1, &graph);

    // stream1 and stream2 no longer in capture mode
The graph returned by the above code is shown in Figure 22.
Note
When a stream is taken out of capture mode, the next non-captured item in the stream (if any) will
still have a dependency on the most recent prior non-captured item, despite intermediate items
having been removed.
| 4.2.2.1.2.2 ProhibitedandUnhandledOperations | |
It is invalid to synchronize or query the execution status of a stream which is being captured or a captured event, because they do not represent items scheduled for execution. It is also invalid to query the execution status of or synchronize a broader handle which encompasses an active stream capture, such as a device or context handle when any associated stream is in capture mode.
When any stream in the same context is being captured, and it was not created with cudaStreamNonBlocking, any attempted use of the legacy stream is invalid. This is because the legacy stream handle at all times encompasses these other streams; enqueueing to the legacy stream would create a dependency on the streams being captured, and querying it or synchronizing it would query or synchronize the streams being captured.
It is therefore also invalid to call synchronous APIs in this case. One example of a synchronous API is cudaMemcpy(), which enqueues work to the legacy stream and synchronizes on it before returning.
Note
As a general rule, when a dependency relation would connect something that is captured with something that was not captured and instead enqueued for execution, CUDA prefers to return an error rather than ignore the dependency. An exception is made for placing a stream into or out of capture mode; this severs a dependency relation between items added to the stream immediately before and after the mode transition.
It is invalid to merge two separate capture graphs by waiting on a captured event from a stream which is being captured and is associated with a different capture graph than the event. It is invalid to wait on a non-captured event from a stream which is being captured without specifying the cudaEventWaitExternal flag.
A small number of APIs that enqueue asynchronous operations into streams are not currently supported in graphs and will return an error if called with a stream which is being captured, such as cudaStreamAttachMemAsync().
| 4.2.2.1.2.3 Invalidation | |
When an invalid operation is attempted during stream capture, any associated capture graphs are invalidated. When a capture graph is invalidated, further use of any streams which are being captured or captured events associated with the graph is invalid and will return an error, until stream capture is ended with cudaStreamEndCapture(). This call will take the associated streams out of capture mode, but will also return an error value and a NULL graph.
4.2.2.1.2.4 Capture Introspection
Active stream capture operations can be inspected using cudaStreamGetCaptureInfo(). This allows the user to obtain the status of the capture, a unique (per-process) ID for the capture, the underlying graph object, and dependency/edge data for the next node to be captured in the stream. This dependency information can be used to obtain a handle to the node(s) which were last captured in the stream.
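As an illustration, the introspection call can be used as in the following minimal sketch (it assumes a stream named stream is currently in capture mode; error checking is omitted):

```cpp
cudaStreamCaptureStatus status;
unsigned long long captureId;
cudaGraph_t capturedGraph;
const cudaGraphNode_t *deps;
size_t numDeps;

// Query capture status, the per-process capture ID, the underlying
// graph, and the dependency set the next captured node would attach to.
cudaStreamGetCaptureInfo(stream, &status, &captureId,
                         &capturedGraph, &deps, &numDeps);

if (status == cudaStreamCaptureStatusActive && numDeps > 0) {
    // deps[0..numDeps-1] are the node(s) most recently captured
    // into the stream.
}
```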
4.2.2.1.3 Putting It All Together
The example in Figure 22 is a simplistic example intended to show a small graph conceptually. In an application that utilizes CUDA graphs, there is more complexity to using either the graph API or stream capture. The following code snippet shows a side-by-side comparison of the Graph API and stream capture to create a CUDA graph to execute a simple two-stage reduction algorithm.
Figure 23 is an illustration of this CUDA graph and was generated using the cudaGraphDebugDotPrint function applied to the code below, with small adjustments for readability, and then rendered using Graphviz.
Figure 23: CUDA graph example using two-stage reduction kernel
Graph API
void cudaGraphsManual(float *inputVec_h, float *inputVec_d,
                      double *outputVec_d, double *result_d,
                      size_t inputSize, size_t numOfBlocks)
{
    cudaStream_t streamForGraph;
    cudaGraph_t graph;
    std::vector<cudaGraphNode_t> nodeDependencies;
    cudaGraphNode_t memcpyNode, kernelNode, memsetNode;
    double result_h = 0.0;

    cudaStreamCreate(&streamForGraph);

    cudaKernelNodeParams kernelNodeParams = {0};
    cudaMemcpy3DParms memcpyParams = {0};
    cudaMemsetParams memsetParams = {0};

    memcpyParams.srcArray = NULL;
    memcpyParams.srcPos = make_cudaPos(0, 0, 0);
    memcpyParams.srcPtr = make_cudaPitchedPtr(inputVec_h, sizeof(float) * inputSize, inputSize, 1);
    memcpyParams.dstArray = NULL;
    memcpyParams.dstPos = make_cudaPos(0, 0, 0);
    memcpyParams.dstPtr = make_cudaPitchedPtr(inputVec_d, sizeof(float) * inputSize, inputSize, 1);
    memcpyParams.extent = make_cudaExtent(sizeof(float) * inputSize, 1, 1);
    memcpyParams.kind = cudaMemcpyHostToDevice;

    memsetParams.dst = (void *)outputVec_d;
    memsetParams.value = 0;
    memsetParams.pitch = 0;
    memsetParams.elementSize = sizeof(float); // elementSize can be max 4 bytes
    memsetParams.width = numOfBlocks * 2;
    memsetParams.height = 1;

    cudaGraphCreate(&graph, 0);
    cudaGraphAddMemcpyNode(&memcpyNode, graph, NULL, 0, &memcpyParams);
    cudaGraphAddMemsetNode(&memsetNode, graph, NULL, 0, &memsetParams);

    nodeDependencies.push_back(memsetNode);
    nodeDependencies.push_back(memcpyNode);

    void *kernelArgs[4] = {(void *)&inputVec_d, (void *)&outputVec_d, &inputSize, &numOfBlocks};

    kernelNodeParams.func = (void *)reduce;
    kernelNodeParams.gridDim = dim3(numOfBlocks, 1, 1);
    kernelNodeParams.blockDim = dim3(THREADS_PER_BLOCK, 1, 1);
    kernelNodeParams.sharedMemBytes = 0;
    kernelNodeParams.kernelParams = (void **)kernelArgs;
    kernelNodeParams.extra = NULL;

    cudaGraphAddKernelNode(&kernelNode, graph, nodeDependencies.data(),
                           nodeDependencies.size(), &kernelNodeParams);

    nodeDependencies.clear();
    nodeDependencies.push_back(kernelNode);

    memset(&memsetParams, 0, sizeof(memsetParams));
    memsetParams.dst = result_d;
    memsetParams.value = 0;
    memsetParams.elementSize = sizeof(float);
    memsetParams.width = 2;
    memsetParams.height = 1;
    cudaGraphAddMemsetNode(&memsetNode, graph, NULL, 0, &memsetParams);

    nodeDependencies.push_back(memsetNode);

    memset(&kernelNodeParams, 0, sizeof(kernelNodeParams));
    kernelNodeParams.func = (void *)reduceFinal;
    kernelNodeParams.gridDim = dim3(1, 1, 1);
    kernelNodeParams.blockDim = dim3(THREADS_PER_BLOCK, 1, 1);
    kernelNodeParams.sharedMemBytes = 0;
    void *kernelArgs2[3] = {(void *)&outputVec_d, (void *)&result_d, &numOfBlocks};
    kernelNodeParams.kernelParams = kernelArgs2;
    kernelNodeParams.extra = NULL;

    cudaGraphAddKernelNode(&kernelNode, graph, nodeDependencies.data(),
                           nodeDependencies.size(), &kernelNodeParams);

    nodeDependencies.clear();
    nodeDependencies.push_back(kernelNode);

    memset(&memcpyParams, 0, sizeof(memcpyParams));
    memcpyParams.srcArray = NULL;
    memcpyParams.srcPos = make_cudaPos(0, 0, 0);
    memcpyParams.srcPtr = make_cudaPitchedPtr(result_d, sizeof(double), 1, 1);
    memcpyParams.dstArray = NULL;
    memcpyParams.dstPos = make_cudaPos(0, 0, 0);
    memcpyParams.dstPtr = make_cudaPitchedPtr(&result_h, sizeof(double), 1, 1);
    memcpyParams.extent = make_cudaExtent(sizeof(double), 1, 1);
    memcpyParams.kind = cudaMemcpyDeviceToHost;

    cudaGraphAddMemcpyNode(&memcpyNode, graph, nodeDependencies.data(),
                           nodeDependencies.size(), &memcpyParams);

    nodeDependencies.clear();
    nodeDependencies.push_back(memcpyNode);

    cudaGraphNode_t hostNode;
    cudaHostNodeParams hostParams = {0};
    hostParams.fn = myHostNodeCallback;
    callBackData_t hostFnData;
    hostFnData.data = &result_h;
    hostFnData.fn_name = "cudaGraphsManual";
    hostParams.userData = &hostFnData;

    cudaGraphAddHostNode(&hostNode, graph, nodeDependencies.data(),
                         nodeDependencies.size(), &hostParams);
}
| | 170 | | | | Chapter4. | CUDAFeatures | | |
| | --- | --- | --- | --- | --------- | ------------ | | |
| CUDAProgrammingGuide,Release13.1 | |
Stream Capture
void cudaGraphsUsingStreamCapture(float *inputVec_h, float *inputVec_d,
                                  double *outputVec_d, double *result_d,
                                  size_t inputSize, size_t numOfBlocks)
{
    cudaStream_t stream1, stream2, stream3, streamForGraph;
    cudaEvent_t forkStreamEvent, memsetEvent1, memsetEvent2;
    cudaGraph_t graph;
    double result_h = 0.0;

    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    cudaStreamCreate(&stream3);
    cudaStreamCreate(&streamForGraph);
    cudaEventCreate(&forkStreamEvent);
    cudaEventCreate(&memsetEvent1);
    cudaEventCreate(&memsetEvent2);

    cudaStreamBeginCapture(stream1, cudaStreamCaptureModeGlobal);

    cudaEventRecord(forkStreamEvent, stream1);
    cudaStreamWaitEvent(stream2, forkStreamEvent, 0);
    cudaStreamWaitEvent(stream3, forkStreamEvent, 0);

    cudaMemcpyAsync(inputVec_d, inputVec_h, sizeof(float) * inputSize,
                    cudaMemcpyDefault, stream1);

    cudaMemsetAsync(outputVec_d, 0, sizeof(double) * numOfBlocks, stream2);
    cudaEventRecord(memsetEvent1, stream2);

    cudaMemsetAsync(result_d, 0, sizeof(double), stream3);
    cudaEventRecord(memsetEvent2, stream3);

    cudaStreamWaitEvent(stream1, memsetEvent1, 0);
    reduce<<<numOfBlocks, THREADS_PER_BLOCK, 0, stream1>>>(inputVec_d,
                                                           outputVec_d,
                                                           inputSize,
                                                           numOfBlocks);

    cudaStreamWaitEvent(stream1, memsetEvent2, 0);
    reduceFinal<<<1, THREADS_PER_BLOCK, 0, stream1>>>(outputVec_d, result_d,
                                                      numOfBlocks);
    cudaMemcpyAsync(&result_h, result_d, sizeof(double), cudaMemcpyDefault,
                    stream1);

    callBackData_t hostFnData = {0};
    hostFnData.data = &result_h;
    hostFnData.fn_name = "cudaGraphsUsingStreamCapture";
    cudaHostFn_t fn = myHostNodeCallback;
    cudaLaunchHostFunc(stream1, fn, &hostFnData);

    cudaStreamEndCapture(stream1, &graph);
}
4.2.2.2 Graph Instantiation
Once a graph has been created, either by the use of the graph API or stream capture, the graph must be instantiated to create an executable graph, which can then be launched. Assuming the cudaGraph_t graph has been created successfully, the following code will instantiate the graph and create the executable graph cudaGraphExec_t graphExec:

cudaGraphExec_t graphExec;
cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
4.2.2.3 Graph Execution
After a graph has been created and instantiated to create an executable graph, it can be launched. Assuming the cudaGraphExec_t graphExec has been created successfully, the following code snippet will launch the graph into the specified stream:

cudaGraphLaunch(graphExec, stream);

Pulling it all together and using the stream capture example from Section 4.2.2.1.2, the following code snippet will create a graph, instantiate it, and launch it:

cudaGraph_t graph;

cudaStreamBeginCapture(stream);

kernel_A<<< ..., stream >>>(...);
kernel_B<<< ..., stream >>>(...);
libraryCall(stream);
kernel_C<<< ..., stream >>>(...);

cudaStreamEndCapture(stream, &graph);

cudaGraphExec_t graphExec;
cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);

cudaGraphLaunch(graphExec, stream);
| | 4.2.3. | Updating | | Instantiated | Graphs | | | | | |
When a workflow changes, the graph becomes out of date and must be modified. Major changes to graph structure (such as topology or node types) require re-instantiation because topology-related optimizations must be re-applied. However, it is common for only node parameters (such as kernel parameters and memory addresses) to change while the graph topology remains the same. For this case, CUDA provides a lightweight "Graph Update" mechanism that allows certain node parameters to be modified in-place without rebuilding the entire graph, which is much more efficient than re-instantiation.
| | 172 | | | | | | Chapter4. | CUDAFeatures | | |
| | --- | --- | --- | --- | --- | --- | --------- | ------------ | | |
| CUDAProgrammingGuide,Release13.1 | |
Updates take effect the next time the graph is launched, so they do not impact previous graph launches, even if they are running at the time of the update. A graph may be updated and relaunched repeatedly, so multiple updates/launches can be queued on a stream.
CUDA provides two mechanisms for updating instantiated graph parameters: whole graph update and individual node update. Whole graph update allows the user to supply a topologically identical cudaGraph_t object whose nodes contain updated parameters. Individual node update allows the user to explicitly update the parameters of individual nodes. Using an updated cudaGraph_t is more convenient when a large number of nodes are being updated, or when the graph topology is unknown to the caller (i.e., the graph resulted from stream capture of a library call). Using individual node update is preferred when the number of changes is small and the user has the handles to the nodes requiring updates. Individual node update skips the topology checks and comparisons for unchanged nodes, so it can be more efficient in many cases.
CUDA also provides a mechanism for enabling and disabling individual nodes without affecting their current parameters.
The following sections explain each approach in more detail.
4.2.3.1 Whole Graph Update
cudaGraphExecUpdate() allows an instantiated graph (the "original graph") to be updated with the parameters from a topologically identical graph (the "updating" graph). The topology of the updating graph must be identical to the original graph used to instantiate the cudaGraphExec_t. In addition, the order in which the dependencies are specified must match. Finally, CUDA needs to consistently order the sink nodes (nodes with no outgoing edges). CUDA relies on the order of specific API calls to achieve consistent sink node ordering.
More explicitly, following these rules will cause cudaGraphExecUpdate() to pair the nodes in the original graph and the updating graph deterministically:
1. For any capturing stream, the API calls operating on that stream must be made in the same order, including event wait and other API calls not directly corresponding to node creation.
2. The API calls which directly manipulate a given graph node's incoming edges (including captured stream APIs, node add APIs, and edge addition/removal APIs) must be made in the same order. Moreover, when dependencies are specified in arrays to these APIs, the order in which the dependencies are specified inside those arrays must match.
3. Sink nodes must be consistently ordered. Sink nodes are nodes without dependent nodes/outgoing edges in the final graph at the time of the cudaGraphExecUpdate() invocation. The following operations affect sink node ordering (if present) and must (as a combined set) be made in the same order:
▶ Node add APIs resulting in a sink node.
▶ Edge removal resulting in a node becoming a sink node.
▶ cudaStreamUpdateCaptureDependencies(), if it removes a sink node from a capturing stream's dependency set.
▶ cudaStreamEndCapture().
The following example shows how the API could be used to update an instantiated graph:
cudaGraphExec_t graphExec = NULL;

for (int i = 0; i < 10; i++) {
    cudaGraph_t graph;
    cudaGraphExecUpdateResult updateResult;
    cudaGraphNode_t errorNode;

    // In this example we use stream capture to create the graph.
    // You can also use the Graph API to produce a graph.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

    // Call a user-defined, stream based workload, for example
    do_cuda_work(stream);

    cudaStreamEndCapture(stream, &graph);

    // If we've already instantiated the graph, try to update it directly
    // and avoid the instantiation overhead
    if (graphExec != NULL) {
        // If the graph fails to update, errorNode will be set to the
        // node causing the failure and updateResult will be set to a
        // reason code.
        cudaGraphExecUpdate(graphExec, graph, &errorNode, &updateResult);
    }

    // Instantiate during the first iteration or whenever the update
    // fails for any reason
    if (graphExec == NULL || updateResult != cudaGraphExecUpdateSuccess) {
        // If a previous update failed, destroy the cudaGraphExec_t
        // before re-instantiating it
        if (graphExec != NULL) {
            cudaGraphExecDestroy(graphExec);
        }

        // Instantiate graphExec from graph. The error node and
        // error message parameters are unused here.
        cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
    }

    cudaGraphDestroy(graph);
    cudaGraphLaunch(graphExec, stream);
    cudaStreamSynchronize(stream);
}
A typical workflow is to create the initial cudaGraph_t using either the stream capture or graph API. The cudaGraph_t is then instantiated and launched as normal. After the initial launch, a new cudaGraph_t is created using the same method as the initial graph and cudaGraphExecUpdate() is called. If the graph update is successful, indicated by the updateResult parameter in the above example, the updated cudaGraphExec_t is launched. If the update fails for any reason, cudaGraphExecDestroy() and cudaGraphInstantiate() are called to destroy the original cudaGraphExec_t and instantiate a new one.
It is also possible to update the cudaGraph_t nodes directly (i.e., using cudaGraphKernelNodeSetParams()) and subsequently update the cudaGraphExec_t; however, it is more efficient to use the explicit node update APIs covered in the next section.
Conditional handle flags and default values are updated as part of the graph update.
| | 174 | | | | | | | | | | Chapter4. | CUDAFeatures | | |
| | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --------- | ------------ | | |
| CUDAProgrammingGuide,Release13.1 | |
Please see the Graph API for more information on usage and current limitations.
4.2.3.2 Individual Node Update
Instantiated graph node parameters can be updated directly. This eliminates the overhead of instantiation as well as the overhead of creating a new cudaGraph_t. If the number of nodes requiring update is small relative to the total number of nodes in the graph, it is better to update the nodes individually.
The following methods are available for updating cudaGraphExec_t nodes:
Table 8: Individual Node Update APIs

| API | Node Type |
| --- | --------- |
| cudaGraphExecKernelNodeSetParams() | Kernel node |
| cudaGraphExecMemcpyNodeSetParams() | Memory copy node |
| cudaGraphExecMemsetNodeSetParams() | Memory set node |
| cudaGraphExecHostNodeSetParams() | Host node |
| cudaGraphExecChildGraphNodeSetParams() | Child graph node |
| cudaGraphExecEventRecordNodeSetEvent() | Event record node |
| cudaGraphExecEventWaitNodeSetEvent() | Event wait node |
| cudaGraphExecExternalSemaphoresSignalNodeSetParams() | External semaphore signal node |
| cudaGraphExecExternalSemaphoresWaitNodeSetParams() | External semaphore wait node |
Please see the Graph API for more information on usage and current limitations.
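For example, a kernel node's arguments could be refreshed in place with cudaGraphExecKernelNodeSetParams(). The following is a hedged sketch, not taken from the samples above: it assumes graphExec was instantiated from a graph in which the handle kernelNode was created, and that myKernel, newInput, and newOutput are hypothetical names for the node's function and its new device pointers:

```cpp
// Rebuild the parameter struct with the new arguments. The function,
// grid, and block configuration must remain compatible with the
// original node.
cudaKernelNodeParams p = {0};
p.func           = (void *)myKernel;
p.gridDim        = dim3(numOfBlocks, 1, 1);
p.blockDim       = dim3(THREADS_PER_BLOCK, 1, 1);
p.sharedMemBytes = 0;
void *args[2]    = {(void *)&newInput, (void *)&newOutput};
p.kernelParams   = args;
p.extra          = NULL;

// Apply the update to the executable graph only; the source
// cudaGraph_t is not modified, and the change takes effect on
// the next launch of graphExec.
cudaGraphExecKernelNodeSetParams(graphExec, kernelNode, &p);
```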
4.2.3.3 Individual Node Enable
Kernel, memset and memcpy nodes in an instantiated graph can be enabled or disabled using the cudaGraphNodeSetEnabled() API. This allows the creation of a graph which contains a superset of the desired functionality which can be customized for each launch. The enable state of a node can be queried using the cudaGraphNodeGetEnabled() API.
A disabled node is functionally equivalent to an empty node until it is re-enabled. Node parameters are not affected by enabling/disabling a node. Enable state is unaffected by individual node update or whole graph update with cudaGraphExecUpdate(). Parameter updates while the node is disabled will take effect when the node is re-enabled.
Refer to the Graph API for more information on usage and current limitations.
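As a sketch, an optional node can be toggled between launches (this assumes graphExec was instantiated from a graph containing debugNode, a hypothetical kernel node handle):

```cpp
// Disable the node: on subsequent launches it behaves like an empty node.
cudaGraphNodeSetEnabled(graphExec, debugNode, 0);
cudaGraphLaunch(graphExec, stream);

// Query the current enable state (0 after the call above).
unsigned int enabled;
cudaGraphNodeGetEnabled(graphExec, debugNode, &enabled);

// Re-enable the node: its parameters are exactly as before.
cudaGraphNodeSetEnabled(graphExec, debugNode, 1);
cudaGraphLaunch(graphExec, stream);
```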
4.2.3.4 Graph Update Limitations
Kernel nodes:
▶ The owning context of the function cannot change.
▶ A node whose function originally did not use CUDA dynamic parallelism cannot be updated to a function which uses CUDA dynamic parallelism.
cudaMemset and cudaMemcpy nodes:
▶ The CUDA device(s) to which the operand(s) was allocated/mapped cannot change.
▶ The source/destination memory must be allocated from the same context as the original source/destination memory.
▶ Only 1D cudaMemset/cudaMemcpy nodes can be changed.
Additional memcpy node restrictions:
▶ Changing either the source or destination memory type (i.e., cudaPitchedPtr, cudaArray_t, etc.), or the type of transfer (i.e., cudaMemcpyKind) is not supported.
External semaphore wait nodes and record nodes:
▶ Changing the number of semaphores is not supported.
Conditional nodes:
▶ The order of handle creation and assignment must match between the graphs.
▶ Changing node parameters is not supported (i.e., number of graphs in the conditional, node context, etc.).
▶ Changing parameters of nodes within the conditional body graph is subject to the rules above.
Memory nodes:
▶ It is not possible to update a cudaGraphExec_t with a cudaGraph_t if the cudaGraph_t is currently instantiated as a different cudaGraphExec_t.
There are no restrictions on updates to host nodes, event record nodes, or event wait nodes.
| 4.2.4. Conditional Graph Nodes | |
Conditional nodes allow conditional execution and looping of a graph contained within the conditional node. This allows dynamic and iterative workflows to be represented completely within a graph and frees up the host CPU to perform other work in parallel.
Evaluation of the condition value is performed on the device when the dependencies of the conditional node have been met. Conditional nodes can be one of the following types:
▶ Conditional IF nodes execute their body graph once if the condition value is non-zero when the node is executed. An optional second body graph can be provided and this will be executed once if the condition value is zero when the node is executed.
▶ Conditional WHILE nodes execute their body graph if the condition value is non-zero when the node is executed and will continue to execute their body graph until the condition value is zero.
▶ Conditional SWITCH nodes execute the zero-indexed nth body graph once if the condition value is equal to n. If the condition value does not correspond to a body graph, no body graph is launched.
A condition value is accessed by a conditional handle, which must be created before the node. The condition value can be set by device code using cudaGraphSetConditional(). A default value, applied on each graph launch, can also be specified when the handle is created.
When the conditional node is created, an empty graph is created and the handle is returned to the user so that the graph can be populated. This conditional body graph can be populated using either the graph APIs or cudaStreamBeginCaptureToGraph().
Conditional nodes can be nested.
4.2.4.1 Conditional Handles
| AconditionvalueisrepresentedbycudaGraphConditionalHandleandiscreatedbycudaGraph- | |
| ConditionalHandleCreate(). | |
| The handle must be associated with a single conditional node. Handles cannot be destroyed and as | |
| suchthereisnoneedtokeeptrackofthem. | |
| IfcudaGraphCondAssignDefaultisspecifiedwhenthehandleiscreated,theconditionvaluewillbe | |
| initializedtothespecifieddefaultatthebeginningofeachgraphexecution. Ifthisflagisnotprovided, | |
| theconditionvalueisundefinedatthestartofeachgraphexecutionandcodeshouldnotassumethat | |
| theconditionvaluepersistsacrossexecutions. | |
| Thedefaultvalueandflagsassociatedwithahandlewillbeupdatedduringwholegraphupdate. | |
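For instance, a handle whose condition value resets to 1 at the start of every launch can be created as in this minimal sketch (it assumes graph has already been created with cudaGraphCreate()):

```cpp
cudaGraphConditionalHandle handle;

// Because cudaGraphCondAssignDefault is specified, the default value (1)
// is re-applied to the condition at the beginning of each graph launch.
cudaGraphConditionalHandleCreate(&handle, graph, 1,
                                 cudaGraphCondAssignDefault);
```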
4.2.4.2 Conditional Node Body Graph Requirements
General requirements:
▶ The graph's nodes must all reside on a single device.
▶ The graph can only contain kernel nodes, empty nodes, memcpy nodes, memset nodes, child graph nodes, and conditional nodes.
Kernel nodes:
▶ Use of CUDA Dynamic Parallelism or Device Graph Launch by kernels in the graph is not permitted.
▶ Cooperative launches are permitted so long as MPS is not in use.
Memcpy/Memset nodes:
▶ Only copies/memsets involving device memory and/or pinned device-mapped host memory are permitted.
▶ Copies/memsets involving CUDA arrays are not permitted.
▶ Both operands must be accessible from the current device at time of instantiation. Note that the copy operation will be performed from the device on which the graph resides, even if it is targeting memory on another device.
4.2.4.3 Conditional IF Nodes
The body graph of an IF node will be executed once if the condition is non-zero when the node is executed. The following diagram depicts a 3-node graph where the middle node, B, is a conditional node:
Figure 24: Conditional IF Node
The following code illustrates the creation of a graph containing an IF conditional node. The default value of the condition is set using an upstream kernel. The body of the conditional is populated using the graph API.
| __global__ void setHandle(cudaGraphConditionalHandle handle, int value) | |
| { | |
| ... | |
| | ∕∕ Set | the condition | value | to the | value | passed | | to the | kernel | | | |
| | ------------------------------- | ------------- | ----- | ------ | ------- | ------ | --- | ------ | ------ | --- | | |
| | cudaGraphSetConditional(handle, | | | | value); | | | | | | | |
| ... | |
| } | |
void graphSetup() {
    cudaGraph_t graph;
    cudaGraphExec_t graphExec;
    cudaGraphNode_t node;
    void *kernelArgs[2];
    int value = 1;

    // Create the graph
    cudaGraphCreate(&graph, 0);

    // Create the conditional handle; because no default value is provided,
    // the condition value is undefined at the start of each graph execution
    cudaGraphConditionalHandle handle;
    cudaGraphConditionalHandleCreate(&handle, graph);

    // Use a kernel upstream of the conditional to set the handle value
    cudaGraphNodeParams params = { cudaGraphNodeTypeKernel };
    params.kernel.func = (void *)setHandle;
    params.kernel.gridDim.x = params.kernel.gridDim.y = params.kernel.gridDim.z = 1;
    params.kernel.blockDim.x = params.kernel.blockDim.y = params.kernel.blockDim.z = 1;
    params.kernel.kernelParams = kernelArgs;
    kernelArgs[0] = &handle;
    kernelArgs[1] = &value;
    cudaGraphAddNode(&node, graph, NULL, 0, &params);

    // Create and add the conditional node
    cudaGraphNodeParams cParams = { cudaGraphNodeTypeConditional };
    cParams.conditional.handle = handle;
    cParams.conditional.type   = cudaGraphCondTypeIf;
    cParams.conditional.size   = 1; // There is only an "if" body graph
    cudaGraphAddNode(&node, graph, &node, 1, &cParams);

    // Get the body graph of the conditional node
    cudaGraph_t bodyGraph = cParams.conditional.phGraph_out[0];

    // Populate the body graph of the IF conditional node
    ...
    cudaGraphAddNode(&node, bodyGraph, NULL, 0, &params);

    // Instantiate and launch the graph
    cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
    cudaGraphLaunch(graphExec, 0);
    cudaDeviceSynchronize();

    // Clean up
    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
}
IF nodes can also have an optional second body graph, which is executed once when the node is executed if the condition value is zero.
void graphSetup() {
    cudaGraph_t graph;
    cudaGraphExec_t graphExec;
    cudaGraphNode_t node;
    void *kernelArgs[2];
    int value = 1;

    // Create the graph
    cudaGraphCreate(&graph, 0);

    // Create the conditional handle; because no default value is provided,
    // the condition value is undefined at the start of each graph execution
    cudaGraphConditionalHandle handle;
    cudaGraphConditionalHandleCreate(&handle, graph);

    // Use a kernel upstream of the conditional to set the handle value
    cudaGraphNodeParams params = { cudaGraphNodeTypeKernel };
    params.kernel.func = (void *)setHandle;
    params.kernel.gridDim.x = params.kernel.gridDim.y = params.kernel.gridDim.z = 1;
    params.kernel.blockDim.x = params.kernel.blockDim.y = params.kernel.blockDim.z = 1;
    params.kernel.kernelParams = kernelArgs;
    kernelArgs[0] = &handle;
    kernelArgs[1] = &value;
    cudaGraphAddNode(&node, graph, NULL, 0, &params);

    // Create and add the IF conditional node
    cudaGraphNodeParams cParams = { cudaGraphNodeTypeConditional };
    cParams.conditional.handle = handle;
    cParams.conditional.type   = cudaGraphCondTypeIf;
    cParams.conditional.size   = 2; // There is both an "if" and an "else" body graph
    cudaGraphAddNode(&node, graph, &node, 1, &cParams);

    // Get the body graphs of the conditional node
    cudaGraph_t ifBodyGraph   = cParams.conditional.phGraph_out[0];
    cudaGraph_t elseBodyGraph = cParams.conditional.phGraph_out[1];

    // Populate the body graphs of the IF conditional node
    ...
    cudaGraphAddNode(&node, ifBodyGraph, NULL, 0, &params);
    ...
    cudaGraphAddNode(&node, elseBodyGraph, NULL, 0, &params);

    // Instantiate and launch the graph
    cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
    cudaGraphLaunch(graphExec, 0);
    cudaDeviceSynchronize();

    // Clean up
    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
}
4.2.4.4 Conditional WHILE Nodes
The body graph of a WHILE node will be executed as long as the condition is non-zero. The condition will be evaluated when the node is executed and after completion of the body graph. The following diagram depicts a 3-node graph where the middle node, B, is a conditional node:
Figure 25: Conditional WHILE Node
The following code illustrates the creation of a graph containing a WHILE conditional node. The handle is created using cudaGraphCondAssignDefault to avoid the need for an upstream kernel. The body of the conditional is populated using the graph API.
__global__ void loopKernel(cudaGraphConditionalHandle handle, char *dPtr)
{
    // Decrement the value of dPtr and set the condition value to 0 once dPtr is 0
    if (--(*dPtr) == 0) {
        cudaGraphSetConditional(handle, 0);
    }
}

void graphSetup() {
    cudaGraph_t graph;
    cudaGraphExec_t graphExec;
    cudaGraphNode_t node;
    void *kernelArgs[2];

    // Allocate a byte of device memory to use as input
    char *dPtr;
    cudaMalloc((void **)&dPtr, 1);

    // Create the graph
    cudaGraphCreate(&graph, 0);

    // Create the conditional handle with a default value of 1
    cudaGraphConditionalHandle handle;
    cudaGraphConditionalHandleCreate(&handle, graph, 1, cudaGraphCondAssignDefault);

    // Create and add the WHILE conditional node
    cudaGraphNodeParams cParams = { cudaGraphNodeTypeConditional };
    cParams.conditional.handle = handle;
    cParams.conditional.type   = cudaGraphCondTypeWhile;
    cParams.conditional.size   = 1;
    cudaGraphAddNode(&node, graph, NULL, 0, &cParams);

    // Get the body graph of the conditional node
    cudaGraph_t bodyGraph = cParams.conditional.phGraph_out[0];

    // Populate the body graph of the conditional node
    cudaGraphNodeParams params = { cudaGraphNodeTypeKernel };
    params.kernel.func = (void *)loopKernel;
    params.kernel.gridDim.x = params.kernel.gridDim.y = params.kernel.gridDim.z = 1;
    params.kernel.blockDim.x = params.kernel.blockDim.y = params.kernel.blockDim.z = 1;
    params.kernel.kernelParams = kernelArgs;
    kernelArgs[0] = &handle;
    kernelArgs[1] = &dPtr;
    cudaGraphAddNode(&node, bodyGraph, NULL, 0, &params);

    // Initialize device memory, instantiate, and launch the graph
    cudaMemset(dPtr, 10, 1); // Set dPtr to 10; the loop will run until dPtr is 0
    cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
    cudaGraphLaunch(graphExec, 0);
    cudaDeviceSynchronize();

    // Clean up
    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaFree(dPtr);
}
4.2.4.5 Conditional SWITCH Nodes
The zero-indexed n-th body graph of a SWITCH node will be executed once if the condition is equal to n when the node is executed. The following diagram depicts a 3-node graph where the middle node, B, is a conditional node:
Figure 26: Conditional SWITCH Node
The following code illustrates the creation of a graph containing a SWITCH conditional node. The value of the condition is set using an upstream kernel. The bodies of the conditional are populated using the graph API.
__global__ void setHandle(cudaGraphConditionalHandle handle, int value)
{
    ...
    // Set the condition value to the value passed to the kernel
    cudaGraphSetConditional(handle, value);
    ...
}

void graphSetup() {
    cudaGraph_t graph;
    cudaGraphExec_t graphExec;
    cudaGraphNode_t node;
    void *kernelArgs[2];
    int value = 1;

    // Create the graph
    cudaGraphCreate(&graph, 0);

    // Create the conditional handle; because no default value is provided,
    // the condition value is undefined at the start of each graph execution
    cudaGraphConditionalHandle handle;
    cudaGraphConditionalHandleCreate(&handle, graph);

    // Use a kernel upstream of the conditional to set the handle value
    cudaGraphNodeParams params = { cudaGraphNodeTypeKernel };
    params.kernel.func = (void *)setHandle;
    params.kernel.gridDim.x = params.kernel.gridDim.y = params.kernel.gridDim.z = 1;
    params.kernel.blockDim.x = params.kernel.blockDim.y = params.kernel.blockDim.z = 1;
    params.kernel.kernelParams = kernelArgs;
    kernelArgs[0] = &handle;
    kernelArgs[1] = &value;
    cudaGraphAddNode(&node, graph, NULL, 0, &params);

    // Create and add the conditional SWITCH node
    cudaGraphNodeParams cParams = { cudaGraphNodeTypeConditional };
    cParams.conditional.handle = handle;
    cParams.conditional.type   = cudaGraphCondTypeSwitch;
    cParams.conditional.size   = 5;
    cudaGraphAddNode(&node, graph, &node, 1, &cParams);

    // Get the body graphs of the conditional node
    cudaGraph_t *bodyGraphs = cParams.conditional.phGraph_out;

    // Populate the body graphs of the SWITCH conditional node
    ...
    cudaGraphAddNode(&node, bodyGraphs[0], NULL, 0, &params);
    ...
    cudaGraphAddNode(&node, bodyGraphs[4], NULL, 0, &params);

    // Instantiate and launch the graph
    cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
    cudaGraphLaunch(graphExec, 0);
    cudaDeviceSynchronize();

    // Clean up
    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
}
4.2.5. Graph Memory Nodes
4.2.5.1 Introduction
Graph memory nodes allow graphs to create and own memory allocations. Graph memory nodes have GPU ordered lifetime semantics, which dictate when memory is allowed to be accessed on the device. These GPU ordered lifetime semantics enable driver-managed memory reuse, and match those of the stream ordered allocation APIs cudaMallocAsync and cudaFreeAsync, which may be captured when creating a graph.
Graph allocations have fixed addresses over the life of a graph, including repeated instantiations and launches. This allows the memory to be directly referenced by other operations within the graph without the need of a graph update, even when CUDA changes the backing physical memory. Within a graph, allocations whose graph ordered lifetimes do not overlap may use the same underlying physical memory.
CUDA may reuse the same physical memory for allocations across multiple graphs, aliasing virtual address mappings according to the GPU ordered lifetime semantics. For example, when different graphs are launched into the same stream, CUDA may virtually alias the same physical memory to satisfy the needs of allocations which have single-graph lifetimes.
4.2.5.2 API Fundamentals
Graph memory nodes are graph nodes representing either memory allocation or free actions. As a shorthand, nodes that allocate memory are called allocation nodes. Likewise, nodes that free memory are called free nodes. Allocations created by allocation nodes are called graph allocations. CUDA assigns virtual addresses for the graph allocation at node creation time. While these virtual addresses are fixed for the lifetime of the allocation node, the allocation contents are not persistent past the freeing operation and may be overwritten by accesses referring to a different allocation.
Graph allocations are considered recreated every time a graph runs. A graph allocation's lifetime, which differs from the node's lifetime, begins when GPU execution reaches the allocating graph node and ends when one of the following occurs:
▶ GPU execution reaches the freeing graph node
▶ GPU execution reaches the freeing cudaFreeAsync() stream call
▶ immediately upon the freeing call to cudaFree()
Note
Graph destruction does not automatically free any live graph-allocated memory, even though it ends the lifetime of the allocation node. The allocation must subsequently be freed in another graph, or using cudaFreeAsync()/cudaFree().
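The note above can be made concrete with a small sketch: a graph whose only memory node is an allocation is launched and then destroyed, and the still-live allocation is freed afterwards through the stream ordered API. This is a hypothetical illustration using the node-creation pattern described later in this section; the function name allocOutlivesGraph is an assumption, not a CUDA API.

```cuda
#include <cuda_runtime.h>

// Sketch: the graph allocates but never frees, so the allocation
// outlives both the executable graph and the graph itself.
void allocOutlivesGraph(size_t size, cudaStream_t stream)
{
    cudaGraph_t graph;
    cudaGraphExec_t graphExec;
    cudaGraphNode_t allocNode;
    cudaGraphCreate(&graph, 0);

    cudaGraphNodeParams params = { cudaGraphNodeTypeMemAlloc };
    params.alloc.poolProps.allocType = cudaMemAllocationTypePinned;
    params.alloc.poolProps.location.type = cudaMemLocationTypeDevice;
    params.alloc.poolProps.location.id = 0;
    params.alloc.bytesize = size;
    cudaGraphAddNode(&allocNode, graph, NULL, NULL, 0, &params);
    void *dptr = params.alloc.dptr; // fixed virtual address of the allocation

    cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
    cudaGraphLaunch(graphExec, stream);

    // Destroying the graph objects ends the allocation node's lifetime,
    // but the graph allocation itself is still live...
    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);

    // ...so it must still be freed explicitly.
    cudaFreeAsync(dptr, stream);
}
```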
Just like other graph structure, graph memory nodes are ordered within a graph by dependency edges. A program must guarantee that operations accessing graph memory:
▶ are ordered after the allocation node
▶ are ordered before the operation freeing the memory
Graph allocation lifetimes begin and usually end according to GPU execution (as opposed to API invocation). GPU ordering is the order that work runs on the GPU, as opposed to the order that the work is enqueued or described. Thus, graph allocations are considered "GPU ordered."
4.2.5.2.1 Graph Node APIs
Graph memory nodes may be explicitly created with the node creation API, cudaGraphAddNode. The address allocated when adding a cudaGraphNodeTypeMemAlloc node is returned to the user in the alloc::dptr field of the passed cudaGraphNodeParams structure. All operations using graph allocations inside the allocating graph must be ordered after the allocating node. Similarly, any free nodes must be ordered after all uses of the allocation within the graph. Free nodes are created using cudaGraphAddNode and a node type of cudaGraphNodeTypeMemFree.
In the following figure, there is an example graph with an alloc and a free node. Kernel nodes a, b, and c are ordered after the allocation node and before the free node, such that the kernels can access the allocation. Kernel node e is not ordered after the alloc node and therefore cannot safely access the memory. Kernel node d is not ordered before the free node, therefore it cannot safely access the memory.
Figure 27: Kernel Nodes
The following code snippet establishes the graph in this figure:

// Create the graph - it starts out empty
cudaGraphCreate(&graph, 0);

// parameters for a basic allocation
cudaGraphNodeParams params = { cudaGraphNodeTypeMemAlloc };
params.alloc.poolProps.allocType = cudaMemAllocationTypePinned;
params.alloc.poolProps.location.type = cudaMemLocationTypeDevice;
// specify device 0 as the resident device
params.alloc.poolProps.location.id = 0;
params.alloc.bytesize = size;
cudaGraphAddNode(&allocNode, graph, NULL, NULL, 0, &params);

// create a kernel node that uses the graph allocation
cudaGraphNodeParams nodeParams = { cudaGraphNodeTypeKernel };
nodeParams.kernel.kernelParams[0] = params.alloc.dptr;
// ...set other kernel node parameters...

// add the kernel node to the graph
cudaGraphAddNode(&a, graph, &allocNode, 1, NULL, &nodeParams);
cudaGraphAddNode(&b, graph, &a, 1, NULL, &nodeParams);
cudaGraphAddNode(&c, graph, &a, 1, NULL, &nodeParams);

cudaGraphNode_t dependencies[2];
// kernel nodes b and c are using the graph allocation, so the freeing node
// must depend on them. Since the dependency of node b on node a establishes
// an indirect dependency, the free node does not need to explicitly depend on
// node a.
dependencies[0] = b;
dependencies[1] = c;
cudaGraphNodeParams freeNodeParams = { cudaGraphNodeTypeMemFree };
freeNodeParams.free.dptr = params.alloc.dptr;
cudaGraphAddNode(&freeNode, graph, dependencies, NULL, 2, &freeNodeParams);

// free node does not depend on kernel node d, so it must not access the freed
// graph allocation.
cudaGraphAddNode(&d, graph, &c, NULL, 1, &nodeParams);

// node e does not depend on the allocation node, so it must not access the
// allocation. This would be true even if the freeNode depended on kernel node e.
cudaGraphAddNode(&e, graph, NULL, NULL, 0, &nodeParams);
4.2.5.2.2 Stream Capture
Graph memory nodes can be created by capturing the corresponding stream ordered allocation and free calls cudaMallocAsync and cudaFreeAsync. In this case, the virtual addresses returned by the captured allocation API can be used by other operations inside the graph. Since the stream ordered dependencies will be captured into the graph, the ordering requirements of the stream ordered allocation APIs guarantee that the graph memory nodes will be properly ordered with respect to the captured stream operations (for correctly written stream code).
Ignoring kernel nodes d and e for clarity, the following code snippet shows how to use stream capture to create the graph from the previous figure:
cudaMallocAsync(&dptr, size, stream1);
kernel_A<<< ..., stream1 >>>(dptr, ...);

// Fork into stream2
cudaEventRecord(event1, stream1);
cudaStreamWaitEvent(stream2, event1);

kernel_B<<< ..., stream1 >>>(dptr, ...);

// event dependencies translated into graph dependencies, so the kernel node
// created by the capture of kernel C will depend on the allocation node
// created by capturing the cudaMallocAsync call.
kernel_C<<< ..., stream2 >>>(dptr, ...);

// Join stream2 back to origin stream (stream1)
cudaEventRecord(event2, stream2);
cudaStreamWaitEvent(stream1, event2);

// Free depends on all work accessing the memory.
cudaFreeAsync(dptr, stream1);

// End capture in the origin stream
cudaStreamEndCapture(stream1, &graph);
4.2.5.2.3 Accessing and Freeing Graph Memory Outside of the Allocating Graph
Graph allocations do not have to be freed by the allocating graph. When a graph does not free an allocation, that allocation persists beyond the execution of the graph and can be accessed by subsequent CUDA operations. These allocations may be accessed in another graph or directly using a stream operation, as long as the accessing operation is ordered after the allocation through CUDA events and other stream ordering mechanisms. An allocation may subsequently be freed by regular calls to cudaFree or cudaFreeAsync, by the launch of another graph with a corresponding free node, or by a subsequent launch of the allocating graph (if it was instantiated with the cudaGraphInstantiateFlagAutoFreeOnLaunch flag). It is illegal to access memory after it has been freed - the free operation must be ordered after all operations accessing the memory using graph dependencies, CUDA events, and other stream ordering mechanisms.
Note
Since graph allocations may share underlying physical memory, free operations must be ordered after all device operations complete. Out-of-band synchronization (such as memory-based synchronization within a compute kernel) is insufficient for ordering between memory writes and free operations. For more information, see the Virtual Aliasing Support rules relating to consistency and coherency.
The three following code snippets demonstrate accessing graph allocations outside of the allocating graph, with ordering properly established by: using a single stream, using events between streams, and using events baked into the allocating and freeing graphs.
First, ordering established by using a single stream:
| First,orderingestablishedbyusingasinglestream: | |
| | ∕∕ Contents | of allocating | graph | | | | | | |
| | ----------- | ------------- | ----- | --- | --- | --- | --- | | |
| void *dptr; | |
| | cudaGraphNodeParams | | params | = { cudaGraphNodeTypeMemAlloc | | }; | | | |
| | ------------------- | --- | ------ | ----------------------------- | --- | --- | --- | | |
| params.alloc.poolProps.allocType = cudaMemAllocationTypePinned; | |
| params.alloc.poolProps.location.type = cudaMemLocationTypeDevice; | |
| | params.alloc.bytesize | | = size; | | | | | | |
| | --------------------- | --- | ------- | --- | --- | --- | --- | | |
| cudaGraphAddNode(&allocNode, allocGraph, NULL, NULL, 0, ¶ms); | |
| | dptr = | params.alloc.dptr; | | | | | | | |
| | ------ | ------------------ | --- | --- | --- | --- | --- | | |
| cudaGraphInstantiate(&allocGraphExec, allocGraph, NULL, NULL, 0); | |
| | cudaGraphLaunch(allocGraphExec, | | | stream); | | | | | |
| | ------------------------------- | ----------- | --------- | -------- | --- | --- | --- | | |
| | kernel<<< | ..., stream | >>>(dptr, | ...); | | | | | |
| | cudaFreeAsync(dptr, | | stream); | | | | | | |
Second, ordering established by recording and waiting on CUDA events:

// Contents of allocating graph
void *dptr;
cudaGraphAddNode(&allocNode, allocGraph, NULL, NULL, 0, &allocNodeParams);
dptr = allocNodeParams.alloc.dptr;

// contents of consuming/freeing graph
kernelNodeParams.kernel.kernelParams[0] = allocNodeParams.alloc.dptr;
cudaGraphNodeParams freeNodeParams = { cudaGraphNodeTypeMemFree };
freeNodeParams.free.dptr = dptr;
cudaGraphAddNode(&freeNode, freeGraph, NULL, NULL, 0, &freeNodeParams);

cudaGraphInstantiate(&allocGraphExec, allocGraph, NULL, NULL, 0);
cudaGraphInstantiate(&freeGraphExec, freeGraph, NULL, NULL, 0);

cudaGraphLaunch(allocGraphExec, allocStream);

// establish the dependency of stream2 on the allocation node
// note: the dependency could also have been established with a stream
// synchronize operation
cudaEventRecord(allocEvent, allocStream);
cudaStreamWaitEvent(stream2, allocEvent);
kernel<<< ..., stream2 >>>(dptr, ...);

// establish the dependency between stream3 and the allocation use
cudaEventRecord(streamUseDoneEvent, stream2);
cudaStreamWaitEvent(stream3, streamUseDoneEvent);

// it is now safe to launch the freeing graph, which may also access the memory
cudaGraphLaunch(freeGraphExec, stream3);
| Third,orderingestablishedbyusinggraphexternaleventnodes: | |
```cpp
// Contents of allocating graph
void *dptr;
cudaEvent_t allocEvent;         // event indicating when the allocation will be ready for use.
cudaEvent_t streamUseDoneEvent; // event indicating when the stream operations are done with the allocation.

// Contents of allocating graph with event record node
cudaGraphAddNode(&allocNode, allocGraph, NULL, NULL, 0, &allocNodeParams);
dptr = allocNodeParams.alloc.dptr;

// note: this event record node depends on the alloc node
cudaGraphNodeParams allocEventNodeParams = { cudaGraphNodeTypeEventRecord };
allocEventNodeParams.eventRecord.event = allocEvent;
cudaGraphAddNode(&recordNode, allocGraph, &allocNode, NULL, 1, &allocEventNodeParams);
cudaGraphInstantiate(&allocGraphExec, allocGraph, NULL, NULL, 0);

// Contents of consuming/freeing graph with event wait nodes
cudaGraphNodeParams streamWaitEventNodeParams = { cudaGraphNodeTypeEventWait };
streamWaitEventNodeParams.eventWait.event = streamUseDoneEvent;
cudaGraphAddNode(&streamUseDoneEventNode, waitAndFreeGraph, NULL, NULL, 0, &streamWaitEventNodeParams);

cudaGraphNodeParams allocWaitEventNodeParams = { cudaGraphNodeTypeEventWait };
allocWaitEventNodeParams.eventWait.event = allocEvent;
cudaGraphAddNode(&allocReadyEventNode, waitAndFreeGraph, NULL, NULL, 0, &allocWaitEventNodeParams);

// The allocReadyEventNode provides ordering with the alloc node for use in a consuming graph.
kernelNodeParams->kernelParams[0] = allocNodeParams.alloc.dptr;
cudaGraphAddNode(&kernelNode, waitAndFreeGraph, &allocReadyEventNode, NULL, 1, &kernelNodeParams);

// The free node has to be ordered after both external and internal users.
// Thus the node must depend on both the kernelNode and the streamUseDoneEventNode.
dependencies[0] = kernelNode;
dependencies[1] = streamUseDoneEventNode;
cudaGraphNodeParams freeNodeParams = { cudaGraphNodeTypeMemFree };
freeNodeParams.free.dptr = dptr;
cudaGraphAddNode(&freeNode, waitAndFreeGraph, dependencies, NULL, 2, &freeNodeParams);
cudaGraphInstantiate(&waitAndFreeGraphExec, waitAndFreeGraph, NULL, NULL, 0);

cudaGraphLaunch(allocGraphExec, allocStream);

// the dependency of stream2 on the event record node satisfies the ordering requirement
cudaStreamWaitEvent(stream2, allocEvent);
kernel<<< ..., stream2 >>>(dptr, ...);
cudaEventRecord(streamUseDoneEvent, stream2);

// the event wait node in the waitAndFreeGraphExec establishes the dependency
// on streamUseDoneEvent that is needed to prevent the kernel running in
// stream2 from accessing the allocation after the free node in execution order.
cudaGraphLaunch(waitAndFreeGraphExec, stream3);
```
| | 4.2.5.2.4 | cudaGraphInstantiateFlagAutoFreeOnLaunch | | | | | | | |
Under normal circumstances, CUDA will prevent a graph from being relaunched if it has unfreed memory allocations, because multiple allocations at the same address will leak memory. Instantiating a graph with the cudaGraphInstantiateFlagAutoFreeOnLaunch flag allows the graph to be relaunched while it still has unfreed allocations. In this case, the launch automatically inserts an asynchronous free of the unfreed allocations.

Auto free on launch is useful for single-producer multiple-consumer algorithms. At each iteration, a producer graph creates several allocations, and, depending on runtime conditions, a varying set of consumers accesses those allocations. This type of variable execution sequence means that consumers cannot free the allocations, because a subsequent consumer may require access. Auto free on launch means that the launch loop does not need to track the producer's allocations; instead, that information remains isolated to the producer's creation and destruction logic. In general, auto free on launch simplifies an algorithm which would otherwise need to free all the allocations owned by a graph before each relaunch.
Note

The cudaGraphInstantiateFlagAutoFreeOnLaunch flag does not change the behavior of graph destruction. The application must explicitly free the unfreed memory in order to avoid memory leaks, even for graphs instantiated with the flag.

The following code shows the use of cudaGraphInstantiateFlagAutoFreeOnLaunch to simplify a single-producer/multiple-consumer algorithm:
```cpp
// Create producer graph which allocates memory and populates it with data
cudaStreamBeginCapture(cudaStreamPerThread, cudaStreamCaptureModeGlobal);
cudaMallocAsync(&data1, blocks * threads, cudaStreamPerThread);
cudaMallocAsync(&data2, blocks * threads, cudaStreamPerThread);
produce<<<blocks, threads, 0, cudaStreamPerThread>>>(data1, data2);
...
cudaStreamEndCapture(cudaStreamPerThread, &graph);
cudaGraphInstantiateWithFlags(&producer,
                              graph,
                              cudaGraphInstantiateFlagAutoFreeOnLaunch);
cudaGraphDestroy(graph);

// Create first consumer graph by capturing an asynchronous library call
cudaStreamBeginCapture(cudaStreamPerThread, cudaStreamCaptureModeGlobal);
consumerFromLibrary(data1, cudaStreamPerThread);
cudaStreamEndCapture(cudaStreamPerThread, &graph);
cudaGraphInstantiateWithFlags(&consumer1, graph, 0); // regular instantiation
cudaGraphDestroy(graph);

// Create second consumer graph
cudaStreamBeginCapture(cudaStreamPerThread, cudaStreamCaptureModeGlobal);
consume2<<<blocks, threads, 0, cudaStreamPerThread>>>(data2);
...
cudaStreamEndCapture(cudaStreamPerThread, &graph);
cudaGraphInstantiateWithFlags(&consumer2, graph, 0);
cudaGraphDestroy(graph);

// Launch in a loop
bool launchConsumer2 = false;
do {
    cudaGraphLaunch(producer, myStream);
    cudaGraphLaunch(consumer1, myStream);
    if (launchConsumer2) {
        cudaGraphLaunch(consumer2, myStream);
    }
} while (determineAction(&launchConsumer2));

cudaFreeAsync(data1, myStream);
cudaFreeAsync(data2, myStream);

cudaGraphExecDestroy(producer);
cudaGraphExecDestroy(consumer1);
cudaGraphExecDestroy(consumer2);
```
| | 4.2.5.2.5 | MemoryNodesinChildGraphs | | | |
| | --------- | ------------------------ | --- | | |
CUDA 12.9 introduces the ability to move child graph ownership to a parent graph. Child graphs which are moved to the parent are allowed to contain memory allocation and free nodes. This allows a child graph containing allocation or free nodes to be independently constructed prior to its addition in a parent graph.
The following restrictions apply to child graphs after they have been moved:

▶ Cannot be independently instantiated or destroyed.
▶ Cannot be added as a child graph of a separate parent graph.
▶ Cannot be used as an argument to cuGraphExecUpdate.
▶ Cannot have additional memory allocation or free nodes added.
```cpp
// Create the child graph
cudaGraphCreate(&child, 0);

// parameters for a basic allocation
cudaGraphNodeParams allocNodeParams = { cudaGraphNodeTypeMemAlloc };
allocNodeParams.alloc.poolProps.allocType = cudaMemAllocationTypePinned;
allocNodeParams.alloc.poolProps.location.type = cudaMemLocationTypeDevice;
// specify device 0 as the resident device
allocNodeParams.alloc.poolProps.location.id = 0;
allocNodeParams.alloc.bytesize = size;
cudaGraphAddNode(&allocNode, child, NULL, NULL, 0, &allocNodeParams);

// Additional nodes using the allocation could be added here

cudaGraphNodeParams freeNodeParams = { cudaGraphNodeTypeMemFree };
freeNodeParams.free.dptr = allocNodeParams.alloc.dptr;
cudaGraphAddNode(&freeNode, child, &allocNode, NULL, 1, &freeNodeParams);

// Create the parent graph
cudaGraphCreate(&parent, 0);

// Move the child graph to the parent graph
cudaGraphNodeParams childNodeParams = { cudaGraphNodeTypeGraph };
childNodeParams.graph.graph = child;
childNodeParams.graph.ownership = cudaGraphChildGraphOwnershipMove;
cudaGraphAddNode(&parentNode, parent, NULL, NULL, 0, &childNodeParams);
```
4.2.5.3 Optimized Memory Reuse

CUDA reuses memory in two ways:

▶ Virtual and physical memory reuse within a graph is based on virtual address assignment, like in the stream ordered allocator.
▶ Physical memory reuse between graphs is done with virtual aliasing: different graphs can map the same physical memory to their unique virtual addresses.
| | 4.2.5.3.1 | AddressReusewithinaGraph | | | | | | |
| | --------- | ------------------------ | --- | --- | --- | --- | | |
| CUDA may reuse memory within a graph by assigning the same virtual address ranges to different | |
| allocationswhoselifetimesdonotoverlap. Sincevirtualaddressesmaybereused,pointerstodifferent | |
| allocationswithdisjointlifetimesarenotguaranteedtobeunique. | |
| The following figure shows adding a new allocation node (2) that can reuse the address freed by a | |
| dependentnode(1). | |
| | 192 | | | | Chapter4. | CUDAFeatures | | |
| | --- | --- | --- | --- | --------- | ------------ | | |
| CUDAProgrammingGuide,Release13.1 | |
Figure 28: Adding New Alloc Node 2

The following figure shows adding a new alloc node (3). The new alloc node is not dependent on the free node (2), so it cannot reuse the address from the associated alloc node (2). If the alloc node (2) used the address freed by free node (1), the new alloc node (3) would need a new address.
Figure 29: Adding New Alloc Node 3
| | 194 | Chapter4. | CUDAFeatures | | |
| | --- | --------- | ------------ | | |
| CUDAProgrammingGuide,Release13.1 | |
4.2.5.3.2 Physical Memory Management and Sharing

CUDA is responsible for mapping physical memory to the virtual address before the allocating node is reached in GPU order. As an optimization for memory footprint and mapping overhead, multiple graphs may use the same physical memory for distinct allocations if they will not run simultaneously; however, physical pages cannot be reused if they are bound to more than one executing graph at the same time, or to a graph allocation which remains unfreed.

CUDA may update physical memory mappings at any time during graph instantiation, launch, or execution. CUDA may also introduce synchronization between future graph launches in order to prevent live graph allocations from referring to the same physical memory. As for any allocate-free-allocate pattern, if a program accesses a pointer outside of an allocation's lifetime, the erroneous access may silently read or write live data owned by another allocation (even if the virtual address of the allocation is unique). Use of Compute Sanitizer tools can catch this error.
The following figure shows graphs sequentially launched in the same stream. In this example, each graph frees all the memory it allocates. Since the graphs in the same stream never run concurrently, CUDA can and should use the same physical memory to satisfy all the allocations.
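This sequential-launch pattern can be sketched in a few lines; graphExecA, graphExecB, and stream are illustrative names for two instantiated graphs, each assumed to free everything it allocates:

```cpp
// Both graphs allocate and free all of their own memory.
// Launched serially into one stream, their executions cannot overlap,
// so CUDA can back both graphs' allocations with the same physical pages.
cudaGraphLaunch(graphExecA, stream);
cudaGraphLaunch(graphExecB, stream);
cudaStreamSynchronize(stream);
```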
4.2.5.4 Performance Considerations

When multiple graphs are launched into the same stream, CUDA attempts to allocate the same physical memory to them because the execution of these graphs cannot overlap. Physical mappings for a graph are retained between launches as an optimization to avoid the cost of remapping. If, at a later time, one of the graphs is launched such that its execution may overlap with the others (for example, if it is launched into a different stream), then CUDA must perform some remapping, because concurrent graphs require distinct memory to avoid data corruption.

In general, remapping of graph memory in CUDA is likely caused by these operations:

▶ Changing the stream into which a graph is launched
▶ A trim operation on the graph memory pool, which explicitly frees unused memory (discussed in Physical Memory Footprint)
▶ Relaunching a graph while an unfreed allocation from another graph is mapped to the same memory, which causes a remap of memory before relaunch

Remapping must happen in execution order, but after any previous execution of that graph is complete (otherwise memory that is still in use could be unmapped). Due to this ordering dependency, as well as because mapping operations are OS calls, mapping operations can be relatively expensive. Applications can avoid this cost by launching graphs containing allocation memory nodes consistently into the same stream.
4.2.5.4.1 First Launch / cudaGraphUpload

Physical memory cannot be allocated or mapped during graph instantiation because the stream in which the graph will execute is unknown. Mapping is done instead during graph launch. Calling cudaGraphUpload can separate out the cost of allocation from the launch by performing all mappings for that graph immediately and associating the graph with the upload stream. If the graph is then launched into the same stream, it will launch without any additional remapping.

Using different streams for graph upload and graph launch behaves similarly to switching streams, likely resulting in remap operations. In addition, unrelated memory pool management is permitted to pull memory from an idle stream, which could negate the impact of the uploads.
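A minimal sketch of the upload-then-launch pattern described above; graphExec and workStream are illustrative names for an instantiated graph containing allocation nodes and the stream it will be launched into:

```cpp
// Perform the allocation and mapping work up front, associating
// the graph's mappings with workStream.
cudaGraphUpload(graphExec, workStream);

// Launching into the same stream avoids additional remapping,
// and relaunches into workStream retain the existing mappings.
cudaGraphLaunch(graphExec, workStream);
cudaGraphLaunch(graphExec, workStream);
```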
Figure 30: Sequentially Launched Graphs
| | 196 | Chapter4. | CUDAFeatures | | |
| | --- | --------- | ------------ | | |
| CUDAProgrammingGuide,Release13.1 | |
4.2.5.5 Physical Memory Footprint

The pool-management behavior of asynchronous allocation means that destroying a graph which contains memory nodes (even if their allocations are free) will not immediately return physical memory to the OS for use by other processes. To explicitly release memory back to the OS, an application should use the cudaDeviceGraphMemTrim API.

cudaDeviceGraphMemTrim will unmap and release any physical memory reserved by graph memory nodes that is not actively in use. Allocations that have not been freed and graphs that are scheduled or running are considered to be actively using the physical memory and will not be impacted. Use of the trim API will make physical memory available to other allocation APIs and other applications or processes, but will cause CUDA to reallocate and remap memory when the trimmed graphs are next launched. Note that cudaDeviceGraphMemTrim operates on a different pool from cudaMemPoolTrimTo(). The graph memory pool is not exposed to the stream ordered memory allocator.

CUDA allows applications to query their graph memory footprint through the cudaDeviceGetGraphMemAttribute API. Querying the attribute cudaGraphMemAttrReservedMemCurrent returns the amount of physical memory reserved by the driver for graph allocations in the current process. Querying cudaGraphMemAttrUsedMemCurrent returns the amount of physical memory currently mapped by at least one graph. Either of these attributes can be used to track when new physical memory is acquired by CUDA for the sake of an allocating graph. Both of these attributes are useful for examining how much memory is saved by the sharing mechanism.
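A sketch of querying the footprint attributes and trimming, assuming device 0 and that the attribute values are read into 64-bit counters:

```cpp
int device = 0;
unsigned long long reserved = 0, used = 0;

// Physical memory reserved by the driver for graph allocations in this process.
cudaDeviceGetGraphMemAttribute(device, cudaGraphMemAttrReservedMemCurrent, &reserved);

// Physical memory currently mapped by at least one graph.
cudaDeviceGetGraphMemAttribute(device, cudaGraphMemAttrUsedMemCurrent, &used);

// Release reserved-but-unused graph memory back to the OS.
// Trimmed graphs will reallocate and remap on their next launch.
cudaDeviceGraphMemTrim(device);
```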
4.2.5.6 Peer Access

Graph allocations can be configured for access from multiple GPUs, in which case CUDA will map the allocations onto the peer GPUs as required. CUDA allows graph allocations requiring different mappings to reuse the same virtual address. When this occurs, the address range is mapped onto all GPUs required by the different allocations. This means an allocation may sometimes allow more peer access than was requested during its creation; however, relying on these extra mappings is still an error.
| | 4.2.5.6.1 | PeerAccesswithGraphNodeAPIs | | | | | | | |
| | --------- | --------------------------- | --- | --- | --- | --- | --- | | |
| The cudaGraphAddNode API accepts mapping requests in the accessDescs array field of the alloc | |
| nodeparametersstructures. ThepoolProps.locationembeddedstructurespecifiestheresident | |
| devicefortheallocation. AccessfromtheallocatingGPUisassumedtobeneeded,thustheapplication | |
| doesnotneedtospecifyanentryfortheresidentdeviceintheaccessDescsarray. | |
```cpp
cudaGraphNodeParams allocNodeParams = { cudaGraphNodeTypeMemAlloc };
allocNodeParams.alloc.poolProps.allocType = cudaMemAllocationTypePinned;
allocNodeParams.alloc.poolProps.location.type = cudaMemLocationTypeDevice;
// specify device 1 as the resident device
allocNodeParams.alloc.poolProps.location.id = 1;
allocNodeParams.alloc.bytesize = size;

// allocate an allocation resident on device 1 accessible from device 1
cudaGraphAddNode(&allocNode, graph, NULL, NULL, 0, &allocNodeParams);

cudaMemAccessDesc accessDescs[2];
// boilerplate for the access descs (only ReadWrite and Device access supported
// by the add node api)
accessDescs[0].flags = cudaMemAccessFlagsProtReadWrite;
accessDescs[0].location.type = cudaMemLocationTypeDevice;
accessDescs[1].flags = cudaMemAccessFlagsProtReadWrite;
accessDescs[1].location.type = cudaMemLocationTypeDevice;

// access being requested for devices 0 & 2. Device 1 access requirement left implicit.
accessDescs[0].location.id = 0;
accessDescs[1].location.id = 2;

// access request array has 2 entries.
allocNodeParams.alloc.accessDescCount = 2;
allocNodeParams.alloc.accessDescs = accessDescs;

// allocate an allocation resident on device 1 accessible from devices 0, 1 and 2.
// (0 & 2 from the descriptors, 1 from it being the resident device).
cudaGraphAddNode(&allocNode, graph, NULL, NULL, 0, &allocNodeParams);
```
| | 4.2.5.6.2 | PeerAccesswithStreamCapture | | | | | | | | |
| | --------- | --------------------------- | --- | --- | --- | --- | --- | --- | | |
| For stream capture, the allocation node records the peer accessibility of the allocating pool at | |
| the time of the capture. Altering the peer accessibility of the allocating pool after a cudaMal- | |
| locFromPoolAsync call is captured does not affect the mappings that the graph will make for the | |
| allocation. | |
```cpp
// boilerplate for the access desc (only ReadWrite and Device access supported
// by the add node api)
cudaMemAccessDesc accessDesc;
accessDesc.flags = cudaMemAccessFlagsProtReadWrite;
accessDesc.location.type = cudaMemLocationTypeDevice;
accessDesc.location.id = 1;

// let memPool be resident and accessible on device 0

cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
cudaMallocAsync(&dptr1, size, memPool, stream);
cudaStreamEndCapture(stream, &graph1);

cudaMemPoolSetAccess(memPool, &accessDesc, 1);

cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
cudaMallocAsync(&dptr2, size, memPool, stream);
cudaStreamEndCapture(stream, &graph2);

// The graph node allocating dptr1 would only have the device 0 accessibility
// even though memPool now has device 1 accessibility.
// The graph node allocating dptr2 will have device 0 and device 1 accessibility,
// since that was the pool accessibility at the time of the cudaMallocAsync call.
```
| | 4.2.6. | Device | Graph | Launch | | | | | | |
| | ------ | ------ | ----- | ------ | --- | --- | --- | --- | | |
| Therearemanyworkflowswhichneedtomakedata-dependentdecisionsduringruntimeandexecute | |
| differentoperationsdependingonthosedecisions. Ratherthanoffloadingthisdecision-makingpro- | |
| cess to the host, which may require a round-trip from the device, users may prefer to perform it on | |
| thedevice. Tothatend,CUDAprovidesamechanismtolaunchgraphsfromthedevice. | |
| | 198 | | | | | | Chapter4. | CUDAFeatures | | |
| | --- | --- | --- | --- | --- | --- | --------- | ------------ | | |
| CUDAProgrammingGuide,Release13.1 | |
Device graph launch provides a convenient way to perform dynamic control flow from the device, be it something as simple as a loop or as complex as a device-side work scheduler.

Graphs which can be launched from the device will henceforth be referred to as device graphs, and graphs which cannot be launched from the device will be referred to as host graphs.

Device graphs can be launched from both the host and device, whereas host graphs can only be launched from the host. Unlike host launches, launching a device graph from the device while a previous launch of the graph is running will result in an error, returning cudaErrorInvalidValue; therefore, a device graph cannot be launched twice from the device at the same time. Launching a device graph from the host and device simultaneously will result in undefined behavior.
4.2.6.1 Device Graph Creation

In order for a graph to be launched from the device, it must be instantiated explicitly for device launch. This is achieved by passing the cudaGraphInstantiateFlagDeviceLaunch flag to the cudaGraphInstantiate() call. As is the case for host graphs, device graph structure is fixed at time of instantiation and cannot be updated without re-instantiation, and instantiation can only be performed on the host. In order for a graph to be able to be instantiated for device launch, it must adhere to various requirements.
4.2.6.1.1 Device Graph Requirements

General requirements:

▶ The graph's nodes must all reside on a single device.
▶ The graph can only contain kernel nodes, memcpy nodes, memset nodes, and child graph nodes.

Kernel nodes:

▶ Use of CUDA Dynamic Parallelism by kernels in the graph is not permitted.
▶ Cooperative launches are permitted so long as MPS is not in use.

Memcpy nodes:

▶ Only copies involving device memory and/or pinned device-mapped host memory are permitted.
▶ Copies involving CUDA arrays are not permitted.
▶ Both operands must be accessible from the current device at time of instantiation. Note that the copy operation will be performed from the device on which the graph resides, even if it is targeting memory on another device.
4.2.6.1.2 Device Graph Upload

In order to launch a graph on the device, it must first be uploaded to the device to populate the necessary device resources. This can be achieved in one of two ways.

Firstly, the graph can be uploaded explicitly, either via cudaGraphUpload() or by requesting an upload as part of instantiation via cudaGraphInstantiateWithParams().

Alternatively, the graph can first be launched from the host, which will perform this upload step implicitly as part of the launch.

Examples of all three methods can be seen below:
```cpp
// Explicit upload after instantiation
cudaGraphInstantiate(&deviceGraphExec1, deviceGraph1, cudaGraphInstantiateFlagDeviceLaunch);
cudaGraphUpload(deviceGraphExec1, stream);

// Explicit upload as part of instantiation
cudaGraphInstantiateParams instantiateParams = {0};
instantiateParams.flags = cudaGraphInstantiateFlagDeviceLaunch | cudaGraphInstantiateFlagUpload;
instantiateParams.uploadStream = stream;
cudaGraphInstantiateWithParams(&deviceGraphExec2, deviceGraph2, &instantiateParams);

// Implicit upload via host launch
cudaGraphInstantiate(&deviceGraphExec3, deviceGraph3, cudaGraphInstantiateFlagDeviceLaunch);
cudaGraphLaunch(deviceGraphExec3, stream);
```
| | 4.2.6.1.3 DeviceGraphUpdate | | | | | | | |
| | --------------------------- | --- | --- | --- | --- | --- | | |
| Devicegraphscanonlybeupdatedfromthehost,andmustbere-uploadedtothedeviceuponexe- | |
| cutable graph update in order for the changes to take effect. This can be achieved using the same | |
| methodsoutlinedinSectiondevice-graph-upload. Unlikehostgraphs,launchingadevicegraphfrom | |
| thedevicewhileanupdateisbeingappliedwillresultinundefinedbehavior. | |
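The update-then-re-upload flow might look like this sketch, where deviceGraphExec is a device-launchable executable graph, updatedGraph is a topologically matching cudaGraph_t with new parameters, and stream is the upload stream (all names illustrative):

```cpp
// Apply the update from the host.
cudaGraphExecUpdateResultInfo resultInfo;
cudaGraphExecUpdate(deviceGraphExec, updatedGraph, &resultInfo);

// Re-upload so device-side launches observe the updated parameters.
// Avoid device launches of this graph until the upload has completed.
cudaGraphUpload(deviceGraphExec, stream);
cudaStreamSynchronize(stream);
```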
4.2.6.2 Device Launch

Device graphs can be launched from both the host and the device via cudaGraphLaunch(), which has the same signature on the device as on the host. Device graphs are launched via the same handle on the host and the device. Device graphs must be launched from another graph when launched from the device.

Device-side graph launch is per-thread and multiple launches may occur from different threads at the same time, so the user will need to select a single thread from which to launch a given graph.

Unlike host launch, device graphs cannot be launched into regular CUDA streams, and can only be launched into distinct named streams, which each denote a specific launch mode. The following table lists the available launch modes.
Table 9: Device-only Graph Launch Streams

| Stream | Launch Mode |
| --- | --- |
| cudaStreamGraphFireAndForget | Fire and forget launch |
| cudaStreamGraphTailLaunch | Tail launch |
| cudaStreamGraphFireAndForgetAsSibling | Sibling launch |
| | 200 | | | | Chapter4. | CUDAFeatures | | |
| CUDAProgrammingGuide,Release13.1 | |
4.2.6.2.1 Fire and Forget Launch

As the name suggests, a fire and forget launch is submitted to the GPU immediately, and it runs independently of the launching graph. In a fire-and-forget scenario, the launching graph is the parent, and the launched graph is the child.

Figure 31: Fire and forget launch

The above diagram can be generated by the sample code below:
```cpp
__global__ void launchFireAndForgetGraph(cudaGraphExec_t graph) {
    cudaGraphLaunch(graph, cudaStreamGraphFireAndForget);
}

void graphSetup() {
    cudaGraphExec_t gExec1, gExec2;
    cudaGraph_t g1, g2;

    // Create, instantiate, and upload the device graph.
    create_graph(&g2);
    cudaGraphInstantiate(&gExec2, g2, cudaGraphInstantiateFlagDeviceLaunch);
    cudaGraphUpload(gExec2, stream);

    // Create and instantiate the launching graph.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    launchFireAndForgetGraph<<<1, 1, 0, stream>>>(gExec2);
    cudaStreamEndCapture(stream, &g1);
    cudaGraphInstantiate(&gExec1, g1);

    // Launch the host graph, which will in turn launch the device graph.
    cudaGraphLaunch(gExec1, stream);
}
```
A graph can have up to 120 total fire-and-forget graphs during the course of its execution. This total resets between launches of the same parent graph.
4.2.6.2.1.1 Graph Execution Environments

In order to fully understand the device-side synchronization model, it is first necessary to understand the concept of an execution environment.

When a graph is launched from the device, it is launched into its own execution environment. The execution environment of a given graph encapsulates all work in the graph as well as all generated fire and forget work. The graph can be considered complete when it has completed execution and when all generated child work is complete.

The below diagram shows the environment encapsulation that would be generated by the fire-and-forget sample code in the previous section.

Figure 32: Fire and forget launch, with execution environments

These environments are also hierarchical, so a graph environment can include multiple levels of child environments from fire and forget launches.

When a graph is launched from the host, there exists a stream environment that parents the execution environment of the launched graph. The stream environment encapsulates all work generated as part of the overall launch. The stream launch is complete (i.e. downstream dependent work may now run) when the overall stream environment is marked as complete.
4.2.6.2.2 Tail Launch

Unlike on the host, it is not possible to synchronize with device graphs from the GPU via traditional methods such as cudaDeviceSynchronize() or cudaStreamSynchronize(). Rather, in order to enable serial work dependencies, a different launch mode - tail launch - is offered to provide similar functionality.
Figure 33: Nested fire-and-forget environments

Figure 34: The stream environment, visualized
A tail launch executes when a graph's environment is considered complete - i.e., when the graph and all its children are complete. When a graph completes, the environment of the next graph in the tail launch list will replace the completed environment as a child of the parent environment. Like fire-and-forget launches, a graph can have multiple graphs enqueued for tail launch.
Figure 35: A simple tail launch

The above execution flow can be generated by the code below:

__global__ void launchTailGraph(cudaGraphExec_t graph) {
    cudaGraphLaunch(graph, cudaStreamGraphTailLaunch);
}

void graphSetup() {
    cudaGraphExec_t gExec1, gExec2;
    cudaGraph_t g1, g2;

    // Create, instantiate, and upload the device graph.
    create_graph(&g2);
    cudaGraphInstantiate(&gExec2, g2, cudaGraphInstantiateFlagDeviceLaunch);
    cudaGraphUpload(gExec2, stream);

    // Create and instantiate the launching graph.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    launchTailGraph<<<1, 1, 0, stream>>>(gExec2);
    cudaStreamEndCapture(stream, &g1);
    cudaGraphInstantiate(&gExec1, g1);

    // Launch the host graph, which will in turn launch the device graph.
    cudaGraphLaunch(gExec1, stream);
}
Tail launches enqueued by a given graph will execute one at a time, in order of when they were enqueued. So the first enqueued graph will run first, then the second, and so on.

Tail launches enqueued by a tail graph will execute before tail launches enqueued by previous graphs in the tail launch list. These new tail launches will execute in the order they are enqueued.
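As an illustration of this ordering guarantee, the sketch below (hypothetical device code; gExecA and gExecB are assumed to be device-launchable executable graphs passed in by the host) enqueues two tail launches. gExecA's environment, including any child work it spawns, completes before gExecB begins:

```cuda
// Hypothetical sketch: enqueue two tail launches from a running device graph.
// They run one at a time, in the order they were enqueued.
__global__ void enqueueTwoTailLaunches(cudaGraphExec_t gExecA,
                                       cudaGraphExec_t gExecB) {
    cudaGraphLaunch(gExecA, cudaStreamGraphTailLaunch);  // runs first
    cudaGraphLaunch(gExecB, cudaStreamGraphTailLaunch);  // runs after gExecA
}
```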
Figure 36: Tail launch ordering

Figure 37: Tail launch ordering when enqueued from multiple graphs

A graph can have up to 255 pending tail launches.
4.2.6.2.2.1 Tail Self-launch

It is possible for a device graph to enqueue itself for a tail launch, although a given graph can only have one self-launch enqueued at a time. In order to query the currently running device graph so that it can be relaunched, a new device-side function is added:

cudaGraphExec_t cudaGetCurrentGraphExec();

This function returns the handle of the currently running graph if it is a device graph. If the currently executing kernel is not a node within a device graph, this function will return NULL.

Below is sample code showing usage of this function for a relaunch loop:
__device__ int relaunchCount = 0;

__global__ void relaunchSelf() {
    int relaunchMax = 100;

    if (threadIdx.x == 0) {
        if (relaunchCount < relaunchMax) {
            cudaGraphLaunch(cudaGetCurrentGraphExec(), cudaStreamGraphTailLaunch);
        }

        relaunchCount++;
    }
}
4.2.6.2.3 Sibling Launch

Sibling launch is a variation of fire-and-forget launch in which the graph is launched not as a child of the launching graph's execution environment, but rather as a child of the launching graph's parent environment. Sibling launch is equivalent to a fire-and-forget launch from the launching graph's parent environment.

Figure 38: A simple sibling launch

The above diagram can be generated by the sample code below:
__global__ void launchSiblingGraph(cudaGraphExec_t graph) {
    cudaGraphLaunch(graph, cudaStreamGraphFireAndForgetAsSibling);
}

void graphSetup() {
    cudaGraphExec_t gExec1, gExec2;
    cudaGraph_t g1, g2;

    // Create, instantiate, and upload the device graph.
    create_graph(&g2);
    cudaGraphInstantiate(&gExec2, g2, cudaGraphInstantiateFlagDeviceLaunch);
    cudaGraphUpload(gExec2, stream);

    // Create and instantiate the launching graph.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    launchSiblingGraph<<<1, 1, 0, stream>>>(gExec2);
    cudaStreamEndCapture(stream, &g1);
    cudaGraphInstantiate(&gExec1, g1);

    // Launch the host graph, which will in turn launch the device graph.
    cudaGraphLaunch(gExec1, stream);
}
Since sibling launches are not launched into the launching graph's execution environment, they will not gate tail launches enqueued by the launching graph.
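This difference can be sketched as follows (hypothetical device code; both parameters are assumed to be device-launchable executable graphs):

```cuda
// Hypothetical sketch: the sibling launch joins the parent environment, so the
// tail launch enqueued here does not wait for it.
__global__ void launchSiblingThenTail(cudaGraphExec_t sibling,
                                      cudaGraphExec_t tail) {
    // Launched as a child of this graph's parent environment.
    cudaGraphLaunch(sibling, cudaStreamGraphFireAndForgetAsSibling);
    // Waits only for this graph's own environment, which excludes the sibling.
    cudaGraphLaunch(tail, cudaStreamGraphTailLaunch);
}
```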
4.2.7. Using Graph APIs
cudaGraph_t objects are not thread-safe. It is the responsibility of the user to ensure that multiple threads do not concurrently access the same cudaGraph_t.

A cudaGraphExec_t cannot run concurrently with itself. A launch of a cudaGraphExec_t will be ordered after previous launches of the same executable graph.
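For example, in this sketch (assuming gExec is an instantiated executable graph and s1, s2 are distinct streams), the second launch does not begin until the first has completed:

```cuda
// Sketch: two launches of the same cudaGraphExec_t are serialized by CUDA,
// even when they are submitted to different streams.
void launchTwice(cudaGraphExec_t gExec, cudaStream_t s1, cudaStream_t s2) {
    cudaGraphLaunch(gExec, s1);
    cudaGraphLaunch(gExec, s2);  // ordered after the launch in s1
}
```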
Graph execution is done in streams for ordering with other asynchronous work. However, the stream is for ordering only; it does not constrain the internal parallelism of the graph, nor does it affect where graph nodes execute.

See Graph API.
4.2.8. CUDA User Objects

CUDA User Objects can be used to help manage the lifetime of resources used by asynchronous work in CUDA. In particular, this feature is useful for CUDA graphs and stream capture.

Various resource management schemes are not compatible with CUDA graphs. Consider for example an event-based pool or a synchronous-create, asynchronous-destroy scheme.
// Library API with pool allocation
void libraryWork(cudaStream_t stream) {
    auto &resource = pool.claimTemporaryResource();
    resource.waitOnReadyEventInStream(stream);
    launchWork(stream, resource);
    resource.recordReadyEvent(stream);
}
// Library API with asynchronous resource deletion
void libraryWork(cudaStream_t stream) {
    Resource *resource = new Resource(...);
    launchWork(stream, resource);
    cudaLaunchHostFunc(
        stream,
        [](void *resource) {
            delete static_cast<Resource *>(resource);
        },
        resource,
        0);
    // Error handling considerations not shown
}
These schemes are difficult with CUDA graphs because of the non-fixed pointer or handle for the resource, which requires indirection or graph update, and because of the synchronous CPU code needed each time the work is submitted. They also do not work with stream capture if these considerations are hidden from the caller of the library, and because of the use of disallowed APIs during capture. Various solutions exist, such as exposing the resource to the caller. CUDA user objects present another approach.
A CUDA user object associates a user-specified destructor callback with an internal refcount, similar to C++ shared_ptr. References may be owned by user code on the CPU and by CUDA graphs. Note that for user-owned references, unlike C++ smart pointers, there is no object representing the reference; users must track user-owned references manually. A typical use case would be to immediately move the sole user-owned reference to a CUDA graph after the user object is created.

When a reference is associated to a CUDA graph, CUDA will manage the graph operations automatically. A cloned cudaGraph_t retains a copy of every reference owned by the source cudaGraph_t, with the same multiplicity. An instantiated cudaGraphExec_t retains a copy of every reference in the source cudaGraph_t. When a cudaGraphExec_t is destroyed without being synchronized, the references are retained until the execution is completed.

Here is an example use.
cudaGraph_t graph;  // Preexisting graph

Object *object = new Object;  // C++ object with possibly nontrivial destructor
cudaUserObject_t cuObject;
cudaUserObjectCreate(
    &cuObject,
    object,  // Here we use a CUDA-provided template wrapper for this API,
             // which supplies a callback to delete the C++ object pointer
    1,  // Initial refcount
    cudaUserObjectNoDestructorSync  // Acknowledge that the callback cannot be
                                    // waited on via CUDA
);
cudaGraphRetainUserObject(
    graph,
    cuObject,
    1,  // Number of references
    cudaGraphUserObjectMove  // Transfer a reference owned by the caller (do
                             // not modify the total reference count)
);
// No more references owned by this thread; no need to call release API
cudaGraphExec_t graphExec;
cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);  // Will retain a
                                                               // new reference
cudaGraphDestroy(graph);  // graphExec still owns a reference
cudaGraphLaunch(graphExec, 0);  // Async launch has access to the user objects
cudaGraphExecDestroy(graphExec);  // Launch is not synchronized; the release
                                  // will be deferred if needed
cudaStreamSynchronize(0);  // After the launch is synchronized, the remaining
                           // reference is released and the destructor will
                           // execute. Note this happens asynchronously.
// If the destructor callback had signaled a synchronization object, it would
// be safe to wait on it at this point.
References owned by graphs in child graph nodes are associated to the child graphs, not the parents. If a child graph is updated or deleted, the references change accordingly. If an executable graph or child graph is updated with cudaGraphExecUpdate or cudaGraphExecChildGraphNodeSetParams, the references in the new source graph are cloned and replace the references in the target graph. In either case, if previous launches are not synchronized, any references which would be released are held until the launches have finished executing.

There is not currently a mechanism to wait on user object destructors via a CUDA API. Users may signal a synchronization object manually from the destructor code. In addition, it is not legal to call CUDA APIs from the destructor, similar to the restriction on cudaLaunchHostFunc. This is to avoid blocking a CUDA internal shared thread and preventing forward progress. It is legal to signal another thread to perform an API call, if the dependency is one way and the thread doing the call cannot block forward progress of CUDA work.
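The one-way signaling pattern can be sketched in plain C++ with the CUDA pieces elided; runSignalPattern, the promise, and the waiter thread are all illustrative stand-ins, not CUDA APIs:

```cpp
#include <future>
#include <string>
#include <thread>

// Sketch of the one-way signaling pattern described above: the destructor
// callback only fulfills a promise; a separate thread waits on the future and
// would then make any follow-up (e.g. CUDA) API calls. All names here are
// illustrative.
std::string runSignalPattern() {
    std::promise<void> destroyed;
    std::future<void> destroyedFuture = destroyed.get_future();
    std::string result;

    // Stand-in for the thread that performs follow-up work after destruction.
    std::thread waiter([&]() {
        destroyedFuture.wait();         // one-way dependency on the destructor
        result = "resources released";  // safe: waiter is joined before read
    });

    // Stand-in for the user object destructor callback: signal, never block.
    destroyed.set_value();

    waiter.join();
    return result;
}
```

The destructor side only signals and returns immediately, so it can never block CUDA's internal thread; the waiting thread carries the one-way dependency.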
User objects are created with cudaUserObjectCreate, which is a good starting point to browse related APIs.

4.3. Stream-Ordered Memory Allocator

4.3.1. Introduction
Managing memory allocations using cudaMalloc and cudaFree causes the GPU to synchronize across all executing CUDA streams. The stream-ordered memory allocator enables applications to order memory allocation and deallocation with other work launched into a CUDA stream, such as kernel launches and asynchronous copies. This improves application memory use by taking advantage of stream-ordering semantics to reuse memory allocations. The allocator also allows applications to control the allocator's memory caching behavior. When set up with an appropriate release threshold, the caching behavior allows the allocator to avoid expensive calls into the OS when the application indicates it is willing to accept a bigger memory footprint. The allocator also supports easy and secure allocation sharing between processes.
The Stream-Ordered Memory Allocator:

▶ Reduces the need for custom memory management abstractions, and makes it easier to create high-performance custom memory management for applications that need it.

▶ Enables multiple libraries to share a common memory pool managed by the driver. This can reduce excess memory consumption.
▶ Allows the driver to perform optimizations based on its awareness of the allocator and other stream management APIs.

Note

Nsight Compute and the Next-Gen CUDA debugger have been aware of the allocator since CUDA 11.3.
4.3.2. Memory Management

cudaMallocAsync and cudaFreeAsync are the APIs which enable stream-ordered memory management. cudaMallocAsync returns an allocation and cudaFreeAsync frees an allocation. Both APIs accept stream arguments to define when the allocation will become and stop being available for use. These functions allow memory operations to be tied to specific CUDA streams, enabling them to occur without blocking the host or other streams. Application performance can be improved by avoiding the potentially costly synchronization of cudaMalloc and cudaFree.

These APIs can be used for further performance optimization through memory pools, which manage and reuse large blocks of memory for more efficient allocation and deallocation. Memory pools help reduce overhead and prevent fragmentation, improving performance in scenarios with frequent memory allocation operations.
4.3.2.1 Allocating Memory

The cudaMallocAsync function triggers asynchronous memory allocation on the GPU, linked to a specific CUDA stream. cudaMallocAsync allows memory allocation to occur without hindering the host or other streams, eliminating the need for expensive synchronization.

Note

cudaMallocAsync ignores the current device/context when determining where the allocation will reside. Instead, cudaMallocAsync determines the appropriate device based on the specified memory pool or the supplied stream.

The listing below illustrates a fundamental use pattern: the memory is allocated, used, and then freed back into the same stream.
void *ptr;
size_t size = 512;
cudaMallocAsync(&ptr, size, cudaStreamPerThread);

// do work using the allocation
kernel<<<..., cudaStreamPerThread>>>(ptr, ...);

// An asynchronous free can be specified without synchronizing the CPU and GPU
cudaFreeAsync(ptr, cudaStreamPerThread);
Note

When accessing an allocation from a stream other than the stream that made the allocation, the user must guarantee that the access occurs after the allocation operation; otherwise, the behavior is undefined.
4.3.2.2 Freeing Memory

cudaFreeAsync() asynchronously frees device memory in a stream-ordered fashion, meaning the memory deallocation is assigned to a specific CUDA stream and does not block the host or other streams.

The user must guarantee that the free operation happens after the allocation operation and any uses of the allocation. Any use of the allocation after the free operation starts results in undefined behavior. Events and/or stream synchronizing operations should be used to guarantee any access to the allocation from other streams is complete before the free operation begins, as illustrated in the following example.
cudaMallocAsync(&ptr, size, stream1);
cudaEventRecord(event1, stream1);

// stream2 must wait for the allocation to be ready before accessing
cudaStreamWaitEvent(stream2, event1);
kernel<<<..., stream2>>>(ptr, ...);
cudaEventRecord(event2, stream2);

// stream3 must wait for stream2 to finish accessing the allocation before
// freeing the allocation
cudaStreamWaitEvent(stream3, event2);
cudaFreeAsync(ptr, stream3);
Memory allocated with cudaMalloc() can be freed with cudaFreeAsync(). As above, all accesses to the memory must be complete before the free operation begins.
cudaMalloc(&ptr, size);
kernel<<<..., stream>>>(ptr, ...);
cudaFreeAsync(ptr, stream);
Likewise, memory allocated with cudaMallocAsync can be freed with cudaFree(). When freeing such allocations through the cudaFree() API, the driver assumes that all accesses to the allocation are complete and performs no further synchronization. The user can use cudaStreamQuery / cudaStreamSynchronize / cudaEventQuery / cudaEventSynchronize / cudaDeviceSynchronize to guarantee that the appropriate asynchronous work is complete and that the GPU will not try to access the allocation.
cudaMallocAsync(&ptr, size, stream);
kernel<<<..., stream>>>(ptr, ...);

// synchronize is needed to avoid prematurely freeing the memory
cudaStreamSynchronize(stream);
cudaFree(ptr);
4.3.3. Memory Pools

Memory pools encapsulate virtual address and physical memory resources that are allocated and managed according to the pool's attributes and properties. The primary aspect of a memory pool is the kind and location of memory it manages.
All calls to cudaMallocAsync use resources from a memory pool. If a memory pool is not specified, cudaMallocAsync uses the current memory pool of the supplied stream's device. The current memory pool for a device may be set with cudaDeviceSetMemPool and queried with cudaDeviceGetMemPool. Each device has a default memory pool, which is active if cudaDeviceSetMemPool has not been called.
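As an illustrative sketch (error checking omitted; the pool created here is hypothetical), the current pool of device 0 could be replaced and queried back like this:

```cuda
// Sketch: create an explicit pool, make it the current pool for device 0, and
// query it back. Subsequent cudaMallocAsync calls on device 0 streams that do
// not name a pool will then draw from newPool.
void installCurrentPool() {
    cudaMemPool_t newPool, currentPool;
    cudaMemPoolProps poolProps = { };
    poolProps.allocType = cudaMemAllocationTypePinned;
    poolProps.location.type = cudaMemLocationTypeDevice;
    poolProps.location.id = 0;
    cudaMemPoolCreate(&newPool, &poolProps);

    cudaDeviceSetMemPool(0, newPool);
    cudaDeviceGetMemPool(&currentPool, 0);  // currentPool now equals newPool
}
```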
The API cudaMallocFromPoolAsync and the C++ overloads of cudaMallocAsync allow a user to specify the pool to be used for an allocation without setting it as the current pool. The APIs cudaDeviceGetDefaultMemPool and cudaMemPoolCreate return handles to memory pools. cudaMemPoolSetAttribute and cudaMemPoolGetAttribute control the attributes of memory pools.
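As a sketch of these APIs together (memPool and stream are assumed to already exist; the 64 MiB threshold is an arbitrary example value):

```cuda
// Sketch: raise the pool's release threshold so freed memory stays cached in
// the pool, then allocate from the pool directly without making it current.
void allocateFromPool(cudaMemPool_t memPool, cudaStream_t stream) {
    cuuint64_t threshold = 64ULL * 1024 * 1024;  // keep up to 64 MiB cached
    cudaMemPoolSetAttribute(memPool, cudaMemPoolAttrReleaseThreshold,
                            &threshold);

    void *ptr;
    cudaMallocFromPoolAsync(&ptr, 1024, memPool, stream);
    // ... launch work using ptr into stream ...
    cudaFreeAsync(ptr, stream);  // the memory returns to memPool's cache
}
```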
Note

The mempool current to a device will be local to that device. So allocating without specifying a memory pool will always yield an allocation local to the stream's device.

4.3.3.1 Default/Implicit Pools
The default memory pool of a device can be retrieved by calling cudaDeviceGetDefaultMemPool. Allocations from the default memory pool of a device are non-migratable device allocations located on that device. These allocations will always be accessible from that device. The accessibility of the default memory pool can be modified with cudaMemPoolSetAccess and queried with cudaMemPoolGetAccess. Since the default pools do not need to be explicitly created, they are sometimes referred to as implicit pools. The default memory pool of a device does not support IPC.
4.3.3.2 Explicit Pools

cudaMemPoolCreate creates an explicit pool. This allows applications to request properties for their allocations beyond what is provided by the default/implicit pools. These include properties such as IPC capability, maximum pool size, allocations resident on a specific CPU NUMA node on supported platforms, etc.
// create a pool similar to the implicit pool on device 0
int device = 0;
cudaMemPoolProps poolProps = { };
poolProps.allocType = cudaMemAllocationTypePinned;
poolProps.location.id = device;
poolProps.location.type = cudaMemLocationTypeDevice;
cudaMemPoolCreate(&memPool, &poolProps);
The following code snippet illustrates an example of creating an IPC-capable memory pool on a valid CPU NUMA node.

// create a pool resident on a CPU NUMA node that is capable of IPC sharing
// (via a file descriptor).
int cpu_numa_id = 0;
cudaMemPoolProps poolProps = { };
poolProps.allocType = cudaMemAllocationTypePinned;
poolProps.location.id = cpu_numa_id;
poolProps.location.type = cudaMemLocationTypeHostNuma;
poolProps.handleTypes = cudaMemHandleTypePosixFileDescriptor;
cudaMemPoolCreate(&ipcMemPool, &poolProps);
4.3.3.3 Device Accessibility for Multi-GPU Support

Like allocation accessibility controlled through the virtual memory management APIs, memory pool allocation accessibility does not follow cudaDeviceEnablePeerAccess or cuCtxEnablePeerAccess. For memory pools, the cudaMemPoolSetAccess API modifies which devices can access allocations from a pool. By default, allocations are accessible only from the device where the allocations are located. This access cannot be revoked. To enable access from other devices, the accessing device must be peer capable with the memory pool's device. This can be verified with cudaDeviceCanAccessPeer. If the peer capability is not checked, the set access may fail with cudaErrorInvalidDevice. However, if no allocations have been made from the pool, the cudaMemPoolSetAccess call may succeed even when the devices are not peer capable. In this case, the next allocation from the pool will fail.

It is worth noting that cudaMemPoolSetAccess affects all allocations from the memory pool, not just future ones. Likewise, the accessibility reported by cudaMemPoolGetAccess applies to all allocations from the pool, not just future ones. Changing the accessibility settings of a pool for a given GPU frequently is not recommended. That is, once a pool is made accessible from a given GPU, it should remain accessible from that GPU for the lifetime of the pool.
// snippet showing usage of cudaMemPoolSetAccess:
cudaError_t setAccessOnDevice(cudaMemPool_t memPool, int residentDevice,
                              int accessingDevice) {
    cudaMemAccessDesc accessDesc = {};
    accessDesc.location.type = cudaMemLocationTypeDevice;
    accessDesc.location.id = accessingDevice;
    accessDesc.flags = cudaMemAccessFlagsProtReadWrite;

    int canAccess = 0;
    cudaError_t error = cudaDeviceCanAccessPeer(&canAccess, accessingDevice,
                                                residentDevice);
    if (error != cudaSuccess) {
        return error;
    } else if (canAccess == 0) {
        return cudaErrorPeerAccessUnsupported;
    }

    // Make the address accessible
    return cudaMemPoolSetAccess(memPool, &accessDesc, 1);
}
4.3.3.4 Enabling Memory Pools for IPC

Memory pools can be enabled for interprocess communication (IPC) to allow easy, efficient and secure sharing of GPU memory between processes. CUDA's IPC memory pools provide the same security benefits as CUDA's virtual memory management APIs.

There are two steps to sharing memory between processes with memory pools: the processes first need to share access to the pool, then share specific allocations from that pool. The first step establishes and enforces security. The second step coordinates what virtual addresses are used in each process and when mappings need to be valid in the importing process.
4.3.3.4.1 Creating and Sharing IPC Memory Pools

Sharing access to a pool involves retrieving an OS-native handle to the pool with cudaMemPoolExportToShareableHandle(), transferring the handle to the importing process using OS-native IPC mechanisms, and then creating an imported memory pool with the cudaMemPoolImportFromShareableHandle() API. For cudaMemPoolExportToShareableHandle to succeed, the memory pool must have been created with the requested handle type specified in the pool properties structure.

Please reference the samples for the appropriate IPC mechanisms to transfer the OS-native handle between processes. The rest of the procedure can be found in the following code snippets.
// in exporting process
// create an exportable IPC capable pool on device 0
cudaMemPoolProps poolProps = { };
poolProps.allocType = cudaMemAllocationTypePinned;
poolProps.location.id = 0;
poolProps.location.type = cudaMemLocationTypeDevice;

// Setting handleTypes to a non-zero value will make the pool exportable (IPC
// capable)
poolProps.handleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;
cudaMemPoolCreate(&memPool, &poolProps);

// FD based handles are integer types
int fdHandle = 0;

// Retrieve an OS native handle to the pool.
// Note that a pointer to the handle memory is passed in here.
cudaMemPoolExportToShareableHandle(&fdHandle,
                                   memPool,
                                   CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR,
                                   0);

// The handle must be sent to the importing process with the appropriate
// OS-specific APIs.

// in importing process
int fdHandle;
// The handle needs to be retrieved from the exporting process with the
// appropriate OS-specific APIs.

// Create an imported pool from the shareable handle.
// Note that the handle is passed by value here.
cudaMemPoolImportFromShareableHandle(&importedMemPool,
                                     (void*)fdHandle,
                                     CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR,
                                     0);
4.3.3.4.2 Set Access in the Importing Process

Imported memory pools are initially only accessible from their resident device. The imported memory pool does not inherit any accessibility set by the exporting process. The importing process needs to enable access with cudaMemPoolSetAccess from any GPU it plans to access the memory from.

If the imported memory pool belongs to a device that is not visible to the importing process, the user must use the cudaMemPoolSetAccess API to enable access from the GPUs the allocations will be used on. (See Device Accessibility for Multi-GPU Support.)
| | 4.3.3.4.3 | CreatingandSharingAllocationsfromanExportedPool | | | | | | | | | |
| | --------- | ----------------------------------------------- | --- | --- | --- | --- | --- | --- | --- | | |
Once the pool has been shared, allocations made with cudaMallocAsync() from the pool in the exporting process can be shared with processes that have imported the pool. Since the pool's security policy is established and verified at the pool level, the OS does not need extra bookkeeping to provide security for specific pool allocations. In other words, the opaque cudaMemPoolPtrExportData required to import a pool allocation may be sent to the importing process using any mechanism.

While allocations may be exported and imported without synchronizing with the allocating stream in any way, the importing process must follow the same rules as the exporting process when accessing the allocation. Specifically, access to the allocation must happen after the allocation operation in the allocating stream executes. The two following code snippets show cudaMemPoolExportPointer() and cudaMemPoolImportPointer() sharing the allocation, with an IPC event used to guarantee that the allocation isn't accessed in the importing process before the allocation is ready.
// preparing an allocation in the exporting process
cudaMemPoolPtrExportData exportData;
cudaEvent_t readyIpcEvent;
cudaIpcEventHandle_t readyIpcEventHandle;

// ipc event for coordinating between processes
// cudaEventInterprocess flag makes the event an ipc event
// cudaEventDisableTiming is set for performance reasons
cudaEventCreate(&readyIpcEvent,
                cudaEventDisableTiming | cudaEventInterprocess);

// allocate from the exporting mem pool
cudaMallocAsync(&ptr, size, exportMemPool, stream);

// event for sharing when the allocation is ready.
cudaEventRecord(readyIpcEvent, stream);
cudaMemPoolExportPointer(&exportData, ptr);
cudaIpcGetEventHandle(&readyIpcEventHandle, readyIpcEvent);

// Share IPC event and pointer export data with the importing process using
// any mechanism. Here we copy the data into shared memory.
shmem->ptrData = exportData;
shmem->readyIpcEventHandle = readyIpcEventHandle;
// signal consumers data is ready

// Importing an allocation
cudaMemPoolPtrExportData *importData = &shmem->ptrData;
cudaEvent_t readyIpcEvent;
cudaIpcEventHandle_t *readyIpcEventHandle = &shmem->readyIpcEventHandle;

// Need to retrieve the ipc event handle and the export data from the
// exporting process using any mechanism. Here we are using shmem and just
// need synchronization to make sure the shared memory is filled in.
cudaIpcOpenEventHandle(&readyIpcEvent, *readyIpcEventHandle);

// import the allocation. The operation does not block on the allocation being
// ready.
cudaMemPoolImportPointer(&ptr, importedMemPool, importData);

// Wait for the prior stream operations in the allocating stream to complete
// before using the allocation in the importing process.
cudaStreamWaitEvent(stream, readyIpcEvent);
kernel<<<..., stream>>>(ptr, ...);
When freeing the allocation, the allocation must be freed in the importing process before it is freed in the exporting process. The following code snippet demonstrates the use of CUDA IPC events to provide the required synchronization between the cudaFreeAsync operations in both processes. Access to the allocation from the importing process is naturally restricted to before the free operation on the importing process side. It is worth noting that cudaFree can be used to free the allocation in both processes, and that other stream synchronization APIs may be used instead of CUDA IPC events.
// The free must happen in the importing process before the exporting process

// Importing process
kernel<<<..., stream>>>(ptr, ...);
// Last access in importing process
cudaFreeAsync(ptr, stream);
// Access not allowed in the importing process after the free
cudaEventRecord(finishedIpcEvent, stream);

// Exporting process
// The exporting process needs to coordinate its free with the stream order
// of the importing process's free.
cudaStreamWaitEvent(stream, finishedIpcEvent);
kernel<<<..., stream>>>(ptrInExportingProcess, ...);
// The free in the importing process doesn't stop the exporting process
// from using the allocation.
cudaFreeAsync(ptrInExportingProcess, stream);
| | 4.3.3.4.4 | IPCExportPoolLimitations | | | | | | | |
| | --------- | ------------------------ | --- | --- | --- | --- | --- | | |
IPC pools currently do not support releasing physical blocks back to the OS. As a result, the cudaMemPoolTrimTo API has no effect and cudaMemPoolAttrReleaseThreshold is effectively ignored. This behavior is controlled by the driver, not the runtime, and may change in a future driver update.
| | 216 | | | | | Chapter4. | CUDAFeatures | | |
| | --- | --- | --- | --- | --- | --------- | ------------ | | |
| CUDAProgrammingGuide,Release13.1 | |
| | 4.3.3.4.5 | IPCImportPoolLimitations | | | | | |
| | --------- | ------------------------ | --- | --- | --- | | |
Allocating from an import pool is not allowed; specifically, import pools cannot be set current and cannot be used in the cudaMallocFromPoolAsync API. As such, the allocation reuse policy attributes have no meaning for these pools.

IPC import pools, like IPC export pools, currently do not support releasing physical blocks back to the OS.

The resource usage stat attribute queries only reflect the allocations imported into the process and the associated physical memory.
| | 4.3.4. | Best | Practices | and Tuning | | | |
| | ------ | ---- | --------- | ---------- | --- | | |
| 4.3.4.1 QueryforSupport | |
An application can determine whether or not a device supports the stream-ordered memory allocator by calling cudaDeviceGetAttribute() (see the developer blog) with the device attribute cudaDevAttrMemoryPoolsSupported.

IPC memory pool support can be queried with the cudaDevAttrMemoryPoolSupportedHandleTypes device attribute. This attribute was added in CUDA 11.3, and older drivers will return cudaErrorInvalidValue when this attribute is queried.
int driverVersion = 0;
int deviceSupportsMemoryPools = 0;
int poolSupportedHandleTypes = 0;
cudaDriverGetVersion(&driverVersion);
if (driverVersion >= 11020) {
    cudaDeviceGetAttribute(&deviceSupportsMemoryPools,
                           cudaDevAttrMemoryPoolsSupported, device);
}
if (deviceSupportsMemoryPools != 0) {
    // `device` supports the Stream-Ordered Memory Allocator
}

if (driverVersion >= 11030) {
    cudaDeviceGetAttribute(&poolSupportedHandleTypes,
                           cudaDevAttrMemoryPoolSupportedHandleTypes, device);
}
if (poolSupportedHandleTypes & cudaMemHandleTypePosixFileDescriptor) {
    // Pools on the specified device can be created with posix file
    // descriptor-based IPC
}
Performing the driver version check before the query avoids hitting a cudaErrorInvalidValue error on drivers where the attribute was not yet defined. One can use cudaGetLastError to clear the error instead of avoiding it.
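As a sketch of the second approach, an application could query the attribute unconditionally and clear the sticky error state afterwards (the `device` variable is assumed to hold a valid device ordinal):

```cpp
// Sketch: query unconditionally, then clear the error on older drivers.
int poolSupportedHandleTypes = 0;
cudaError_t status = cudaDeviceGetAttribute(&poolSupportedHandleTypes,
                                            cudaDevAttrMemoryPoolSupportedHandleTypes,
                                            device);
if (status == cudaErrorInvalidValue) {
    // Older driver: the attribute does not exist. Clear the error so it
    // does not surface in later cudaGetLastError() checks.
    (void)cudaGetLastError();
    poolSupportedHandleTypes = 0;  // treat as "no IPC handle types supported"
}
```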
4.3.4.2 Physical Page Caching Behavior
By default, the allocator tries to minimize the physical memory owned by a pool. To minimize the OS calls to allocate and free physical memory, applications must configure a memory footprint for each pool. Applications can do this with the release threshold attribute (cudaMemPoolAttrReleaseThreshold).
The release threshold is the amount of memory in bytes a pool should hold onto before trying to release memory back to the OS. When more than the release threshold bytes of memory are held by the memory pool, the allocator will try to release memory back to the OS on the next call to stream, event, or device synchronize. Setting the release threshold to UINT64_MAX will prevent the driver from attempting to shrink the pool after every synchronization.
cuuint64_t setVal = UINT64_MAX;
cudaMemPoolSetAttribute(memPool, cudaMemPoolAttrReleaseThreshold, &setVal);
Applications that set cudaMemPoolAttrReleaseThreshold high enough to effectively disable memory pool shrinking may wish to explicitly shrink a memory pool's memory footprint. cudaMemPoolTrimTo allows applications to do so. When trimming a memory pool's footprint, the minBytesToKeep parameter allows an application to hold onto a specified amount of memory, for example the amount it expects to need in a subsequent phase of execution.
cuuint64_t setVal = UINT64_MAX;
cudaMemPoolSetAttribute(memPool, cudaMemPoolAttrReleaseThreshold, &setVal);

// application phase needing a lot of memory from the stream-ordered allocator
for (i = 0; i < 10; i++) {
    for (j = 0; j < 10; j++) {
        cudaMallocAsync(&ptrs[j], size[j], stream);
    }
    kernel<<<..., stream>>>(ptrs, ...);
    for (j = 0; j < 10; j++) {
        cudaFreeAsync(ptrs[j], stream);
    }
}

// Process does not need as much memory for the next phase.
// Synchronize so that the trim operation will know that the allocations are no
// longer in use.
cudaStreamSynchronize(stream);
cudaMemPoolTrimTo(memPool, 0);

// Some other process/allocation mechanism can now use the physical memory
// released by the trimming operation.
4.3.4.3 Resource Usage Statistics
Querying the cudaMemPoolAttrReservedMemCurrent attribute of a pool reports the current total physical GPU memory consumed by the pool. Querying the cudaMemPoolAttrUsedMemCurrent of a pool returns the total size of all of the memory allocated from the pool and not available for reuse.

The cudaMemPoolAttr*MemHigh attributes are watermarks recording the max value achieved by the respective cudaMemPoolAttr*MemCurrent attribute since the last reset. They can be reset to the current value by using the cudaMemPoolSetAttribute API.
// sample helper functions for getting the usage statistics in bulk
struct usageStatistics {
    cuuint64_t reserved;
    cuuint64_t reservedHigh;
    cuuint64_t used;
    cuuint64_t usedHigh;
};

void getUsageStatistics(cudaMemPool_t memPool, struct usageStatistics *statistics)
{
    cudaMemPoolGetAttribute(memPool, cudaMemPoolAttrReservedMemCurrent,
                            &statistics->reserved);
    cudaMemPoolGetAttribute(memPool, cudaMemPoolAttrReservedMemHigh,
                            &statistics->reservedHigh);
    cudaMemPoolGetAttribute(memPool, cudaMemPoolAttrUsedMemCurrent,
                            &statistics->used);
    cudaMemPoolGetAttribute(memPool, cudaMemPoolAttrUsedMemHigh,
                            &statistics->usedHigh);
}

// resetting the watermarks will make them take on the current value.
void resetStatistics(cudaMemPool_t memPool)
{
    cuuint64_t value = 0;
    cudaMemPoolSetAttribute(memPool, cudaMemPoolAttrReservedMemHigh, &value);
    cudaMemPoolSetAttribute(memPool, cudaMemPoolAttrUsedMemHigh, &value);
}
4.3.4.4 Memory Reuse Policies
In order to service an allocation request, the driver attempts to reuse memory that was previously freed via cudaFreeAsync() before attempting to allocate more memory from the OS. For example, memory freed in a stream can be reused immediately in a subsequent allocation request on the same stream. When a stream is synchronized with the CPU, the memory that was previously freed in that stream becomes available for reuse for an allocation in any stream. Reuse policies can be applied to both default and explicit memory pools.

The stream-ordered allocator has a few controllable allocation policies. The pool attributes cudaMemPoolReuseFollowEventDependencies, cudaMemPoolReuseAllowOpportunistic, and cudaMemPoolReuseAllowInternalDependencies control these policies and are detailed below. These policies can be enabled or disabled through a call to cudaMemPoolSetAttribute. Upgrading to a newer CUDA driver may change, enhance, augment, and/or reorder these reuse policies.
| 4.3.4.4.1 cudaMemPoolReuseFollowEventDependencies | |
Before allocating more physical GPU memory, the allocator examines dependency information established by CUDA events and tries to allocate from memory freed in another stream.
cudaMallocAsync(&ptr, size, originalStream);
kernel<<<..., originalStream>>>(ptr, ...);
cudaFreeAsync(ptr, originalStream);
cudaEventRecord(event, originalStream);

// waiting on the event that captures the free in another stream
// allows the allocator to reuse the memory to satisfy
// a new allocation request in the other stream when
// cudaMemPoolReuseFollowEventDependencies is enabled.
cudaStreamWaitEvent(otherStream, event);
cudaMallocAsync(&ptr2, size, otherStream);
| | 4.3.4.4.2 | cudaMemPoolReuseAllowOpportunistic | | | | | | | |
When the cudaMemPoolReuseAllowOpportunistic policy is enabled, the allocator examines freed allocations to see if the free operation's stream order semantic has been met, for example the stream has passed the point of execution indicated by the free operation. When this policy is disabled, the allocator will still reuse memory made available when a stream is synchronized with the CPU. Disabling this policy does not stop cudaMemPoolReuseFollowEventDependencies from applying.
cudaMallocAsync(&ptr, size, originalStream);
kernel<<<..., originalStream>>>(ptr, ...);
cudaFreeAsync(ptr, originalStream);

// after some time, the kernel finishes running
wait(10);

// When cudaMemPoolReuseAllowOpportunistic is enabled this allocation request
// can be fulfilled with the prior allocation based on the progress of
// originalStream.
cudaMallocAsync(&ptr2, size, otherStream);
| | 4.3.4.4.3 | cudaMemPoolReuseAllowInternalDependencies | | | | | | | |
If the driver fails to allocate and map more physical memory from the OS, it will look for memory whose availability depends on another stream's pending progress. If such memory is found, the driver will insert the required dependency into the allocating stream and reuse the memory.
cudaMallocAsync(&ptr, size, originalStream);
kernel<<<..., originalStream>>>(ptr, ...);
cudaFreeAsync(ptr, originalStream);

// When cudaMemPoolReuseAllowInternalDependencies is enabled
// and the driver fails to allocate more physical memory, the driver may
// effectively perform a cudaStreamWaitEvent in the allocating stream
// to make sure that future work in 'otherStream' happens after the work
// in the original stream that would be allowed to access the original
// allocation.
cudaMallocAsync(&ptr2, size, otherStream);
| | 4.3.4.4.4 | DisablingReusePolicies | | | | | | | |
While the controllable reuse policies improve memory reuse, users may want to disable them. Allowing opportunistic reuse (such as cudaMemPoolReuseAllowOpportunistic) introduces run-to-run variance in allocation patterns based on the interleaving of CPU and GPU execution. Internal dependency insertion (such as cudaMemPoolReuseAllowInternalDependencies) can serialize work in unexpected and potentially non-deterministic ways when the user would rather explicitly synchronize on an event or stream on allocation failure.
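A minimal sketch of disabling both of these policies on a pool (the `memPool` handle is assumed), leaving only event-dependency-driven and synchronization-driven reuse:

```cpp
// Sketch: memPool is an assumed, previously created pool handle.
int disable = 0;
cudaMemPoolSetAttribute(memPool, cudaMemPoolReuseAllowOpportunistic, &disable);
cudaMemPoolSetAttribute(memPool, cudaMemPoolReuseAllowInternalDependencies, &disable);
```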
4.3.4.5 Synchronization API Actions
One of the optimizations that comes with the allocator being part of the CUDA driver is integration with the synchronization APIs. When the user requests that the CUDA driver synchronize, the driver waits for asynchronous work to complete. Before returning, the driver determines which frees the synchronization guaranteed to have completed. These allocations are made available for reuse regardless of the specified stream or disabled allocation policies. The driver also checks cudaMemPoolAttrReleaseThreshold here and releases any excess physical memory that it can.
4.3.5. Addendums

4.3.5.1 cudaMemcpyAsync Current Context/Device Sensitivity
In the current CUDA driver, any async memcpy involving memory from cudaMallocAsync should be done using the specified stream's context as the calling thread's current context. This is not necessary for cudaMemcpyPeerAsync, as the device primary contexts specified in the API are referenced instead of the current context.
4.3.5.2 cudaPointerGetAttributes Query
Invoking cudaPointerGetAttributes on an allocation after invoking cudaFreeAsync on it results in undefined behavior. Specifically, it does not matter if the allocation is still accessible from a given stream: the behavior is still undefined.
4.3.5.3 cudaGraphAddMemsetNode

cudaGraphAddMemsetNode does not work with memory allocated via the stream-ordered allocator. However, memsets of the allocations can be stream captured.
4.3.5.4 Pointer Attributes

The cudaPointerGetAttributes query works on stream-ordered allocations. Since stream-ordered allocations are not associated with a context, querying CU_POINTER_ATTRIBUTE_CONTEXT will succeed but return NULL in *data. The attribute CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL can be used to determine the location of the allocation: this can be useful when selecting a context for making p2h2p copies using cudaMemcpyPeerAsync. The attribute CU_POINTER_ATTRIBUTE_MEMPOOL_HANDLE was added in CUDA 11.3 and can be useful for debugging and for confirming which pool an allocation comes from before doing IPC.
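As a sketch using the driver API's cuPointerGetAttribute (assuming `ptr` is a stream-ordered allocation), the owning pool and the device ordinal can be queried like this:

```cpp
// Sketch: ptr is assumed to come from cudaMallocAsync.
CUmemoryPool pool;
int deviceOrdinal;
// Which pool does this allocation come from?
cuPointerGetAttribute(&pool, CU_POINTER_ATTRIBUTE_MEMPOOL_HANDLE, (CUdeviceptr)ptr);
// On which device does the allocation reside?
cuPointerGetAttribute(&deviceOrdinal, CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL, (CUdeviceptr)ptr);
```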
4.3.5.5 CPU Virtual Memory

When using CUDA stream-ordered memory allocator APIs, avoid restricting the process's virtual address space with “ulimit -v”, as this is not supported.
4.4. Cooperative Groups
| | 4.4.1. | Introduction | | | | | | | |
| | ------ | ------------ | --- | --- | --- | --- | --- | | |
Cooperative Groups are an extension to the CUDA programming model for organizing groups of collaborating threads. Cooperative Groups allow developers to control the granularity at which threads are collaborating, helping them to express richer, more efficient parallel decompositions. Cooperative Groups also provide implementations of common parallel primitives like scan and parallel reduce.
Historically, the CUDA programming model has provided a single, simple construct for synchronizing cooperating threads: a barrier across all threads of a thread block, as implemented with the __syncthreads() intrinsic function. In an effort to express broader patterns of parallel interaction, many performance-oriented programmers have resorted to writing their own ad hoc and unsafe primitives for synchronizing threads within a single warp, or across sets of thread blocks running on a single GPU. Whilst the performance improvements achieved have often been valuable, this has resulted in an ever-growing collection of brittle code that is expensive to write, tune, and maintain over time and across GPU generations. Cooperative Groups provides a safe and future-proof mechanism for writing performant code.

The full Cooperative Groups API is available in the Cooperative Groups API.
| | 4.4.2. | Cooperative | Group | Handle | & Member | Functions | | | |
| | ------ | ----------- | ----- | ------ | -------- | --------- | --- | | |
Cooperative Groups are managed via a Cooperative Group handle. The Cooperative Group handle allows participating threads to learn their position in the group, the group size, and other group information. Select group member functions are shown in the following table.
Table 10: Select Member Functions

| Accessor | Returns |
| -------- | ------- |
| thread_rank() | The rank of the calling thread. |
| num_threads() | The total number of threads in the group. |
| thread_index() | A 3-dimensional index of the thread within the launched block. |
| dim_threads() | The 3D dimensions of the launched block in units of threads. |

A complete list of member functions is available in the Cooperative Groups API.
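A sketch of a kernel using these accessors to build a global index (the array `data` and length `n` are assumed names):

```cpp
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Sketch: scale each element of an assumed array `data` of length `n`.
__global__ void scale(float *data, int n, float factor) {
    cg::thread_block block = cg::this_thread_block();
    // Build a global index from the block's size and the thread's rank.
    int idx = blockIdx.x * block.num_threads() + block.thread_rank();
    if (idx < n) {
        data[idx] *= factor;
    }
}
```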
| | 4.4.3. | Default | Behavior | / Groupless | Execution | | | | |
| | ------ | ------- | -------- | ----------- | --------- | --- | --- | | |
Groups representing the grid and thread blocks are implicitly created based on the kernel launch configuration. These “implicit” groups provide a starting point that developers can explicitly decompose into finer-grained groups. Implicit groups can be accessed using the following methods:
| | 222 | | | | | Chapter4. | CUDAFeatures | | |
| | --- | --- | --- | --- | --- | --------- | ------------ | | |
| CUDAProgrammingGuide,Release13.1 | |
Table 11: Cooperative Groups Implicitly Created by CUDA Runtime

| Accessor | Group Scope |
| -------- | ----------- |
| this_thread_block() | Returns the handle to a group containing all threads in the current thread block. |
| this_grid() | Returns the handle to a group containing all threads in the grid. |
| coalesced_threads()1 | Returns the handle to a group of currently active threads in a warp. |
| this_cluster()2 | Returns the handle to a group of threads in the current cluster. |

More information is available in the Cooperative Groups API.
4.4.3.1 Create Implicit Group Handles As Early As Possible

For best performance it is recommended that you create a handle for the implicit group upfront (as early as possible, before any branching has occurred) and use that handle throughout the kernel.
4.4.3.2 Only Pass Group Handles by Reference

When passing a group handle into a function, it is recommended that you pass it by reference. Group handles must be initialized at declaration time, as there is no default constructor. Copy-constructing group handles is discouraged.
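For example, a device function that takes the handle by reference rather than by value might look like this sketch (`sum_into` and its arguments are hypothetical names):

```cpp
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Sketch: taking the handle by reference avoids copy-constructing the group.
__device__ void sum_into(cg::thread_block &block, float *shared_acc, float val) {
    atomicAdd(shared_acc, val);
    block.sync();  // every thread in the block participates in the barrier
}
```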
| 4.4.4. Creating Cooperative Groups | |
Groups are created by partitioning a parent group into subgroups. When a group is partitioned, a group handle is created to manage the resulting subgroup. The following partitioning operations are available to developers:
Table 12: Cooperative Group Partitioning Operations

| Partition Type | Description |
| -------------- | ----------- |
| tiled_partition | Divides parent group into a series of fixed-size subgroups arranged in a one-dimensional, row-major format. |
| stride_partition | Divides parent group into equally-sized subgroups where threads are assigned to subgroups in a round-robin manner. |
| labeled_partition | Divides parent group into one-dimensional subgroups based on a conditional label, which can be any integral type. |
| binary_partition | Specialized form of labeled partitioning where the label can only be “0” or “1”. |
The following example shows how a tiled partition is created:
1 The coalesced_threads() operator returns the set of active threads at that point in time, and makes no guarantee about which threads are returned (as long as they are active) or that they will stay coalesced throughout execution.
2 this_cluster() assumes a 1x1x1 cluster when a non-cluster grid is launched. Requires Compute Capability 9.0 or greater.
namespace cg = cooperative_groups;
// Obtain the current thread's cooperative group
cg::thread_block my_group = cg::this_thread_block();
// Partition the cooperative group into tiles of size 8
cg::thread_block_tile<8> my_subgroup = cg::tiled_partition<8>(my_group);
// do work as my_subgroup
The best partitioning strategy to use depends on the context. More information is available in the Cooperative Groups API.
4.4.4.1 Avoiding Group Creation Hazards

Partitioning a group is a collective operation and all threads in the group must participate. If the group was created in a conditional branch that not all threads reach, this can lead to deadlocks or data corruption.
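A sketch of this hazard and its fix: create the subgroup unconditionally, then branch.

```cpp
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void safe_partition_kernel() {
    cg::thread_block block = cg::this_thread_block();

    // WRONG (sketch): partitioning inside a divergent branch can deadlock,
    // because tiled_partition is collective over `block`:
    // if (block.thread_rank() < 16) {
    //     auto tile = cg::tiled_partition<8>(block);  // not all threads reach this
    // }

    // RIGHT: all threads create the subgroup first, then branch.
    cg::thread_block_tile<8> tile = cg::tiled_partition<8>(block);
    if (block.thread_rank() < 16) {
        // The first two tiles lie entirely inside this branch, so their
        // members can safely synchronize with each other.
        tile.sync();
    }
}
```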
| | 4.4.5. | Synchronization | | | | | | | | |
| | ------ | --------------- | --- | --- | --- | --- | --- | --- | | |
Prior to the introduction of Cooperative Groups, the CUDA programming model only allowed synchronization between thread blocks at a kernel completion boundary. Cooperative Groups allows developers to synchronize groups of cooperating threads at different granularities.
4.4.5.1 Sync

You can synchronize a group by calling the collective sync() function. Like __syncthreads(), the sync() function makes the following guarantees:

- All memory accesses (e.g., reads and writes) made by threads in the group before the synchronization point are visible to all threads in the group after the synchronization point.
- All threads in the group reach the synchronization point before any thread is allowed to proceed beyond it.

The following example shows a cooperative_groups::sync() that is equivalent to __syncthreads().
namespace cg = cooperative_groups;
cg::thread_block my_group = cg::this_thread_block();
// Synchronize threads in the block
cg::sync(my_group);
Cooperative Groups can be used to synchronize the entire grid. As of CUDA 13, Cooperative Groups can no longer be used for multi-device synchronization. For details see the Large Scale Groups section.

More information about synchronization is available in the Cooperative Groups API.
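A sketch of grid-wide synchronization: the kernel must be launched with cudaLaunchCooperativeKernel, and occupancy must allow all blocks to be resident at once. The `produce`/`consume` helpers are hypothetical stand-ins for application work.

```cpp
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Sketch: two-phase computation where phase 2 reads values written by
// other blocks in phase 1. produce() and consume() are hypothetical.
__global__ void grid_sync_kernel(float *data) {
    cg::grid_group grid = cg::this_grid();

    // Phase 1: every thread in the grid writes its slot.
    data[grid.thread_rank()] = produce(grid.thread_rank());

    // Wait for all threads in the entire grid before reading.
    grid.sync();

    // Phase 2: safe to read values written by any other block.
    consume(data, grid.thread_rank());
}
```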
4.4.5.2 Barriers

Cooperative Groups provides a barrier API similar to cuda::barrier that can be used for more advanced synchronization. The Cooperative Groups barrier API differs from cuda::barrier in a few key ways:

- Cooperative Groups barriers are automatically initialized.
- All threads in the group must arrive and wait at the barrier once per phase.
- barrier_arrive returns an arrival_token object that must be passed into the corresponding barrier_wait, where it is consumed and cannot be used again.

Programmers must take care to avoid hazards when using Cooperative Groups barriers:

- No collective operations can be used by a group after calling barrier_arrive and before calling barrier_wait.
- barrier_wait only guarantees that all threads in the group have called barrier_arrive. barrier_wait does NOT guarantee that all threads have called barrier_wait.
namespace cg = cooperative_groups;
cg::thread_block block = cg::this_thread_block();
cg::cluster_group cluster = cg::this_cluster();
auto token = cluster.barrier_arrive();

// Optional: Do some local processing to hide the synchronization latency
local_processing(block);

// Make sure all other blocks in the cluster are running and initialized shared
// data before accessing dsmem
cluster.barrier_wait(std::move(token));
| | 4.4.6. | Collective | | Operations | | |
| | ------ | ---------- | --- | ---------- | | |
Cooperative Groups includes a set of collective operations that can be performed by a group of threads. These operations require participation of all threads in the specified group in order to complete the operation.

All threads in the group must pass the same values for corresponding arguments to each collective call, unless different values are explicitly allowed in the Cooperative Groups API. Otherwise the behavior of the call is undefined.
| 4.4.6.1 Reduce | |
| The reduce function is used to perform a parallel reduction on the data provided by each thread in | |
| thespecifiedgroup. Thetypeofreductionmustbespecifiedbyprovidingoneoftheoperatorsshown | |
| inthefollowingtable. | |
Table 13: Cooperative Groups Reduction Operators

| Operator | Returns                    |
| -------- | -------------------------- |
| plus     | Sum of all values in group |
| less     | Minimum value              |
| greater  | Maximum value              |
| bit_and  | Bitwise AND reduction      |
| bit_or   | Bitwise OR reduction       |
| bit_xor  | Bitwise XOR reduction      |
Hardware acceleration is used for reductions when available (requires Compute Capability 8.0 or greater). A software fallback is available for older hardware where hardware acceleration is not available. Only 4-byte types are accelerated by hardware.
More information about reductions is available in the Cooperative Groups API.
The following example shows how to use cooperative_groups::reduce() to perform a block-wide sum reduction.
namespace cg = cooperative_groups;
cg::thread_block my_group = cg::this_thread_block();
int val = data[threadIdx.x];
int sum = cg::reduce(my_group, val, cg::plus<int>());
// Store the result from the reduction
if (my_group.thread_rank() == 0) {
    result[blockIdx.x] = sum;
}
4.4.6.2 Scans
Cooperative Groups includes implementations of inclusive_scan and exclusive_scan that can be used on arbitrary group sizes. The functions perform a scan operation on the data provided by each thread in the specified group.
Programmers can optionally specify a reduction operator, as listed in the Reduction Operators table above.
namespace cg = cooperative_groups;
cg::thread_block my_group = cg::this_thread_block();
int val = data[my_group.thread_rank()];
int exclusive_sum = cg::exclusive_scan(my_group, val, cg::plus<int>());
result[my_group.thread_rank()] = exclusive_sum;

More information about scans is available in the Cooperative Groups Scan API.
4.4.6.3 Invoke One
Cooperative Groups provides an invoke_one function for use when a single thread must perform a serial portion of work on behalf of a group.
▶ invoke_one selects a single arbitrary thread from the calling group and uses that thread to call the supplied invocable function using the supplied arguments.
▶ invoke_one_broadcast is the same as invoke_one except the result of the call is also broadcast to all threads in the group.
The thread selection mechanism is not guaranteed to be deterministic.
The following example shows basic invoke_one utilization.
namespace cg = cooperative_groups;
cg::thread_block my_group = cg::this_thread_block();
// Ensure only one thread in the thread block prints the message
cg::invoke_one(my_group, []() {
    printf("Hello from one thread in the block!");
});
// Synchronize to make sure all threads wait until the message is printed
cg::sync(my_group);
Communication or synchronization within the calling group is not allowed inside the invocable function. Communication with threads outside of the calling group is allowed.
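As a minimal sketch of the invoke_one_broadcast variant described above (the device counter global_counter and the atomicAdd-based reservation are illustrative assumptions, not part of the original example):

```cpp
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__device__ unsigned int global_counter;  // assumed global counter, for illustration

__global__ void reserve_block_slot() {
    cg::thread_block my_group = cg::this_thread_block();
    // One arbitrary thread performs the atomic reservation; the returned
    // value is then broadcast to every thread in the group.
    unsigned int offset = cg::invoke_one_broadcast(my_group, [&]() {
        return atomicAdd(&global_counter, my_group.num_threads());
    });
    // Every thread in the block now holds the same reserved offset.
    (void)offset;
}
```

This pattern reduces atomic traffic to one operation per group while still giving every thread the result.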
4.4.7. Asynchronous Data Movement
Cooperative Groups provides the memcpy_async function to perform asynchronous memory copies between global memory and shared memory. memcpy_async is particularly useful for optimizing memory transfers and overlapping computation with data transfer to improve performance.
The memcpy_async function is used to start an asynchronous load from global memory to shared memory. memcpy_async is intended to be used like a “prefetch” where data is loaded before it is needed.
The wait function forces all threads in a group to wait until the asynchronous memory transfer is completed. wait must be called by all threads in the group before the data can be accessed in shared memory.
The following example shows how to use memcpy_async and wait to prefetch data.
namespace cg = cooperative_groups;
cg::thread_block my_group = cg::this_thread_block();
extern __shared__ int shared_data[];
// Perform an asynchronous copy from global memory to shared memory
cg::memcpy_async(my_group, shared_data, input, sizeof(int) * my_group.num_threads());
// Hide latency by doing work here. Cannot use shared_data
// Wait for the asynchronous copy to complete
cg::wait(my_group);
// Prefetched data is now available
See the Cooperative Groups API for more information.
4.4.7.1 Memcpy Async Alignment Requirements
memcpy_async is only asynchronous if the source is global memory, the destination is shared memory, and both are at least 4-byte aligned. For best performance, an alignment of 16 bytes for both shared memory and global memory is recommended.
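As a minimal sketch of the 16-byte alignment recommendation (the tile size, kernel name, and global_src pointer are illustrative assumptions):

```cpp
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

__global__ void aligned_prefetch(const int *global_src) {
    // alignas(16) keeps the shared tile on a 16-byte boundary so the copy
    // can take the widest asynchronous path; pointers returned by
    // cudaMalloc are already sufficiently aligned on the global side.
    __shared__ alignas(16) int tile[256];
    cg::thread_block block = cg::this_thread_block();
    cg::memcpy_async(block, tile, global_src, sizeof(tile));
    cg::wait(block);
    // tile[] is now safe to read
}
```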
4.4.8. Large Scale Groups
Cooperative Groups allows for large groups that span the entire grid. All Cooperative Groups functionality described previously is available to these large groups, with one notable exception: synchronizing the entire grid requires using the cudaLaunchCooperativeKernel runtime launch API.
Multi-device launch APIs and related references for Cooperative Groups have been removed as of CUDA 13.
4.4.8.1 When to use cudaLaunchCooperativeKernel
cudaLaunchCooperativeKernel is a CUDA runtime API function used to launch a single-device kernel that employs cooperative groups, specifically designed for executing kernels that require inter-block synchronization. This function ensures that all threads in the kernel can synchronize and cooperate across the entire grid, which is not possible with traditional CUDA kernels that only allow synchronization within individual thread blocks. cudaLaunchCooperativeKernel ensures that the kernel launch is atomic, i.e., if the API call succeeds, then the provided number of thread blocks will launch on the specified device.
It is good practice to first ensure the device supports cooperative launches by querying the device attribute cudaDevAttrCooperativeLaunch:

int dev = 0;
int supportsCoopLaunch = 0;
cudaDeviceGetAttribute(&supportsCoopLaunch, cudaDevAttrCooperativeLaunch, dev);

which will set supportsCoopLaunch to 1 if the property is supported on device 0. Only devices with compute capability 6.0 and higher are supported. In addition, you need to be running on one of these:
▶ The Linux platform without MPS
▶ The Linux platform with MPS and on a device with compute capability 7.0 or higher
▶ The latest Windows platform
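Putting the pieces together, a hedged sketch of a grid-wide cooperative launch might look as follows (the kernel body, grid sizing, and variable names are illustrative; in practice the grid must be sized so all thread blocks can be resident simultaneously, e.g., via cudaOccupancyMaxActiveBlocksPerMultiprocessor):

```cpp
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void coop_kernel(int *data) {
    cg::grid_group grid = cg::this_grid();
    data[grid.thread_rank()] += 1;  // phase 1: every thread updates its element
    grid.sync();                    // grid-wide barrier, valid only with a cooperative launch
    // phase 2: any thread may now safely read results written by any block
}

// Host side: kernel arguments are passed as an array of pointers.
void launch(int *d_data, int numBlocks, int blockSize, cudaStream_t stream) {
    void *args[] = { &d_data };
    cudaLaunchCooperativeKernel((void *)coop_kernel, dim3(numBlocks),
                                dim3(blockSize), args, 0 /*sharedMem*/, stream);
}
```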
4.5. Programmatic Dependent Launch and Synchronization
The Programmatic Dependent Launch mechanism allows a dependent secondary kernel to launch before the primary kernel it depends on in the same CUDA stream has finished executing. Available starting with devices of compute capability 9.0, this technique can provide performance benefits when the secondary kernel can complete significant work that does not depend on the results of the primary kernel.
4.5.1. Background
A CUDA application utilizes the GPU by launching and executing multiple kernels on it. A typical GPU activity timeline is shown in Figure 39.
Figure 39: GPU activity timeline
Here, secondary_kernel is launched after primary_kernel finishes its execution. Serialized execution is usually necessary because secondary_kernel depends on result data produced by primary_kernel. If secondary_kernel has no dependency on primary_kernel, both of them can be launched concurrently by using CUDA Streams. Even if secondary_kernel is dependent on primary_kernel, there is some potential for concurrent execution. For example, almost all kernels have some sort of preamble section during which tasks such as zeroing buffers or loading constant values are performed.
Figure 40: Preamble section of secondary_kernel
Figure 40 demonstrates the portion of secondary_kernel that could be executed concurrently without impacting the application. Note that concurrent launch also allows us to hide the launch latency of secondary_kernel behind the execution of primary_kernel.
Figure 41: Concurrent execution of primary_kernel and secondary_kernel
The concurrent launch and execution of secondary_kernel shown in Figure 41 is achievable using Programmatic Dependent Launch.
Programmatic Dependent Launch introduces changes to the CUDA kernel launch APIs as explained in the following section. These APIs require at least compute capability 9.0 to provide overlapping execution.
4.5.2. API Description
In Programmatic Dependent Launch, a primary and a secondary kernel are launched in the same CUDA stream. The primary kernel should execute cudaTriggerProgrammaticLaunchCompletion with all thread blocks when it's ready for the secondary kernel to launch. The secondary kernel must be launched using the extensible launch API as shown.
__global__ void primary_kernel() {
    // Initial work that should finish before starting secondary kernel

    // Trigger the secondary kernel
    cudaTriggerProgrammaticLaunchCompletion();

    // Work that can coincide with the secondary kernel
}

__global__ void secondary_kernel()
{
    // Independent work

    // Will block until all primary kernels the secondary kernel is dependent on
    // have completed and flushed results to global memory
    cudaGridDependencySynchronize();

    // Dependent work
}

cudaLaunchConfig_t configSecondary = {0};
// gridDim, blockDim, and stream for the secondary launch are set on configSecondary (not shown)

cudaLaunchAttribute attribute[1];
attribute[0].id = cudaLaunchAttributeProgrammaticStreamSerialization;
attribute[0].val.programmaticStreamSerializationAllowed = 1;
configSecondary.attrs = attribute;
configSecondary.numAttrs = 1;

primary_kernel<<<grid_dim, block_dim, 0, stream>>>();
cudaLaunchKernelEx(&configSecondary, secondary_kernel);
When the secondary kernel is launched using the cudaLaunchAttributeProgrammaticStreamSerialization attribute, the CUDA driver is safe to launch the secondary kernel early and not wait on the completion and memory flush of the primary before launching the secondary.
The CUDA driver can launch the secondary kernel when all primary thread blocks have launched and executed cudaTriggerProgrammaticLaunchCompletion. If the primary kernel doesn't execute the trigger, it implicitly occurs after all thread blocks in the primary kernel exit.
In either case, the secondary thread blocks might launch before data written by the primary kernel is visible. As such, when the secondary kernel is configured with Programmatic Dependent Launch, it must always use cudaGridDependencySynchronize or other means to verify that the result data from the primary is available.
Please note that these methods provide the opportunity for the primary and secondary kernels to execute concurrently; however, this behavior is opportunistic and not guaranteed to lead to concurrent kernel execution. Reliance on concurrent execution in this manner is unsafe and can lead to deadlock.
4.5.3. Use in CUDA Graphs
Programmatic Dependent Launch can be used in CUDA Graphs via stream capture or directly via edge data. To program this feature in a CUDA Graph with edge data, use a cudaGraphDependencyType value of cudaGraphDependencyTypeProgrammatic on an edge connecting two kernel nodes. This edge type makes the upstream kernel visible to a cudaGridDependencySynchronize() in the downstream kernel. This type must be used with an outgoing port of either cudaGraphKernelNodePortLaunchCompletion or cudaGraphKernelNodePortProgrammatic.
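As a hedged sketch of wiring such an edge directly (the graph and the two kernel nodes kernelNodeA and kernelNodeB are assumed to exist already; the names are illustrative):

```cpp
cudaGraphEdgeData edgeData = {};
edgeData.type = cudaGraphDependencyTypeProgrammatic;
edgeData.from_port = cudaGraphKernelNodePortProgrammatic;
// to_port is left at its default of 0
cudaGraphAddDependencies_v2(graph, &kernelNodeA, &kernelNodeB, &edgeData, 1);
```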
The resulting graph equivalents for stream capture are as follows:

Stream code (abbreviated):
    cudaLaunchAttribute attribute;
    attribute.id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attribute.val.programmaticStreamSerializationAllowed = 1;
Resulting graph edge:
    cudaGraphEdgeData edgeData;
    edgeData.type = cudaGraphDependencyTypeProgrammatic;
    edgeData.from_port = cudaGraphKernelNodePortProgrammatic;

Stream code (abbreviated):
    cudaLaunchAttribute attribute;
    attribute.id = cudaLaunchAttributeProgrammaticEvent;
    attribute.val.programmaticEvent.triggerAtBlockStart = 0;
Resulting graph edge:
    cudaGraphEdgeData edgeData;
    edgeData.type = cudaGraphDependencyTypeProgrammatic;
    edgeData.from_port = cudaGraphKernelNodePortProgrammatic;

Stream code (abbreviated):
    cudaLaunchAttribute attribute;
    attribute.id = cudaLaunchAttributeProgrammaticEvent;
    attribute.val.programmaticEvent.triggerAtBlockStart = 1;
Resulting graph edge:
    cudaGraphEdgeData edgeData;
    edgeData.type = cudaGraphDependencyTypeProgrammatic;
    edgeData.from_port = cudaGraphKernelNodePortLaunchCompletion;
4.6. Green Contexts
A green context (GC) is a lightweight context associated, from its creation, with a set of specific GPU resources. Users can partition GPU resources, currently streaming multiprocessors (SMs) and work queues (WQs), during green context creation, so that GPU work targeting a green context can only use its provisioned SMs and work queues. Doing so can be beneficial in reducing, or better controlling, interference due to use of common resources. An application can have multiple green contexts.
Using green contexts does not require any GPU code (kernel) changes, just small host-side changes (e.g., green context creation and stream creation for this green context). The green context functionality can be useful in various scenarios. For example, it can help ensure some SMs are always available for a latency-sensitive kernel to start executing, assuming no other constraints, or provide a quick way to test the effect of using fewer SMs without any kernel modifications.
Green context support first became available via the CUDA Driver API. Starting from CUDA 13.1, contexts are exposed in the CUDA runtime via the execution context (EC) abstraction. Currently, an execution context can correspond to either the primary context (the context runtime API users have always implicitly interacted with) or a green context. This section will use the terms execution context and green context interchangeably when referring to a green context.
With the runtime exposure of green contexts, using the CUDA runtime API directly is strongly recommended. This section will also solely use the CUDA runtime API.
The remainder of this section is organized as follows: Section 4.6.1 provides a motivating example, Section 4.6.2 highlights ease of use, and Section 4.6.3 presents the device resource and resource descriptor structs. Section 4.6.4 explains how to create a green context, Section 4.6.5 how to launch work that targets it, and Section 4.6.6 highlights some additional green context APIs. Finally, Section 4.6.7 wraps up with an example.
4.6.1. Motivation / When to Use
When launching a CUDA kernel, the user has no direct control over the number of SMs that kernel will execute on. One can only indirectly influence this by changing the kernel's launch geometry or anything that can affect the kernel's maximum number of active thread blocks per SM. Additionally, when multiple kernels execute in parallel on the GPU (kernels running on different CUDA streams or as part of a CUDA graph), they may also contend for the same SM resources.
There are, however, use cases where the user needs to ensure there are always GPU resources available for latency-sensitive work to start, and thus complete, as soon as possible. Green contexts provide a way towards that by partitioning SM resources, so a given green context can only use specific SMs (the ones provisioned during its creation).
Figure 42 illustrates such an example. Assume an application where two independent kernels A and B run on two different non-blocking CUDA streams. Kernel A is launched first and starts executing, occupying all available SM resources. When, later in time, latency-sensitive kernel B is launched, no SM resources are available. As a result, kernel B can only start executing once kernel A ramps down, i.e., once thread blocks from kernel A finish executing. The first graph illustrates this scenario, where critical work B gets delayed. The y-axis shows the percentage of SMs occupied and the x-axis depicts time.
Using green contexts, one could partition the GPU's SMs so that green context A, targeted by kernel A, has access to some SMs of the GPU, while green context B, targeted by kernel B, has access to the remaining SMs. In this setting, kernel A can only use the SMs provisioned for green context A, irrespective of its launch configuration. As a result, when critical kernel B gets launched, it is guaranteed that there will be available SMs for it to start executing immediately, barring any other resource constraints. As the second graph in Figure 42 illustrates, even though the duration of kernel A may increase, latency-sensitive work B will no longer be delayed due to unavailable SMs. The figure shows green context A provisioned with an SM count equivalent to 80% of the GPU's SMs for illustration purposes.
This behavior can be achieved without any code modifications to kernels A and B. One simply needs to ensure they are launched on CUDA streams belonging to the appropriate green contexts. The number of SMs each green context will have access to should be decided by the user during green context creation on a per-case basis.
Figure 42: Motivation: GCs' static resource partitioning enables latency-sensitive work B to start and complete sooner
Work Queues:
Streaming multiprocessors are one resource type that can be provisioned for a green context. Another resource type is work queues. Think of a work queue as a black-box resource abstraction, which can also influence GPU work execution concurrency, along with other factors. If independent GPU work tasks (e.g., kernels submitted on different CUDA streams) map to the same work queue, a false dependence between these tasks may be introduced, which can lead to their serialized execution. The user can influence the upper limit of work queues on the GPU via the CUDA_DEVICE_MAX_CONNECTIONS environment variable (see Section 5.2, Section 3.1).
Building on top of the previous example, assume work B maps to the same work queue as work A. In that case, even if SM resources are available (green contexts case), work B may still need to wait for work A to complete in its entirety. Similar to SMs, the user has no direct control over the specific work queues that may be used under the hood. But green contexts allow the user to express the maximum concurrency they would expect in terms of the expected number of concurrent stream-ordered workloads. The driver can then use this value as a hint to try to prevent work from different execution contexts from using the same work queue(s), thus preventing unwanted interference across execution contexts.
Attention
Even when different SM resources and work queues are provisioned per green context, concurrent execution of independent GPU work is not guaranteed. It is best to think of all the techniques described under the Green Contexts section as removing factors which can prevent concurrent execution (i.e., reducing potential interference).
Green Contexts versus MIG or MPS
For completeness, this section briefly compares green contexts with two other resource partitioning mechanisms: MIG (Multi-Instance GPU) and MPS (Multi-Process Service).
MIG statically partitions a MIG-supported GPU into multiple MIG instances (“smaller GPUs”). This partitioning has to happen before the launch of an application, and different applications can use different MIG instances. Using MIG can be beneficial for users whose applications consistently underutilize the available GPU resources, an issue more pronounced as GPUs get bigger. With MIG, users can run these different applications on different MIG instances, thus improving GPU utilization. MIG can be attractive for cloud service providers (CSPs) not only for the increased GPU utilization for such applications, but also for the quality of service (QoS) and isolation it can provide across clients running on different MIG instances. Please refer to the MIG documentation linked above for more details.
But using MIG cannot address the problematic scenario described earlier, where critical work B is delayed because all SM resources are occupied by other GPU work from the same application. This issue can still exist for an application running on a single MIG instance. To address it, one can use green contexts alongside MIG. In that case, the SM resources available for partitioning would be the resources of the given MIG instance.
MPS primarily targets different processes (e.g., MPI programs), allowing them to run on the GPU at the same time without time-slicing. It requires an MPS daemon to be running before the application is launched. By default, MPS clients will contend for all available SM resources of the GPU or the MIG instance they are running on. In this multiple-client-process setting, MPS can support dynamic partitioning of SM resources, using the active thread percentage option, which places an upper limit on the percentage of SMs an MPS client process can use. Unlike green contexts, the active thread percentage partitioning happens with MPS at the process level, and the percentage is typically specified by an environment variable before the application is launched. The MPS active thread percentage signifies that a given client application cannot use more than x% of a GPU's SMs, let that be N SMs. However, these SMs can be any N SMs of the GPU, which can also vary over time. On the other hand, a green context provisioned with N SMs during its creation can only use these specific N SMs.
Starting with CUDA 13.1, MPS also supports static partitioning, if it is explicitly enabled when starting the MPS control daemon. With static partitioning, the user has to specify the static partition an MPS client process can use when the application is launched. Dynamic sharing with active thread percentage is no longer applicable in that case. A key difference between MPS in static partitioning mode and green contexts is that MPS targets different processes, while green contexts are applicable within a single process too. Also, contrary to green contexts, MPS with static partitioning does not allow oversubscription of SM resources.
With MPS, programmatic partitioning of SM resources is also possible for a CUDA context created via the cuCtxCreate driver API with execution affinity. This programmatic partitioning allows different client CUDA contexts from one or more processes to each use up to a specified number of SMs. As with the active thread percentage partitioning, these SMs can be any SMs of the GPU and can vary over time, unlike the green contexts case. This option is possible even under the presence of static MPS partitioning. Please note that creating a green context is much more lightweight in comparison to an MPS context, as many underlying structures are owned by the primary context and thus shared.
4.6.2. Green Contexts: Ease of use
To highlight how easy it is to use green contexts, assume you have the following code snippet that creates two CUDA streams and then calls a function that launches kernels via <<<>>> on these CUDA streams. As discussed earlier, other than changing the kernels' launch geometries, one cannot influence how many SMs these kernels can use.
int gpu_device_index = 0; // GPU ordinal
CUDA_CHECK(cudaSetDevice(gpu_device_index));
cudaStream_t strm1, strm2;
CUDA_CHECK(cudaStreamCreateWithFlags(&strm1, cudaStreamNonBlocking));
CUDA_CHECK(cudaStreamCreateWithFlags(&strm2, cudaStreamNonBlocking));
// No control over how many SMs kernel(s) running on each stream can use
code_that_launches_kernels_on_streams(strm1, strm2); // what is abstracted in this function + the kernels is the vast majority of your code
// cleanup code not shown
Starting with CUDA 13.1, one can control the number of SMs a given kernel can have access to, using green contexts. The code snippet below shows how easy it is to do that. With a few extra lines and without any kernel modifications, you can control the SM resources kernel(s) launched on these different streams can use.
int gpu_device_index = 0; // GPU ordinal
CUDA_CHECK(cudaSetDevice(gpu_device_index));

/* ------------------ Code required to create green contexts ------------------- */
// Get all available GPU SM resources
cudaDevResource initial_GPU_SM_resources {};
CUDA_CHECK(cudaDeviceGetDevResource(gpu_device_index, &initial_GPU_SM_resources,
                                    cudaDevResourceTypeSm));

// Split SM resources. This example creates one group with 16 SMs and one with 8,
// assuming your GPU has >= 24 SMs
cudaDevSmResource result[2] {{}, {}};
cudaDevSmResourceGroupParams group_params[2] = {
    {.smCount=16, .coscheduledSmCount=0, .preferredCoscheduledSmCount=0, .flags=0},
    {.smCount=8, .coscheduledSmCount=0, .preferredCoscheduledSmCount=0, .flags=0}};
CUDA_CHECK(cudaDevSmResourceSplit(&result[0], 2, &initial_GPU_SM_resources,
                                  nullptr, 0, &group_params[0]));

// Generate resource descriptors for each resource
cudaDevResourceDesc_t resource_desc1 {};
cudaDevResourceDesc_t resource_desc2 {};
CUDA_CHECK(cudaDevResourceGenerateDesc(&resource_desc1, &result[0], 1));
CUDA_CHECK(cudaDevResourceGenerateDesc(&resource_desc2, &result[1], 1));

// Create green contexts
cudaExecutionContext_t my_green_ctx1 {};
cudaExecutionContext_t my_green_ctx2 {};
CUDA_CHECK(cudaGreenCtxCreate(&my_green_ctx1, resource_desc1, gpu_device_index, 0));
CUDA_CHECK(cudaGreenCtxCreate(&my_green_ctx2, resource_desc2, gpu_device_index, 0));

/* ------------------ Modified code --------------------------- */
// You just need to use a different CUDA API to create the streams
cudaStream_t strm1, strm2;
CUDA_CHECK(cudaExecutionCtxStreamCreate(&strm1, my_green_ctx1, cudaStreamDefault, 0));
CUDA_CHECK(cudaExecutionCtxStreamCreate(&strm2, my_green_ctx2, cudaStreamDefault, 0));

/* ------------------ Unchanged code --------------------------- */
// No need to modify any code in this function or in your kernel(s).
// Reminder: what is abstracted in this function + kernels is the vast majority of your code
// Now kernel(s) running on stream strm1 will use at most 16 SMs and kernel(s) on strm2 at most 8 SMs.
code_that_launches_kernels_on_streams(strm1, strm2);
// cleanup code not shown
Various execution context APIs, some of which were shown in the previous example, take an explicit cudaExecutionContext_t handle and thus ignore the context that is current to the calling thread. Until now, CUDA runtime users who did not use the driver API would by default only interact with the primary context that is implicitly set as current to a thread via cudaSetDevice(). This shift to explicit context-based programming provides easier-to-understand semantics and can have additional benefits compared to the previous implicit context-based programming that relied on thread-local state (TLS).

The following sections explain all the steps shown in the previous code snippet in detail.
4.6.3. Green Contexts: Device Resource and Resource Descriptor
At the heart of a green context is a device resource (cudaDevResource) tied to a specific GPU device. Resources can be combined and encapsulated into a descriptor (cudaDevResourceDesc_t). A green context only has access to the resources encapsulated into the descriptor used for its creation.

Currently the cudaDevResource data structure is defined as:
struct {
    enum cudaDevResourceType type;
    union {
        struct cudaDevSmResource sm;
        struct cudaDevWorkqueueConfigResource wqConfig;
        struct cudaDevWorkqueueResource wq;
    };
};
The supported valid resource types are cudaDevResourceTypeSm, cudaDevResourceTypeWorkqueueConfig and cudaDevResourceTypeWorkqueue, while cudaDevResourceTypeInvalid identifies an invalid resource type.

A valid device resource can be associated with:
▶ a specific set of streaming multiprocessors (SMs) (resource type cudaDevResourceTypeSm),
▶ a specific workqueue configuration (resource type cudaDevResourceTypeWorkqueueConfig), or
▶ a pre-existing workqueue resource (resource type cudaDevResourceTypeWorkqueue).
One can query whether a given execution context or CUDA stream is associated with a cudaDevResource resource of a given type, using the cudaExecutionCtxGetDevResource and cudaStreamGetDevResource APIs respectively. An execution context can be associated with different types of device resources at the same time (e.g., SMs and workqueues), while a stream can only be associated with an SM-type resource.
| | 236 | | | | | Chapter4. | CUDAFeatures | | |
| | --- | --- | --- | --- | --- | --------- | ------------ | | |
| CUDAProgrammingGuide,Release13.1 | |
A given GPU device has, by default, all three device resource types: an SM-type resource encompassing all the SMs of the GPU, a workqueue configuration resource encompassing all available workqueues, and its corresponding workqueue resource. These resources can be retrieved via the cudaDeviceGetDevResource API.
Overview of relevant device resource structs

The different resource type structs have fields that are set either explicitly by the user or by a relevant CUDA API call. It is recommended to zero-initialize all device resource structs.
▶ An SM-type device resource (cudaDevSmResource) has the following relevant fields:
  ▶ unsigned int smCount: number of SMs available in this resource
  ▶ unsigned int minSmPartitionSize: minimum SM count required to partition this resource
  ▶ unsigned int smCoscheduledAlignment: number of SMs in the resource guaranteed to be co-scheduled on the same GPU processing cluster, which is relevant for thread block clusters. smCount is a multiple of this value when flags is zero.
  ▶ unsigned int flags: supported flags are 0 (default) and cudaDevSmResourceGroupBackfill (see cudaDevSmResourceGroup flags).
  The above fields will be set either via the appropriate split API (cudaDevSmResourceSplitByCount or cudaDevSmResourceSplit) used to create this SM-type resource, or will be populated by the cudaDeviceGetDevResource API which retrieves the SM resources of a given GPU device. These fields should never be set directly by the user. See the next section for more details.
▶ A workqueue configuration device resource (cudaDevWorkqueueConfigResource) has the following relevant fields:
  ▶ int device: the device on which the workqueue resources are available
  ▶ unsigned int wqConcurrencyLimit: the number of stream-ordered workloads expected to avoid false dependencies
  ▶ enum cudaDevWorkqueueConfigScope sharingScope: the sharing scope for the workqueue resources. Supported values are cudaDevWorkqueueConfigScopeDeviceCtx (default) and cudaDevWorkqueueConfigScopeGreenCtxBalanced. With the default option, all workqueue resources are shared across all contexts, while with the balanced option the driver tries to use non-overlapping workqueue resources across green contexts wherever possible, using the user-specified wqConcurrencyLimit as a hint.
  These fields need to be set by the user. There is no CUDA API similar to the split APIs that generates a workqueue configuration resource, with the exception of the workqueue configuration resource populated by the cudaDeviceGetDevResource API. That API can retrieve the workqueue configuration resources of a given GPU device.
▶ Finally, a pre-existing workqueue resource (cudaDevResourceTypeWorkqueue) has no fields that can be set by the user. As with the other resource types, cudaDeviceGetDevResource can retrieve the pre-existing workqueue resource of a given GPU device.
| 4.6.4. Green Context Creation Example | |
There are four main steps involved in green context creation:
▶ Step 1: Start with an initial set of resources, e.g., by fetching the available resources of the GPU.
▶ Step 2: Partition the SM resources into one or more partitions (using one of the available split APIs).
▶ Step 3: Create a resource descriptor combining, if needed, different resources.
▶ Step 4: Create a green context from the descriptor, provisioning its resources.
After the green context has been created, you can create CUDA streams belonging to that green context. GPU work subsequently launched on such a stream, such as a kernel launched via <<< >>>, will only have access to this green context's provisioned resources. Libraries can also easily leverage green contexts, as long as the user passes a stream belonging to a green context to them. See Green Contexts - Launching work for more details.
4.6.4.1 Step 1: Get available GPU resources
The first step in green context creation is to get the available device resources and populate the cudaDevResource struct(s). There are currently three possible starting points: a device, an execution context or a CUDA stream.

The relevant CUDA runtime API function signatures are listed below:
▶ For a device: cudaError_t cudaDeviceGetDevResource(int device, cudaDevResource* resource, cudaDevResourceType type)
▶ For an execution context: cudaError_t cudaExecutionCtxGetDevResource(cudaExecutionContext_t ctx, cudaDevResource* resource, cudaDevResourceType type)
▶ For a stream: cudaError_t cudaStreamGetDevResource(cudaStream_t hStream, cudaDevResource* resource, cudaDevResourceType type)
All valid cudaDevResourceType values are permitted for each of these APIs, with the exception of cudaStreamGetDevResource, which only supports an SM-type resource.
Usually, the starting point will be a GPU device. The code snippet below shows how to get the available SM resources of a given GPU device. After a successful cudaDeviceGetDevResource call, the user can review the number of SMs available in this resource.
int current_device = 0; // assume device ordinal of 0
CUDA_CHECK(cudaSetDevice(current_device));
cudaDevResource initial_SM_resources = {};
CUDA_CHECK(cudaDeviceGetDevResource(current_device /* GPU device */,
                                    &initial_SM_resources /* device resource to populate */,
                                    cudaDevResourceTypeSm /* resource type */));
std::cout << "Initial SM resources: " << initial_SM_resources.sm.smCount << " SMs" << std::endl; // number of available SMs
// Special fields relevant for partitioning (see Step 3 below)
std::cout << "Min. SM partition size: " << initial_SM_resources.sm.minSmPartitionSize << " SMs" << std::endl;
std::cout << "SM co-scheduled alignment: " << initial_SM_resources.sm.smCoscheduledAlignment << " SMs" << std::endl;
One can also get the available workqueue config. resources, as shown in the code snippet below.
| | 238 | | | | | | | | Chapter4. | CUDAFeatures | | |
| | --- | --- | --- | --- | --- | --- | --- | --- | --------- | ------------ | | |
| CUDAProgrammingGuide,Release13.1 | |
int current_device = 0; // assume device ordinal of 0
CUDA_CHECK(cudaSetDevice(current_device));
cudaDevResource initial_WQ_config_resources = {};
CUDA_CHECK(cudaDeviceGetDevResource(current_device /* GPU device */,
                                    &initial_WQ_config_resources /* device resource to populate */,
                                    cudaDevResourceTypeWorkqueueConfig /* resource type */));
std::cout << "Initial WQ config. resources: " << std::endl;
std::cout << " - WQ concurrency limit: " << initial_WQ_config_resources.wqConfig.wqConcurrencyLimit << std::endl;
std::cout << " - WQ sharing scope: " << initial_WQ_config_resources.wqConfig.sharingScope << std::endl;
After a successful cudaDeviceGetDevResource call, the user can review the wqConcurrencyLimit for this resource. When the starting point is a GPU device, the wqConcurrencyLimit will match the value of the CUDA_DEVICE_MAX_CONNECTIONS environment variable or its default value.
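As an illustration of that relationship, the helper below models how the device-level wqConcurrencyLimit can be anticipated from the environment. This is a hypothetical sketch, not a CUDA API: the fallback value of 8 when CUDA_DEVICE_MAX_CONNECTIONS is unset is an assumption for illustration, and the actual default may differ by platform and driver version.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: mirrors how the device workqueue concurrency limit is
// expected to track CUDA_DEVICE_MAX_CONNECTIONS. The fallback of 8 is an
// assumption for illustration, not a documented guarantee.
unsigned int expected_wq_concurrency_limit() {
    const char* env = std::getenv("CUDA_DEVICE_MAX_CONNECTIONS");
    return env ? static_cast<unsigned int>(std::strtoul(env, nullptr, 10)) : 8u;
}
```

For example, running the application with `CUDA_DEVICE_MAX_CONNECTIONS=16` would make the device-level limit report 16.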
4.6.4.2 Step 2: Partition SM resources
The second step in green context creation is to statically split the available cudaDevResource SM resources into one or more partitions, with potentially some SMs left over in a remaining partition. This partitioning is possible using the cudaDevSmResourceSplitByCount() or the cudaDevSmResourceSplit() API. The cudaDevSmResourceSplitByCount() API can only create one or more homogeneous partitions, plus a potential remaining partition, while the cudaDevSmResourceSplit() API can also create heterogeneous partitions, plus the potential remaining one. The subsequent sections describe the functionality of both APIs in detail. Both APIs are only applicable to SM-type device resources.
cudaDevSmResourceSplitByCount API

The cudaDevSmResourceSplitByCount runtime API signature is:

cudaError_t cudaDevSmResourceSplitByCount(cudaDevResource* result, unsigned int* nbGroups, const cudaDevResource* input, cudaDevResource* remaining, unsigned int useFlags, unsigned int minCount)
As Figure 43 highlights, the user requests to split the input SM-type device resource into *nbGroups homogeneous groups with minCount SMs each. However, the end result will contain a potentially updated *nbGroups number of homogeneous groups with N SMs each. The potentially updated *nbGroups will be less than or equal to the originally requested group number, while N will be equal to or greater than minCount. These adjustments may occur due to some granularity and alignment requirements, which are architecture specific.

Figure 43: SM resource split using the cudaDevSmResourceSplitByCount API
Table 30 lists the minimum SM partition size and the SM co-scheduled alignment for all the currently supported compute capabilities, for the default useFlags=0 case. One can also retrieve these values via the minSmPartitionSize and smCoscheduledAlignment fields of cudaDevSmResource, as shown in Step 1: Get available GPU resources. Some of these requirements can be lowered via a different useFlags value. Table 14 provides some relevant examples highlighting the difference between what is requested and the final result, along with an explanation. The table focuses on compute capability 9.0 (CC 9.0), where the minimum number of SMs per partition is 8 and the SM count has to be a multiple of 8, if useFlags is zero.
Table 14: Split functionality

| Requested *nbGroups | Requested minCount | useFlags | Actual (for GH200 with 132 SMs): *nbGroups with N SMs | Remaining SMs | Reason |
| --- | --- | --- | --- | --- | --- |
| 2 | 72 | 0 | 1 group of 72 SMs | 60 | cannot exceed 132 SMs |
| 6 | 11 | 0 | 6 groups of 16 SMs | 36 | multiple of 8 requirement |
| 6 | 11 | CU_DEV_SM_RESOURCE_SPLIT_IGNORE_SM_COSCHEDULING | 6 groups with 12 SMs each | 60 | lowered to multiple of 2 requirement |
| 2 | 1 | 0 | 2 groups with 8 SMs each | 116 | min. 8 SMs requirement |
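The adjustments in Table 14 can be reproduced with a few lines of host arithmetic. The sketch below is a minimal model of the CC 9.0 rounding rules described above (minimum partition of 8 SMs; SM count a multiple of 8, lowered to a multiple of 2 when co-scheduling is ignored); it illustrates the table's arithmetic and is not the driver's actual implementation, whose rules are architecture specific.

```cpp
#include <algorithm>
#include <cassert>

struct SplitOutcome {
    unsigned groups;       // adjusted *nbGroups
    unsigned smsPerGroup;  // N: SMs per homogeneous group
    unsigned remainingSms; // SMs left in the remaining partition
};

// Toy model of cudaDevSmResourceSplitByCount adjustments for CC 9.0
// (assumptions: minSmPartitionSize = 8; alignment 8, or 2 when the
// ignore-coscheduling flag is used).
SplitOutcome modelSplitByCount(unsigned totalSms, unsigned requestedGroups,
                               unsigned minCount, bool ignoreCoscheduling) {
    const unsigned align = ignoreCoscheduling ? 2u : 8u;
    unsigned n = std::max(minCount, 8u);       // enforce minimum partition size
    n = (n + align - 1) / align * align;       // round up to the alignment
    unsigned groups = std::min(requestedGroups, totalSms / n); // cannot exceed total SMs
    return {groups, n, totalSms - groups * n};
}
```

Plugging in the Table 14 rows for a 132-SM GH200 (e.g., 6 requested groups of 11 SMs each yields 6 groups of 16 SMs with 36 SMs remaining) reproduces the "Actual" column under this model.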
Here is a code snippet requesting to split the available SM resources into five groups of 8 SMs each:
cudaDevResource avail_resources = {};
// Code that has populated avail_resources not shown
unsigned int min_SM_count = 8;
unsigned int actual_split_groups = 5; // may be updated
cudaDevResource actual_split_result[5] = {{}, {}, {}, {}, {}};
cudaDevResource remaining_partition = {};
CUDA_CHECK(cudaDevSmResourceSplitByCount(&actual_split_result[0],
                                         &actual_split_groups,
                                         &avail_resources,
                                         &remaining_partition,
                                         0 /* useFlags */,
                                         min_SM_count));
std::cout << "Split " << avail_resources.sm.smCount << " SMs into " << actual_split_groups << " groups " \
          << "with " << actual_split_result[0].sm.smCount << " each " \
          << "and a remaining group with " << remaining_partition.sm.smCount << " SMs" << std::endl;
Be aware that:
▶ one could use result=nullptr to query the number of groups that would be created
▶ one could set remaining=nullptr, if one does not care for the SMs of the remaining partition
▶ the remaining (leftover) partition does not have the same functional or performance guarantees as the homogeneous groups in result
▶ useFlags is expected to be 0 in the default case, but values of cudaDevSmResourceSplitIgnoreSmCoscheduling and cudaDevSmResourceSplitMaxPotentialClusterSize are also supported
▶ any resulting cudaDevResource cannot be repartitioned without first creating a resource descriptor and a green context from it (i.e., steps 3 and 4 below)

Please refer to the cudaDevSmResourceSplitByCount runtime API reference for more details.
cudaDevSmResourceSplit API

As mentioned earlier, a single cudaDevSmResourceSplitByCount API call can only create homogeneous partitions, i.e., partitions with the same number of SMs, plus the remaining partition. This can be limiting for heterogeneous workloads, where work running on different green contexts has different SM count requirements. To achieve heterogeneous partitions with the split-by-count API, one would usually need to re-partition an existing resource by repeating Steps 1-4 (multiple times). Or, in some cases, one may be able to create homogeneous partitions each with SM count equal to the GCD (greatest common divisor) of all the heterogeneous partitions as part of step 2 and then merge the required number of them together as part of step 3. This last approach however is not recommended, as the CUDA driver may be able to create better partitions if larger sizes were requested up front.

The cudaDevSmResourceSplit API aims to address these limitations by allowing the user to create non-overlapping heterogeneous partitions in a single call. The cudaDevSmResourceSplit runtime API signature is:
| cudaError_t cudaDevSmResourceSplit(cudaDevResource* result, unsigned int | |
| nbGroups, const cudaDevResource* input, cudaDevResource* remainder, unsigned | |
| int flags, cudaDevSmResourceGroupParams* groupParams) | |
This API will attempt to partition the input SM-type resource into nbGroups valid device resources (groups) placed in the result array, based on the requirements specified for each one in the groupParams array. An optional remaining partition may also be created. In a successful split, as shown in Figure 44, each resource in the result can have a different number of SMs, but never zero SMs.

Figure 44: SM resource split using the cudaDevSmResourceSplit API
When requesting a heterogeneous split, one needs to specify the SM count (smCount field of the relevant groupParams entry) for each resource in result. This SM count should always be a multiple of two. For the scenario in the previous image, groupParams[0].smCount would be X, groupParams[1].smCount Y, etc. However, just specifying the SM count is not sufficient if an application uses Thread Block Clusters. Since all the thread blocks of a cluster are guaranteed to be co-scheduled, the user also needs to specify the maximum supported cluster size, if any, a given resource group should support. This is possible via the coscheduledSmCount field of the relevant groupParams entry. For GPUs with compute capability 10.0 and on (CC 10.0+), clusters can also have a preferred dimension, which is a multiple of their default cluster dimension. During a single kernel launch on supported systems, this larger preferred cluster dimension is used as much as possible, if at all, and the smaller default cluster dimension is used otherwise. The user can express this preferred cluster dimension hint via the preferredCoscheduledSmCount field of the relevant groupParams entry. Finally, there may be cases where the user may want to loosen the SM count requirements and pull in more available SMs in a given group; the user can express this backfill option by setting the flags field of the relevant groupParams entry to its non-default flag value.
To provide more flexibility, the cudaDevSmResourceSplit API also has a discovery mode, to be used when the exact SM count, for one or more groups, is not known ahead of time. For example, a user may want to create a device resource that has as many SMs as possible, while meeting some co-scheduling requirements (e.g., allowing clusters of size four). To exercise this discovery mode, the user can set the smCount field of the relevant groupParams entry (or entries) to zero. After a successful cudaDevSmResourceSplit call, the smCount field of the groupParams will have been populated with a valid non-zero value; we refer to this as the actual smCount value. If result was not null (so this was not a dry run), then the relevant group of result will also have its smCount set to the same value.

The order in which the nbGroups groupParams entries are specified matters, as they are evaluated from left (index 0) to right (index nbGroups-1).
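To see why this left-to-right evaluation matters, consider the toy model below. It is a sketch, not the driver's real algorithm: each entry greedily takes its requested smCount, and a discovery-mode entry (smCount of zero) takes all SMs still available, rounded down to the assumed multiple-of-2 granularity.

```cpp
#include <cassert>
#include <vector>

// Toy model of left-to-right evaluation of groupParams entries (assumption:
// greedy allocation with a multiple-of-2 granularity; not the driver's real
// algorithm). A requested count of 0 means discovery mode: the entry takes
// all SMs still available. An empty result signals a failed split.
std::vector<unsigned> modelSplitOrder(unsigned totalSms,
                                      const std::vector<unsigned>& requestedSmCounts) {
    std::vector<unsigned> resolved;
    unsigned available = totalSms;
    for (unsigned req : requestedSmCounts) {
        unsigned take = (req == 0) ? available / 2 * 2 : req; // discovery grabs the rest
        if (take == 0 || take > available) return {};          // nothing left: split fails
        resolved.push_back(take);
        available -= take;
    }
    return resolved;
}
```

Under this model, requesting an 8-SM group first and a discovery group second succeeds, while putting the discovery group first leaves no SMs for the 8-SM group (compare use case 5 in Table 16 below, where entry order has the same effect).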
Table 15 provides a high-level view of the supported arguments for the cudaDevSmResourceSplit API.

Table 15: Overview of the cudaDevSmResourceSplit split API

| Argument | Supported values |
| --- | --- |
| result | nullptr for an explorative dry run; a non-null pointer otherwise |
| nbGroups | number of groups to split input into |
| input | resource to split into nbGroups groups |
| remainder | nullptr if you do not want a remainder group |
| flags | 0 |
| groupParams[i].smCount, with i in [0, nbGroups) | 0 for discovery mode, or another valid smCount |
| groupParams[i].coscheduledSmCount | 0 (default) or a valid co-scheduled SM count |
| groupParams[i].preferredCoscheduledSmCount | 0 (default) or a valid preferred co-scheduled SM count (hint) |
| groupParams[i].flags | 0 (default) or cudaDevSmResourceGroupBackfill |
Notes:
1) The cudaDevSmResourceSplit API's return value depends on result:
▶ result != nullptr: the API will return cudaSuccess only when the split is successful and nbGroups valid cudaDevResource groups, meeting the specified requirements, were created; otherwise, it will return an error. As different types of errors may return the same error code (e.g., CUDA_ERROR_INVALID_RESOURCE_CONFIGURATION), it is recommended to use the CUDA_LOG_FILE environment variable to get more informative error descriptions during development.
▶ result == nullptr: the API may return cudaSuccess even if the resulting smCount of a group is zero, a case which would have returned an error with a non-nullptr result. Think of this mode as a dry-run test you can use while exploring what is supported, especially in discovery mode.
2) On a successful call with result != nullptr, the resulting result[i] device resource with i in [0, nbGroups) will be of type cudaDevResourceTypeSm and have a result[i].sm.smCount that will either be the non-zero user-specified groupParams[i].smCount value or the discovered one. In both cases, the result[i].sm.smCount will meet all the following constraints:
| ▶ | |
| | beamultiple | | of 2and | | | | | |
| | ------------ | --- | ------------------------- | --- | --- | --- | | |
| | ▶ beinthe[2, | | input.sm.smCount]rangeand | | | | | |
| ▶ (flags == 0) ? (multiple of actual group_params[i].coscheduledSmCount) : | |
| | (>= | groups_params[i].coscheduledSmCount) | | | | | | |
| | --- | ------------------------------------ | --- | --- | --- | --- | | |
3) Specifying zero for any of the coscheduledSmCount and preferredCoscheduledSmCount fields indicates that the default values for these fields should be used; these can vary per GPU. These default values are both equal to the smCoscheduledAlignment of the SM resource retrieved via the cudaDeviceGetDevResource API for the given device (and not any SM resource). To review these default values, one can examine their updated values in the relevant groupParams entry after a successful cudaDevSmResourceSplit call with them initially set to 0; see below.
int gpu_device_index = 0;
cudaDevResource initial_GPU_SM_resources {};
CUDA_CHECK(cudaDeviceGetDevResource(gpu_device_index, &initial_GPU_SM_resources, cudaDevResourceTypeSm));
std::cout << "Default value will be equal to " << initial_GPU_SM_resources.sm.smCoscheduledAlignment << std::endl;
int default_split_flags = 0;
cudaDevSmResourceGroupParams group_params_tmp = {.smCount=0, .coscheduledSmCount=0, .preferredCoscheduledSmCount=0, .flags=0};
CUDA_CHECK(cudaDevSmResourceSplit(nullptr, 1, &initial_GPU_SM_resources, nullptr /*remainder*/, default_split_flags, &group_params_tmp));
std::cout << "coscheduledSmCount default value: " << group_params_tmp.coscheduledSmCount << std::endl;
std::cout << "preferredCoscheduledSmCount default value: " << group_params_tmp.preferredCoscheduledSmCount << std::endl;
4) The remainder group, if present, will not have any constraints on its SM count or co-scheduling requirements. It will be up to the user to explore that.
Before providing more detailed information for the various cudaDevSmResourceGroupParams fields, Table 16 shows what these values could be for some example use cases. Assume an initial_GPU_SM_resources device resource has already been populated, as in the previous code snippet, and is the resource that will be split. Every row in the table will have that same starting point. For simplicity the table will only show the nbGroups value and the groupParams fields per use case that can be used in a code snippet like the one below.
int nbGroups = 2; // update as needed
unsigned int default_split_flags = 0;
cudaDevResource remainder {}; // update as needed
cudaDevResource result_use_case[2] = {{}, {}}; // Update depending on number of groups planned. Increase size if you plan to also use a workqueue resource
cudaDevSmResourceGroupParams group_params_use_case[2] = {{.smCount = X, .coscheduledSmCount = 0, .preferredCoscheduledSmCount = 0, .flags = 0},
                                                         {.smCount = Y, .coscheduledSmCount = 0, .preferredCoscheduledSmCount = 0, .flags = 0}};
CUDA_CHECK(cudaDevSmResourceSplit(&result_use_case[0], nbGroups, &initial_GPU_SM_resources, &remainder, default_split_flags, &group_params_use_case[0]));
Table 16: split API use cases (groupParams[i] fields shown per entry i, in ascending order of i)

| # | Goal / Use case | nbGroups | remainder | i | smCount | coscheduledSmCount | preferredCoscheduledSmCount | flags |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Resource with 16 SMs. Do not care for remaining SMs. May use clusters. | 1 | nullptr | 0 | 16 | 0 | 0 | 0 |
| 2a | One resource with 16 SMs and one with everything else. Will not use clusters. (Note: showing two options; in option (2a), the 2nd resource is the remainder.) | 1 | not nullptr | 0 | 16 | 2 | 2 | 0 |
| 2b | Same goal as 2a; in option (2b), the 2nd resource is result_use_case[1]. | 2 | nullptr | 0 | 16 | 2 | 2 | 0 |
| | | | | 1 | 0 | 2 | 2 | cudaDevSmResourceGroupBackfill |
| 3 | Two resources with 28 and 32 SMs respectively. Will use clusters of size 4. | 2 | nullptr | 0 | 28 | 4 | 4 | 0 |
| | | | | 1 | 32 | 4 | 4 | 0 |
| 4 | One resource with as many SMs as possible, which can run clusters of size 8, and one remainder. | 1 | not nullptr | 0 | 0 | 8 | 8 | 0 |
| 5 | One resource with as many SMs as possible, which can run clusters of size 4, and one with 8 SMs. (Note: Order matters! Changing the order of entries in the groupParams array could mean no SMs left for the 8-SM group.) | 2 | nullptr | 0 | 8 | 2 | 2 | 0 |
| | | | | 1 | 0 | 4 | 4 | 0 |
Detailed information about the various cudaDevSmResourceGroupParams struct fields

smCount:
▶ Controls the SM count for the corresponding group in result.
▶ Values: 0 (discovery mode) or valid non-zero value (non-discovery mode)
▶ Valid non-zero smCount value requirements: (multiple of 2) and in [2, input->sm.smCount] and ((flags == 0) ? multiple of actual coscheduledSmCount : greater than or equal to coscheduledSmCount)
▶ Use cases: use discovery mode to explore what is possible when the SM count is not known/fixed; use non-discovery mode to request a specific number of SMs.
▶ Note: in discovery mode, the actual SM count, after a successful split call with a non-nullptr result, will meet the valid non-zero value requirements.
coscheduledSmCount:
▶ Controls the number of SMs grouped together ("co-scheduled") to enable the launch of different clusters on compute capability 9.0+. It can thus impact the number of SMs in a resulting group and the cluster sizes they can support.
▶ Values: 0 (default for current architecture) or valid non-zero value
▶ Valid non-zero value requirements: (multiple of 2) up to max limit
▶ Use cases: Use the default or a manually chosen value for clusters, keeping in mind the max. portable cluster size on a given architecture. If your code does not use clusters, you can use the minimum supported value of 2 or the default value.
▶ Note: when the default value is used, the actual coscheduledSmCount, after a successful split call, will also meet the valid non-zero value requirements. If flags is not zero, the resulting smCount will be >= coscheduledSmCount. Think of coscheduledSmCount as providing some guaranteed underlying "structure" to valid resulting groups (i.e., that group can run at least a single cluster of coscheduledSmCount size in the worst case). This type of structure guarantee does not apply to the remaining group; there it is up to the user to explore what cluster sizes can be launched.
preferredCoscheduledSmCount:
▶ Acts as a hint to the driver to try to merge groups of actual coscheduledSmCount SMs into larger groups of preferredCoscheduledSmCount if possible. Doing so can allow code to make use of the preferred cluster dimensions feature available on devices with compute capability (CC) 10.0 and later. See cudaLaunchAttributeValue::preferredClusterDim.
▶ Values: 0 (default for current architecture) or a valid non-zero value
▶ Valid non-zero value requirements: (multiple of actual coscheduledSmCount)
▶ Use cases: use a manually chosen value greater than 2 if you use preferred clusters and are on a device of compute capability 10.0 (Blackwell) or later. If you don't use clusters, choose the same value as coscheduledSmCount: either select the minimum supported value of 2 or use 0 for both.
▶ Note: when the default value is used, the actual preferredCoscheduledSmCount, after a successful split call, will also meet the valid non-zero value requirement.
flags:
▶ Controls whether the resulting SM count of a group will be a multiple of the actual coscheduled SM count (default) or whether SMs can be backfilled into this group (backfill). In the backfill case, the resulting SM count (result[i].sm.smCount) will be greater than or equal to the specified groupParams[i].smCount.
▶ Values: 0 (default) or cudaDevSmResourceGroupBackfill
▶ Use cases: Use zero (the default), so the resulting group has the guaranteed flexibility of supporting multiple clusters of coscheduledSmCount size. Use the backfill option if you want to get as many SMs as possible in the group, with some of these SMs (the backfilled ones) not providing any coscheduling guarantee.
▶ Note: a group created with the backfill flag can still support clusters (e.g., it is guaranteed to support at least one cluster of coscheduledSmCount size).
4.6.4.3 Step 2 (continued): Add work queue resources

If you also want to specify a work queue resource, then this needs to be done explicitly. The following example shows how to create a work queue configuration resource for a specific device with balanced sharing scope and a concurrency limit of four.
cudaDevResource split_result[2] = {{}, {}};
// code to populate split_result[0] not shown; used split API with nbGroups=1

// The last resource will be a workqueue resource.
split_result[1].type = cudaDevResourceTypeWorkqueueConfig;
split_result[1].wqConfig.device = 0; // assume device ordinal of 0
split_result[1].wqConfig.sharingScope = cudaDevWorkqueueConfigScopeGreenCtxBalanced;
split_result[1].wqConfig.wqConcurrencyLimit = 4;
A work queue concurrency limit of four hints to the driver that the user expects a maximum of four concurrent stream-ordered workloads. The driver will assign work queues trying to respect this hint, if possible.
4.6.4.4 Step 3: Create a Resource Descriptor

The next step, after resources have been split, is to generate a resource descriptor, using the cudaDevResourceGenerateDesc API, for all the resources expected to be available to a green context.
The relevant CUDA runtime API function signature is:

cudaError_t cudaDevResourceGenerateDesc(cudaDevResourceDesc_t *phDesc, cudaDevResource *resources, unsigned int nbResources)

It is possible to combine multiple cudaDevResource resources. For example, the code snippet below shows how to generate a resource descriptor that encapsulates three groups of resources. You just need to ensure that these resources are all allocated contiguously in the resources array.
cudaDevResource actual_split_result[5] = {};
// code to populate actual_split_result not shown

// Generate resource desc. to encapsulate 3 resources: actual_split_result[2] to [4]
cudaDevResourceDesc_t resource_desc;
CUDA_CHECK(cudaDevResourceGenerateDesc(&resource_desc, &actual_split_result[2], 3));
Combining different types of resources is also supported. For example, one could generate a descriptor with both SM and work queue resources.
For a cudaDevResourceGenerateDesc call to be successful:
▶ All nbResources resources should belong to the same GPU device.
▶ If multiple SM-type resources are combined, they should be generated from the same split API call and have the same coscheduledSmCount values (if not part of the remainder).
▶ Only a single workqueue config or workqueue type resource may be present.
4.6.4.5 Step 4: Create a Green Context

The final step is to create a green context from a resource descriptor using the cudaGreenCtxCreate API. That green context will only have access to the resources (e.g., SMs, work queues) encapsulated in the resource descriptor specified during its creation. These resources will be provisioned during this step.
The relevant CUDA runtime API function signature is:

cudaError_t cudaGreenCtxCreate(cudaExecutionContext_t *phCtx, cudaDevResourceDesc_t desc, int device, unsigned int flags)

The flags parameter should be set to 0. It is also recommended to explicitly initialize the device's primary context before creating a green context via either the cudaInitDevice API or the cudaSetDevice API, which also sets the primary context as current to the calling thread. Doing so ensures there will be no additional primary context initialization overhead during green context creation.
See the code snippet below.
int current_device = 0; // assume single GPU
CUDA_CHECK(cudaSetDevice(current_device)); // Or cudaInitDevice

cudaDevResourceDesc_t resource_desc {};
// Code to generate resource_desc not shown

// Create a green_ctx on GPU with current_device ID with access to resources from resource_desc
cudaExecutionContext_t green_ctx {};
CUDA_CHECK(cudaGreenCtxCreate(&green_ctx, resource_desc, current_device, 0));
After a successful green context creation, the user can verify its resources by calling cudaExecutionCtxGetDevResource on that execution context for each resource type.

Creating Multiple Green Contexts

An application can have more than one green context, in which case some of the steps above should be repeated. For most use cases, these green contexts will each have a separate, non-overlapping set of provisioned SMs. For example, for the case of five homogeneous cudaDevResource groups (the actual_split_result array), one green context's descriptor may encapsulate actual_split_result[2] to [4] resources, while the descriptor of another green context may encapsulate actual_split_result[0] to [1]. In this case, a specific SM will be provisioned for only one of the two green contexts of the application.

But SM oversubscription is also possible and may be used in some cases. For example, it may be acceptable to have the second green context's descriptor encapsulate actual_split_result[0] to [2]. In this case, all the SMs of the actual_split_result[2] cudaDevResource will be oversubscribed, i.e., provisioned for both green contexts, while resources actual_split_result[0] to [1] and actual_split_result[3] to [4] may only be used by one of the two green contexts. SM oversubscription should be used judiciously, on a per-case basis.
| 4.6.5. Green Contexts - Launching work | |
To launch a kernel targeting a green context created using the prior steps, you first need to create a stream for that green context with the cudaExecutionCtxStreamCreate API. Launching a kernel on that stream using <<< >>> or the cudaLaunchKernel API will ensure that the kernel can only use the resources (SMs, work queues) available to that stream via its execution context. For example:
// Create green_ctx_stream CUDA stream for previously created green_ctx green context
cudaStream_t green_ctx_stream;
int priority = 0;
CUDA_CHECK(cudaExecutionCtxStreamCreate(&green_ctx_stream,
                                        green_ctx,
                                        cudaStreamDefault,
                                        priority));

// Kernel my_kernel will only use the resources (SMs, work queues, as applicable)
// available to green_ctx_stream's execution context
my_kernel<<<grid_dim, block_dim, 0, green_ctx_stream>>>();
CUDA_CHECK(cudaGetLastError());
The default stream creation flag passed to the stream creation API above is equivalent to cudaStreamNonBlocking, given green_ctx is a green context.

CUDA graphs

For kernels launched as part of a CUDA graph (see CUDA Graphs), there are a few more subtleties. Unlike kernels, the CUDA stream a CUDA graph is launched on does not determine the SM resources used, as that stream is solely used for dependency tracking.

The execution context a kernel node (and other applicable node types) will execute on is set during node creation. If the CUDA graph will be created using stream capture, then the execution context(s) of the stream(s) involved in the capture will determine the execution context(s) of the relevant graph nodes. If the graph will be created using the graph APIs, then the user should explicitly set the execution context for each relevant node. For example, to add a kernel node, the user should use the polymorphic cudaGraphAddNode API with the cudaGraphNodeTypeKernel type and explicitly specify the .ctx field of the cudaKernelNodeParamsV2 struct under .kernel. The cudaGraphAddKernelNode API does not allow the user to specify an execution context and should thus be avoided. Please note that it is possible for different graph nodes in a graph to belong to different execution contexts.

For verification purposes, one could use Nsight Systems in node tracing mode (--cuda-graph-trace node) to observe the green context(s) specific graph nodes will execute on. Note that in the default graph tracing mode, the entire graph will appear under the green context of the stream it was launched on, but, as previously explained, this does not provide any information about the execution context(s) of the various graph nodes.

To verify programmatically, one could potentially use the CUDA driver API cuGraphKernelNodeGetParams(graph_node, &node_params) and compare the node_params.ctx context handle field with the expected context handle for that graph node. Using the driver API is possible given CUgraphNode and cudaGraphNode_t can be used interchangeably, but the user would need to include the relevant cuda.h header and link with the driver directly (-lcuda).
Thread Block Clusters

Kernels with thread block clusters (see Section 1.2.2.1.1) can also be launched on a green context stream, like any other kernel, and thus use that green context's provisioned resources. Section 4.6.4.2 showed how to specify the number of SMs that need to be coscheduled when a device resource is split, to facilitate clusters. But as with any kernel using clusters, the user should make use of the relevant occupancy APIs to determine the max potential cluster size for a kernel (via cudaOccupancyMaxPotentialClusterSize) and, if needed, the maximum number of active clusters (cudaOccupancyMaxActiveClusters). If the user specifies a green context stream as the stream field of the relevant cudaLaunchConfig, then these occupancy APIs will take into consideration the SM resources provisioned for that green context. This use case is especially relevant for libraries that may get a green context CUDA stream passed to them by the user, as well as in cases where the green context was created from a remaining device resource.

The code snippet below shows how these APIs can be used.
// Assume cudaStream_t gc_stream has already been created and a __global__
// void cluster_kernel exists.

// Uncomment to support non portable cluster size, if possible
// CUDA_CHECK(cudaFuncSetAttribute(cluster_kernel,
//                                 cudaFuncAttributeNonPortableClusterSizeAllowed, 1))

cudaLaunchConfig_t config = {0};
config.gridDim = grid_dim; // has to be a multiple of cluster dim.
config.blockDim = block_dim;
config.dynamicSmemBytes = expected_dynamic_shared_mem;

cudaLaunchAttribute attribute[1];
attribute[0].id = cudaLaunchAttributeClusterDimension;
attribute[0].val.clusterDim.x = 1;
attribute[0].val.clusterDim.y = 1;
attribute[0].val.clusterDim.z = 1;
config.attrs = attribute;
config.numAttrs = 1;
config.stream = gc_stream; // Need to pass the CUDA stream that will be used for that kernel

int max_potential_cluster_size = 0;
// the next call will ignore cluster dims in launch config
CUDA_CHECK(cudaOccupancyMaxPotentialClusterSize(&max_potential_cluster_size,
                                                cluster_kernel, &config));
std::cout << "max potential cluster size is " << max_potential_cluster_size
          << " for CUDA stream gc_stream" << std::endl;

// Could choose to update launch config's clusterDim with max_potential_cluster_size.
// Doing so would result in a successful cudaLaunchKernelEx call for the same
// kernel and launch config.
int num_clusters = 0;
CUDA_CHECK(cudaOccupancyMaxActiveClusters(&num_clusters, cluster_kernel, &config));
std::cout << "Potential max. active clusters count is " << num_clusters << std::endl;
Verify Green Contexts Use

Beyond empirical observations of affected kernel execution times due to green context provisioning, the user can leverage the Nsight Systems or Nsight Compute CUDA developer tools to verify, to some extent, correct green contexts use.

For example, kernels launched on CUDA streams belonging to different green contexts will appear under different Green Context rows under the CUDA HW timeline section of an Nsight Systems report.

Nsight Compute provides a Green Context Resources overview in its Session page as well as an updated # SMs under the Launch Statistics of the Details section. The former provides a visual bitmask of provisioned resources. This is particularly useful if an application uses different green contexts, as the user can confirm expected overlap across GCs (no overlap, or expected non-zero overlap if SMs are oversubscribed).

Figure 45 depicts these resources for an example with two green contexts provisioned with 112 and 16 SMs respectively, with no SM overlap across them. The provided view can help the user verify the provisioned SM resource count per green context. It also helps confirm that no SMs were oversubscribed, as no box is marked green (provisioned for that GC) across both green contexts.

Figure 45: Green contexts resources section from Nsight Compute

The Launch Statistics section also explicitly lists the number of SMs provisioned for this green context, which can thus be used by this kernel. Please note that these are the SMs a given kernel can have access to during its execution, and not the actual number of SMs that kernel ran on. The same applies to the resources overview shown earlier. The actual number of SMs used by the kernel can depend on various factors, including the kernel itself (launch geometry, etc.), other work running at the same time on the GPU, etc.
4.6.6. Additional Execution Contexts APIs

This section touches upon some additional green context APIs. For a complete list, please refer to the relevant CUDA runtime API section.

For synchronization using CUDA events, one can leverage the cudaError_t cudaExecutionCtxRecordEvent(cudaExecutionContext_t ctx, cudaEvent_t event) and cudaError_t cudaExecutionCtxWaitEvent(cudaExecutionContext_t ctx, cudaEvent_t event) APIs. cudaExecutionCtxRecordEvent records a CUDA event capturing all work/activities of the specified execution context at the time of this call, while cudaExecutionCtxWaitEvent makes all future work submitted to the execution context wait for the work captured in the specified event.

Using cudaExecutionCtxRecordEvent is more convenient than cudaEventRecord if the execution context has multiple CUDA streams. To achieve equivalent behavior without this execution context API, one would need to record a separate CUDA event via cudaEventRecord on every execution context stream and then have dependent work wait separately for all these events. Similarly, cudaExecutionCtxWaitEvent is more convenient than cudaStreamWaitEvent if one needs all execution context streams to wait for an event to complete. The alternative would be a separate cudaStreamWaitEvent for every stream in this execution context.

For blocking synchronization on the CPU side, one can use cudaError_t cudaExecutionCtxSynchronize(cudaExecutionContext_t ctx). This call will block until the specified execution context has completed all its work. If the specified execution context was not created via cudaGreenCtxCreate, but was rather obtained via cudaDeviceGetExecutionCtx, and is thus the device's primary context, calling that function will also synchronize all green contexts that have been created on the same device.

To retrieve the device a given execution context is associated with, one can use cudaExecutionCtxGetDevice. To retrieve the unique identifier of a given execution context, one can use cudaExecutionCtxGetId.

Finally, an explicitly created execution context can be destroyed via the cudaError_t cudaExecutionCtxDestroy(cudaExecutionContext_t ctx) API.
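As a sketch of how these event APIs compose, assume two green contexts gc_a and gc_b (hypothetical names) were created as in Section 4.6.4.5, and all future work on gc_b must wait for everything already submitted to gc_a:

```cpp
// Sketch only: gc_a and gc_b are hypothetical, previously created green
// contexts; CUDA_CHECK is an error-checking macro as in earlier snippets.
cudaEvent_t evt;
CUDA_CHECK(cudaEventCreateWithFlags(&evt, cudaEventDisableTiming));

// Capture all work submitted so far to gc_a, across all of its streams...
CUDA_CHECK(cudaExecutionCtxRecordEvent(gc_a, evt));
// ...and make all future work submitted to gc_b wait for it.
CUDA_CHECK(cudaExecutionCtxWaitEvent(gc_b, evt));

// Later, block the CPU until gc_b has drained all of its work.
CUDA_CHECK(cudaExecutionCtxSynchronize(gc_b));
CUDA_CHECK(cudaEventDestroy(evt));
```

Without the execution context APIs, the first two calls would need one cudaEventRecord and one cudaStreamWaitEvent per stream in each context.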
4.6.7. Green Contexts Example

This section illustrates how green contexts can enable critical work to start and complete sooner. Similar to the scenario used in Section 4.6.1, the application has two kernels that will run on two different non-blocking CUDA streams. The timeline, from the CPU side, is as follows. A long running kernel (delay_kernel_us), which takes multiple waves on the full GPU, is launched first on CUDA stream strm1. Then, after a brief wait time (less than the kernel duration), a shorter but critical kernel (critical_kernel) is launched on stream strm2. The GPU durations and the time from CPU launch to completion for both kernels are measured.

As a proxy for a long running kernel, a delay kernel is used where every thread block runs for a fixed number of microseconds and the number of thread blocks exceeds the GPU's available SMs.

Initially, no green contexts are used, but the critical kernel is launched on a CUDA stream with a higher priority than the long running kernel. Because of its high priority stream, the critical kernel can start executing as soon as some of the thread blocks of the long running kernel complete. However, it will still need to wait for some potentially long-running thread blocks to complete, which will delay its execution start.

Figure 46 shows this scenario in an Nsight Systems report. The long running kernel is launched on stream 13, while the short but critical kernel is launched on stream 14, which has higher stream priority. As highlighted on the image, the critical kernel waits for 0.9 ms (in this case) before it can start executing. If the two streams had identical priorities, the critical kernel would execute much later.

Figure 46: Nsight Systems timeline without green contexts

To leverage the green contexts feature, two green contexts are created, each provisioned with a distinct, non-overlapping set of SMs. The exact SM split in this case, for an H100 with 132 SMs, was chosen for illustration purposes as 16 SMs for the critical kernel (Green Context 3) and 112 SMs for the long running kernel (Green Context 2). As Figure 47 shows, the critical kernel can now start almost instantaneously, as there are SMs only Green Context 3 can use.

The duration of the short kernel may increase, compared to its duration when running in isolation, as there is now a limit on the number of SMs it can use. The same is also the case for the long running kernel, which can no longer use all the SMs of the GPU, but is constrained by its green context's provisioned resources. However, the key result is that the critical kernel work can now start and complete significantly sooner than before. That is barring any other limitations, as parallel execution, as mentioned earlier, cannot be guaranteed.

Figure 47: Nsight Systems timeline with green contexts

In all cases, the exact SM split should be decided on a per-case basis after experimentation.
4.7. Lazy Loading

4.7.1. Introduction

Lazy loading reduces program initialization time by waiting to load CUDA modules until they are needed. Lazy loading is particularly effective for programs that only use a small number of the kernels they include, as is common when using libraries. Lazy loading is designed to be invisible to the user when the CUDA programming model is followed. Potential Hazards explains this in detail. As of CUDA 12.3, lazy loading is enabled by default on all platforms, but can be controlled via the CUDA_MODULE_LOADING environment variable.
4.7.2. Change History

Table 17: Select Lazy Loading Changes by CUDA Version

| CUDA Version | Change |
| ------------ | ------ |
| 12.3 | Lazy loading performance improved. Now enabled by default for Windows. |
| 12.2 | Lazy loading enabled by default for Linux. |
| 11.7 | Lazy loading first introduced, disabled by default. |
4.7.3. Requirements for Lazy Loading

Lazy loading is a joint feature of both the CUDA runtime and driver. Lazy loading is only available when the runtime and driver version requirements are satisfied.
4.7.3.1 CUDA Runtime Version Requirement

Lazy loading is available starting in CUDA runtime version 11.7. As the CUDA runtime is usually linked statically into programs and libraries, only programs and libraries from, or compiled with, the CUDA 11.7+ toolkit will benefit from lazy loading. Libraries compiled using older CUDA runtime versions will load all modules eagerly.

4.7.3.2 CUDA Driver Version Requirement

Lazy loading requires driver version 515 or newer. Lazy loading is not available for driver versions older than 515, even when using CUDA toolkit 11.7 or newer.

4.7.3.3 Compiler Requirements

Lazy loading does not require any compiler support. Both SASS and PTX compiled with pre-11.7 compilers can be loaded with lazy loading enabled, and will see full benefits of the feature. However, the version 11.7+ CUDA runtime is still required, as described above.

4.7.3.4 Kernel Requirements

Lazy loading does not affect modules containing managed variables, which will still be loaded eagerly.
4.7.4. Usage

4.7.4.1 Enabling & Disabling

Lazy loading is enabled by setting the CUDA_MODULE_LOADING environment variable to LAZY. Lazy loading can be disabled by setting the CUDA_MODULE_LOADING environment variable to EAGER. As of CUDA 12.3, lazy loading is enabled by default on all platforms.

4.7.4.2 Checking if Lazy Loading is Enabled at Runtime

The cuModuleGetLoadingMode API in the CUDA driver API can be used to determine if lazy loading is enabled. Note that CUDA must be initialized before running this function. Sample usage is shown in the snippet below.
#include <cuda.h>
#include <assert.h>
#include <iostream>
| int main() { | |
| CUmoduleLoadingMode mode; | |
| assert(CUDA_SUCCESS == cuInit(0)); | |
| assert(CUDA_SUCCESS == cuModuleGetLoadingMode(&mode)); | |
    std::cout << "CUDA Module Loading Mode is "
              << ((mode == CU_MODULE_LAZY_LOADING) ? "lazy" : "eager") << std::endl;
| return 0; | |
| } | |
4.7.4.3 Forcing a Module to Load Eagerly at Runtime

Loading kernels and variables happens automatically, without any need for explicit loading. Kernels can be loaded explicitly, even without executing them, by doing the following:
▶ The cuModuleGetFunction() function will cause a module to be loaded into device memory
▶ The cudaFuncGetAttributes() function will cause a kernel to be loaded into device memory

Note
cuModuleLoad() does not guarantee that a module will be loaded immediately.
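As a sketch of the runtime-API variant, a program could force a kernel's module to load at a chosen point before its first launch (my_kernel, grid_dim, and block_dim are hypothetical names; error handling abbreviated):

```cpp
// Sketch: force the module containing my_kernel (a hypothetical __global__
// function defined elsewhere) to load eagerly, so the first launch does not
// pay the module-load cost inside a latency-sensitive or measured region.
cudaFuncAttributes attrs;
cudaError_t err = cudaFuncGetAttributes(&attrs, my_kernel); // triggers the load
if (err == cudaSuccess) {
    // The module is now resident; this launch skips lazy-loading work.
    my_kernel<<<grid_dim, block_dim>>>();
}
```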
4.7.5. Potential Hazards

Lazy loading is designed so that it should not require any modifications to applications to use it. That said, there are some caveats, especially when applications are not fully compliant with the CUDA programming model, as described below.

4.7.5.1 Impact on Concurrent Kernel Execution

Some programs incorrectly assume that concurrent kernel execution is guaranteed. A deadlock can occur if cross-kernel synchronization is required but kernel execution has been serialized. To minimize the impact of lazy loading on concurrent kernel execution, do the following:
▶ preload all kernels that you hope to execute concurrently prior to launching them, or
▶ run the application with CUDA_MODULE_LOADING=EAGER to force loading data eagerly without forcing each function to load eagerly

4.7.5.2 Large Memory Allocations

Lazy loading delays memory allocation for CUDA modules from program initialization until closer to execution time. If an application allocates the entire VRAM on startup, CUDA can fail to allocate memory for modules at runtime. Possible solutions:
▶ use cudaMallocAsync() instead of an allocator that allocates the entire VRAM on startup
▶ add some buffer to compensate for the delayed loading of kernels
▶ preload all kernels that will be used in the program before trying to initialize the allocator

4.7.5.3 Impact on Performance Measurements

Lazy loading may skew performance measurements by moving CUDA module initialization into the measured execution window. To avoid this:
▶ do at least one warmup iteration prior to measurement
▶ preload the benchmarked kernel prior to launching it
4.8. Error Log Management

The Error Log Management mechanism allows for CUDA API errors to be reported to developers in a plain-English format that describes the cause of the issue.
4.8.1. Background

Traditionally, the only indication of a failed CUDA API call is the return of a non-zero code. As of CUDA Toolkit 12.9, the CUDA Runtime defines over 100 different return codes for error conditions, but many of them are generic and give the developer no assistance with debugging the cause.

4.8.2. Activation

Set the CUDA_LOG_FILE environment variable. Acceptable values are stdout, stderr, or a valid path on the system to write a file. The log buffer can be dumped via API even if CUDA_LOG_FILE was not set before program execution. NOTE: An error-free execution may not print any logs.
4.8.3. Output

Logs are output in the following format:

[Time][TID][Source][Severity][API Entry Point] Message

The following line is an actual error message that is generated if the developer tries to dump the Error Log Management logs to an unallocated buffer:

[22:21:32.099][25642][CUDA][E][cuLogsDumpToMemory] buffer cannot be NULL

Where before, all the developer would have gotten is CUDA_ERROR_INVALID_VALUE in the return code and possibly "invalid argument" if cuGetErrorString is called.
4.8.4. API Description

The CUDA Driver provides APIs in two categories for interacting with the Error Log Management feature.

This feature allows developers to register callback functions to be used whenever an error log is generated, where the callback signature is:

void callbackFunc(void *data, CUlogLevel logLevel, char *message, size_t length)

Callbacks are registered with this API:

CUresult cuLogsRegisterCallback(CUlogsCallback callbackFunc, void *userData, CUlogsCallbackHandle *callback_out)

Where userData is passed to the callback function without modifications. callback_out should be stored by the caller for use in cuLogsUnregisterCallback.

CUresult cuLogsUnregisterCallback(CUlogsCallbackHandle callback)

The other set of API functions is for managing the output of logs. An important concept is the log iterator, which points to the current end of the buffer:

CUresult cuLogsCurrent(CUlogIterator *iterator_out, unsigned int flags)
| Theiteratorpositioncanbekeptbythecallingsoftwareinsituationswhereadumpoftheentirelog | |
| buffer is not desired. Currently, the flags parameter must be 0, with additional options reserved for | |
| futureCUDAreleases. | |
| Atanytime,theerrorlogbuffercanbedumpedtoeitherafileormemorywiththesefunctions: | |
| CUresult cuLogsDumpToFile(CUlogIterator *iterator, const char *pathToFile, | |
| ,→unsigned int flags) | |
| CUresult cuLogsDumpToMemory(CUlogIterator *iterator, char *buffer, size_t | |
| ,→*size, unsigned int flags) | |
If iterator is NULL, the entire buffer will be dumped, up to the maximum of 100 entries. If iterator is not NULL, logs will be dumped starting from that entry and the value of iterator will be updated to the current end of the logs, as if cuLogsCurrent had been called. If there have been more than 100 log entries written to the buffer, a note will be added at the start of the dump noting this.
The flags parameter must be 0, with additional options reserved for future CUDA releases.
The cuLogsDumpToMemory function has additional considerations:
1. The buffer itself will be null-terminated, but each individual log entry will only be separated by a newline (\n) character.
2. The maximum size of the buffer is 25600 bytes.
3. If the value provided in size is not sufficient to store all desired logs, a note will be added as the first entry and the oldest entries that do not fit will not be dumped.
4. After returning, size will contain the actual number of bytes written to the provided buffer.
4.8.5. Limitations and Known Issues
1. The log buffer is limited to 100 entries. After this limit is reached, the oldest entries will be replaced and log dumps will contain a line noting the rollover.
2. Not all CUDA APIs are covered yet. This is an ongoing project to provide better usage error reporting for all APIs.
3. The Error Log Management log location (if given) will not be tested for validity until/unless a log is generated.
4. The Error Log Management APIs are currently only available via the CUDA Driver. Equivalent APIs will be added to the CUDA Runtime in a future release.
5. The log messages are not localized to any language; all provided logs are in US English.
4.9. Asynchronous Barriers
Asynchronous barriers, introduced in Advanced Synchronization Primitives, extend CUDA synchronization beyond __syncthreads() and __syncwarp(), enabling fine-grained, non-blocking coordination and better overlap of communication and computation.
This section provides details on how to use asynchronous barriers, mainly via the cuda::barrier API (with pointers to cuda::ptx and primitives where applicable).
256 Chapter 4. CUDA Features
CUDA Programming Guide, Release 13.1
4.9.1. Initialization
Initialization must happen before any thread begins participating in a barrier.
| CUDAC++cuda::barrier | |
| | #include <cuda∕barrier> | | | | | |
| | ------------------------------- | ------------------- | --- | --- | | |
| | #include <cooperative_groups.h> | | | | | |
| | __global__ | void init_barrier() | | | | |
| { | |
| | __shared__ | cuda::barrier<cuda::thread_scope_block> | | bar; | | |
| | ----------------------- | ------------------------------------------ | ----- | ---- | | |
| | auto block | = cooperative_groups::this_thread_block(); | | | | |
| | if (block.thread_rank() | | == 0) | | | |
| { | |
| ∕∕ A single thread initializes the total expected arrival count. | |
| | init(&bar, | block.size()); | | | | |
| | ---------- | -------------- | --- | --- | | |
| } | |
| block.sync(); | |
| } | |
CUDA C++ cuda::ptx

#include <cuda/ptx>
#include <cooperative_groups.h>

__global__ void init_barrier()
{
    __shared__ uint64_t bar;
    auto block = cooperative_groups::this_thread_block();
    if (block.thread_rank() == 0)
    {
        // A single thread initializes the total expected arrival count.
        cuda::ptx::mbarrier_init(&bar, block.size());
    }
    block.sync();
}
CUDA C primitives

#include <cuda_awbarrier_primitives.h>
#include <cooperative_groups.h>

__global__ void init_barrier()
{
    __shared__ uint64_t bar;
    auto block = cooperative_groups::this_thread_block();
    if (block.thread_rank() == 0)
    {
        // A single thread initializes the total expected arrival count.
        __mbarrier_init(&bar, block.size());
    }
    block.sync();
}
Before any thread can participate in a barrier, the barrier must be initialized using the cuda::barrier::init() friend function. This must happen before any thread arrives on the barrier. This poses a bootstrapping challenge in that threads must synchronize before participating in the barrier, but threads are creating a barrier in order to synchronize. In this example, threads that will participate are part of a cooperative group and use block.sync() to bootstrap initialization. Since a whole thread block is participating in the barrier, __syncthreads() could also be used.
The second parameter of init() is the expected arrival count, i.e., the number of times bar.arrive() will be called by participating threads before a participating thread is unblocked from its call to bar.wait(std::move(token)). In this and the previous examples, the barrier is initialized with the number of threads in the thread block, i.e., cooperative_groups::this_thread_block().size(), so that all threads within the thread block can participate in the barrier.
Asynchronous barriers are flexible in specifying how threads participate (split arrive/wait) and which threads participate. In contrast, this_thread_block.sync() or __syncthreads() is applicable to the whole thread block, and __syncwarp(mask) to a specified subset of a warp. Nonetheless, if the intention of the user is to synchronize a full thread block or a full warp, we recommend using __syncthreads() and __syncwarp() respectively for better performance.
| | 4.9.2. | A Barrier’s | Phase: | Arrival, | Countdown, | Completion, | | | |
| | ------ | ----------- | ------ | -------- | ---------- | ----------- | --- | | |
| and Reset | |
An asynchronous barrier counts down from the expected arrival count to zero as participating threads call bar.arrive(). When the countdown reaches zero, the barrier is complete for the current phase. When the last call to bar.arrive() causes the countdown to reach zero, the countdown is automatically and atomically reset. The reset assigns the countdown to the expected arrival count, and moves the barrier to the next phase.
A token object of class cuda::barrier::arrival_token, as returned from token = bar.arrive(), is associated with the current phase of the barrier. A call to bar.wait(std::move(token)) blocks the calling thread while the barrier is in the current phase, i.e., while the phase associated with the token matches the phase of the barrier. If the phase is advanced (because the countdown reaches zero) before the call to bar.wait(std::move(token)), then the thread does not block; if the phase is advanced while the thread is blocked in bar.wait(std::move(token)), the thread is unblocked.
It is essential to know when a reset could or could not occur, especially in non-trivial arrive/wait synchronization patterns.
▶ A thread's calls to token = bar.arrive() and bar.wait(std::move(token)) must be sequenced such that token = bar.arrive() occurs during the barrier's current phase, and bar.wait(std::move(token)) occurs during the same or next phase.
▶ A thread's call to bar.arrive() must occur when the barrier's counter is non-zero. After barrier initialization, if a thread's call to bar.arrive() causes the countdown to reach zero, then a call to bar.wait(std::move(token)) must happen before the barrier can be reused for a subsequent call to bar.arrive().
▶ bar.wait() must only be called using a token object of the current phase or the immediately preceding phase. For any other values of the token object, the behavior is undefined.
For simple arrive/wait synchronization patterns, compliance with these usage rules is straightforward.
4.9.2.1 Warp Entanglement
Warp divergence affects the number of times an arrive-on operation updates the barrier. If the invoking warp is fully converged, then the barrier is updated once. If the invoking warp is fully diverged, then 32 individual updates are applied to the barrier.

Note
It is recommended that arrive-on(bar) invocations are used by converged threads to minimize updates to the barrier object. When code preceding these operations diverges threads, the warp should be re-converged, via __syncwarp(), before invoking arrive-on operations.
4.9.3. Explicit Phase Tracking
An asynchronous barrier can have multiple phases depending on how many times it is used to synchronize threads and memory operations. Instead of using tokens to track barrier phase flips, we can directly track a phase using the mbarrier_try_wait_parity() family of functions available through the cuda::ptx and primitives APIs.
In its simplest form, the cuda::ptx::mbarrier_try_wait_parity(uint64_t* bar, const uint32_t& phaseParity) function waits for a phase with a particular parity. The phaseParity operand is the integer parity of either the current phase or the immediately preceding phase of the barrier object. An even phase has integer parity 0 and an odd phase has integer parity 1. When we initialize a barrier, its phase has parity 0, so the valid values of phaseParity are 0 and 1. Explicit phase tracking can be useful when tracking asynchronous memory operations, as it allows only a single thread to arrive on the barrier and set the transaction count, while other threads only wait for a parity-based phase flip. This can be more efficient than having all threads arrive on the barrier and use tokens. This functionality is only available for shared-memory barriers at thread-block and cluster scope.
| CUDAC++cuda::barrier | |
| | #include | | <cuda∕ptx> | | | | | | | | | |
| | ---------- | --- | ---------------------- | ------------- | --- | ------ | --------------- | --- | --- | --- | | |
| | #include | | <cooperative_groups.h> | | | | | | | | | |
| | __device__ | | void | compute(float | | *data, | int iteration); | | | | | |
| __global__ void split_arrive_wait(int iteration_count, float *data) | |
| { | |
| | | using | barrier_t | = | cuda::barrier<cuda::thread_scope_block>; | | | | | | | |
| | --- | ----------------------- | --------- | ---------------------------------------- | ---------------------------------------- | ------------ | ----- | --- | --- | --- | | |
| | | __shared__ | barrier_t | | bar; | | | | | | | |
| | | int parity | = | 0; ∕∕ | Initial | phase parity | is 0. | | | | | |
| | | auto block | = | cooperative_groups::this_thread_block(); | | | | | | | | |
| | | if (block.thread_rank() | | | | == 0) | | | | | | |
| { | |
| | | ∕∕ Initialize | | barrier | | with expected | arrival | count. | | | | |
| | --- | ------------- | --- | -------------- | --- | ------------- | ------- | ------ | --- | --- | | |
| | | init(&bar, | | block.size()); | | | | | | | | |
| } | |
| block.sync(); | |
| | | for (int | i = | 0; i | < iteration_count; | | ++i) | | | | | |
| | --- | -------- | --- | ---- | ------------------ | --- | ---- | --- | --- | --- | | |
| { | |
| | | ∕* code | before | arrive | | *∕ | | | | | | |
| | --- | ------- | ------ | -------- | --- | ------------ | --------- | --------- | --- | --- | | |
| | | ∕∕ This | thread | arrives. | | Arrival does | not block | a thread. | | | | |
| ∕∕ Get a handle to the native barrier to use with cuda::ptx API. | |
| (void)cuda::ptx::mbarrier_arrive(cuda::device::barrier_native_ | |
| ,→handle(bar)); | |
| | | compute(data, | | i); | | | | | | | | |
| | --- | ------------- | --- | --- | --- | --- | --- | --- | --- | --- | | |
| ∕∕ Wait for all threads participating in the barrier to complete mbarrier_ | |
| ,→arrive(). | |
| ∕∕ Get a handle to the native barrier to use with cuda::ptx API. | |
| while (!cuda::ptx::mbarrier_try_wait_parity(cuda::device::barrier_ | |
| | | ,→native_handle(bar), | | | parity)) | {} | | | | | | |
| | --- | --------------------- | ------- | ---- | -------- | --- | --- | --- | --- | --- | | |
| | | ∕∕ Flip | parity. | | | | | | | | | |
| | | parity | ^= 1; | | | | | | | | | |
| | | ∕* code | after | wait | *∕ | | | | | | | |
| } | |
| } | |
| | 260 | | | | | | | | Chapter4. | CUDAFeatures | | |
| | --- | --- | --- | --- | --- | --- | --- | --- | --------- | ------------ | | |
| CUDAProgrammingGuide,Release13.1 | |
CUDA C++ cuda::ptx

#include <cuda/ptx>
#include <cooperative_groups.h>

__device__ void compute(float *data, int iteration);

__global__ void split_arrive_wait(int iteration_count, float *data)
{
    __shared__ uint64_t bar;
    int parity = 0; // Initial phase parity is 0.
    auto block = cooperative_groups::this_thread_block();
    if (block.thread_rank() == 0)
    {
        // Initialize barrier with expected arrival count.
        cuda::ptx::mbarrier_init(&bar, block.size());
    }
    block.sync();
    for (int i = 0; i < iteration_count; ++i)
    {
        /* code before arrive */
        // This thread arrives. Arrival does not block a thread.
        (void)cuda::ptx::mbarrier_arrive(&bar);
        compute(data, i);
        // Wait for all threads participating in the barrier to complete mbarrier_arrive().
        while (!cuda::ptx::mbarrier_try_wait_parity(&bar, parity)) {}
        // Flip parity.
        parity ^= 1;
        /* code after wait */
    }
}
CUDA C primitives

#include <cuda_awbarrier_primitives.h>
#include <cooperative_groups.h>

__device__ void compute(float *data, int iteration);

__global__ void split_arrive_wait(int iteration_count, float *data)
{
    __shared__ __mbarrier_t bar;
    bool parity = false; // Initial phase parity is false.
    auto block = cooperative_groups::this_thread_block();
    if (block.thread_rank() == 0)
    {
        // Initialize barrier with expected arrival count.
        __mbarrier_init(&bar, block.size());
    }
    block.sync();
    for (int i = 0; i < iteration_count; ++i)
    {
        /* code before arrive */
        // This thread arrives. Arrival does not block a thread.
        (void)__mbarrier_arrive(&bar);
        compute(data, i);
        // Wait for all threads participating in the barrier to complete __mbarrier_arrive().
        while (!__mbarrier_try_wait_parity(&bar, parity, 1000)) {}
        parity ^= 1;
        /* code after wait */
    }
}
| | 4.9.4. | | Early | Exit | | | | | | | | | | |
| | ------ | --- | ----- | ---- | --- | --- | --- | --- | --- | --- | --- | --- | | |
| When a thread that is participating in a sequence of synchronizations must exit early from that se- | |
| quence,thatthreadmustexplicitlydropoutofparticipationbeforeexiting. Theremainingparticipat- | |
| ingthreadscanproceednormallywithsubsequentarriveandwaitoperations. | |
| | 262 | | | | | | | | | | Chapter4. | CUDAFeatures | | |
| | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --------- | ------------ | | |
| CUDAProgrammingGuide,Release13.1 | |
CUDA C++ cuda::barrier

#include <cuda/barrier>
#include <cooperative_groups.h>

__device__ bool condition_check();

__global__ void early_exit_kernel(int N)
{
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;
    auto block = cooperative_groups::this_thread_block();
    if (block.thread_rank() == 0)
    {
        init(&bar, block.size());
    }
    block.sync();
    for (int i = 0; i < N; ++i)
    {
        if (condition_check())
        {
            bar.arrive_and_drop();
            return;
        }
        // Other threads can proceed normally.
        auto token = bar.arrive();
        /* code between arrive and wait */
        // Wait for all threads to arrive.
        bar.wait(std::move(token));
        /* code after wait */
    }
}
CUDA C primitives

#include <cuda_awbarrier_primitives.h>
#include <cooperative_groups.h>

__device__ bool condition_check();

__global__ void early_exit_kernel(int N)
{
    __shared__ __mbarrier_t bar;
    auto block = cooperative_groups::this_thread_block();
    if (block.thread_rank() == 0)
    {
        __mbarrier_init(&bar, block.size());
    }
    block.sync();
    for (int i = 0; i < N; ++i)
    {
        if (condition_check())
        {
            __mbarrier_token_t token = __mbarrier_arrive_and_drop(&bar);
            return;
        }
        // Other threads can proceed normally.
        __mbarrier_token_t token = __mbarrier_arrive(&bar);
        /* code between arrive and wait */
        // Wait for all threads to arrive.
        while (!__mbarrier_try_wait(&bar, token, 1000)) {}
        /* code after wait */
    }
}
The bar.arrive_and_drop() operation arrives on the barrier to fulfill the participating thread's obligation to arrive in the current phase, and then decrements the expected arrival count for the next phase so that this thread is no longer expected to arrive on the barrier.
| | 4.9.5. | | Completion | | | Function | | | | | | |
| | ------ | --- | ---------- | --- | --- | -------- | --- | --- | --- | --- | | |
| The cuda::barrier API supports an optional completion function. A CompletionFunction of | |
| cuda::barrier<Scope, CompletionFunction>isexecutedonceperphase,afterthelastthread | |
| arrives and before any thread is unblocked from the wait. Memory operations performed by the | |
| threadsthatarrivedatthebarrierduringthephasearevisibletothethreadexecutingtheComple- | |
| tionFunction, and all memory operations performed within the CompletionFunction are visible | |
| toallthreadswaitingatthebarrieroncetheyareunblockedfromthewait. | |
| | 264 | | | | | | | | Chapter4. | CUDAFeatures | | |
| | --- | --- | --- | --- | --- | --- | --- | --- | --------- | ------------ | | |
| CUDAProgrammingGuide,Release13.1 | |
| 4.9. AsynchronousBarriers 265 | |
| CUDAProgrammingGuide,Release13.1 | |
| CUDAC++cuda::barrier | |
| | #include | | <cuda∕barrier> | | | | | | | | | | | | |
| | ---------- | --- | ---------------------- | --------------------------- | ------ | --- | --- | ------ | -------- | --- | --- | --- | --- | | |
| | #include | | <cooperative_groups.h> | | | | | | | | | | | | |
| | #include | | <functional> | | | | | | | | | | | | |
| | namespace | | cg = | cooperative_groups; | | | | | | | | | | | |
| | __device__ | | int | divergent_compute(int | | | | *, | int); | | | | | | |
| | __device__ | | int | independent_computation(int | | | | | *, int); | | | | | | |
| | __global__ | | void | psum(int | *data, | | int | n, int | *acc) | | | | | | |
| { | |
| | | auto block | = | cg::this_thread_block(); | | | | | | | | | | | |
| | --- | ------------------ | ----------- | ------------------------ | ----------------- | ------ | --- | --- | --- | --- | --- | --- | --- | | |
| | | constexpr | int | BlockSize | | = 128; | | | | | | | | | |
| | | __shared__ | int | smem[BlockSize]; | | | | | | | | | | | |
| | | assert(BlockSize | | | == block.size()); | | | | | | | | | | |
| | | assert(n | % BlockSize | | == | 0); | | | | | | | | | |
| | | auto completion_fn | | | = [&] | | | | | | | | | | |
| { | |
| | | int | sum = 0; | | | | | | | | | | | | |
| | --- | --- | -------- | ---- | -------------- | --- | --- | ---- | --- | --- | --- | --- | --- | | |
| | | for | (int i | = 0; | i < BlockSize; | | | ++i) | | | | | | | |
| { | |
| | | sum | += smem[i]; | | | | | | | | | | | | |
| | --- | --- | ----------- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | | |
| } | |
| | | *acc | += sum; | | | | | | | | | | | | |
| | --- | ---- | ------- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | | |
| }; | |
| | | ∕* Barrier | storage. | | | | | | | | | | | | |
| | --- | ---------- | --------------- | -------- | --------------------------------------- | ------------------------- | --------------------- | --- | --- | ------- | --- | --- | --- | | |
| | | Note: | the | barrier | is | not default-constructible | | | | because | | | | | |
| | | | completion_fn | | is | not | default-constructible | | | | due | | | | |
| | | | to the | capture. | | *∕ | | | | | | | | | |
| | | using | completion_fn_t | | = | decltype(completion_fn); | | | | | | | | | |
| | | using | barrier_t | = | cuda::barrier<cuda::thread_scope_block, | | | | | | | | | | |
| completion_fn_t>; | |
| | | __shared__ | std::aligned_storage<sizeof(barrier_t), | | | | | | | | | | | | |
| | --- | ---------- | --------------------------------------- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | | |
| alignof(barrier_t)> | |
| bar_storage; | |
| | | ∕∕ Initialize | | barrier. | | | | | | | | | | | |
| | --- | ----------------------- | ---- | -------- | ---------- | ----- | --------------- | --- | --- | --- | --- | --- | --- | | |
| | | barrier_t | *bar | = | (barrier_t | | *)&bar_storage; | | | | | | | | |
| | | if (block.thread_rank() | | | | == 0) | | | | | | | | | |
| { | |
| | | assert(*acc | | == | 0); | | | | | | | | | | |
| | --- | ----------------- | ----------------------------- | --- | ------------- | --- | ------------- | --------------- | --------------- | --- | --- | --- | --- | | |
| | | assert(blockDim.x | | | == blockDim.y | | | == blockDim.y | | == | 1); | | | | |
| | | new | (bar) barrier_t{block.size(), | | | | | completion_fn}; | | | | | | | |
| | | ∕* equivalent | | to: | init(bar, | | block.size(), | | completion_fn); | | | *∕ | | | |
| } | |
| block.sync(); | |
| | | ∕∕ Main | loop. | | | | | | | | | | | | |
| | --- | -------- | ----- | ---- | ------ | --- | ------------- | --- | --- | --- | --- | --- | --- | | |
| | | for (int | i = | 0; i | < n; i | += | block.size()) | | | | | | | | |
| { | |
| | 266 | | | | | | | | | | Chapter4. | | CUDAFeatures | | |
| | --- | ------------------------- | ------ | ---------------- | --- | ----------- | --------- | --- | ------- | --- | --------- | --- | ------------ | | |
| | | smem[block.thread_rank()] | | | | | = data[i] | | + *acc; | | | | | | |
| | | auto | token | = bar->arrive(); | | | | | | | | | | | |
| | | ∕∕ We | can do | independent | | computation | | | here. | | | | | | |
| bar->wait(std::move(token)); | |
| | | ∕∕ Shared-memory | | | is safe | to | re-use | in | the next | iteration | | | | | |
| | --- | ---------------- | --- | ------- | ---------- | ---- | ------ | --- | --------- | --------- | --- | --- | --- | | |
| | | ∕∕ since | all | threads | are | done | with | it, | including | the | one | | | | |
| | | ∕∕ that | did | the | reduction. | | | | | | | | | | |
| } | |
| } | |
| | 4.9.6. | Tracking | Asynchronous | Memory | Operations | | |
| | ------ | -------- | ------------ | ------ | ---------- | | |
Asynchronous barriers can be used to track asynchronous memory copies. When an asynchronous copy operation is bound to a barrier, the copy operation automatically increments the expected count of the current barrier phase upon initiation and decrements it upon completion. This mechanism ensures that the barrier's wait() operation will block until all associated asynchronous memory copies have completed, providing a convenient way to synchronize multiple concurrent memory operations.
Starting with compute capability 9.0, asynchronous barriers in shared memory with thread-block or cluster scope can explicitly track asynchronous memory operations. We refer to these barriers as asynchronous transaction barriers. In addition to the expected arrival count, a barrier object can accept a transaction count, which can be used for tracking the completion of asynchronous transactions. The transaction count tracks the number of asynchronous transactions that are outstanding and yet to complete, in units specified by the asynchronous memory operation (typically bytes). The transaction count to be tracked by the current phase can be set on arrival with cuda::device::barrier_arrive_tx() or directly with cuda::device::barrier_expect_tx(). When a barrier uses a transaction count, it blocks threads at the wait operation until all the producer threads have performed an arrive and the sum of all the transaction counts reaches an expected value.
| CUDAC++cuda::barrier | |
| | #include | <cuda∕barrier> | | | | | |
| | ---------- | ---------------------- | --- | --- | --- | | |
| | #include | <cooperative_groups.h> | | | | | |
| | __global__ | void track_kernel() | | | | | |
| { | |
| | __shared__ | cuda::barrier<cuda::thread_scope_block> | | | bar; | | |
| | ---------- | ------------------------------------------------ | ----- | --- | ---- | | |
| | auto | block = cooperative_groups::this_thread_block(); | | | | | |
| | if | (block.thread_rank() | == 0) | | | | |
| { | |
| | init(&bar, | block.size()); | | | | | |
| | ---------- | -------------- | --- | --- | --- | | |
| } | |
| block.sync(); | |
| | auto | token = cuda::device::barrier_arrive_tx(bar, | | | 1, 0); | | |
| | ---- | -------------------------------------------- | --- | --- | ------ | | |
| bar.wait(cuda::std::move(token)); | |
| } | |
CUDA C++ cuda::ptx

#include <cuda/ptx>
#include <cooperative_groups.h>

__global__ void track_kernel()
{
    __shared__ uint64_t bar;
    auto block = cooperative_groups::this_thread_block();
    if (block.thread_rank() == 0)
    {
        cuda::ptx::mbarrier_init(&bar, block.size());
    }
    block.sync();
    uint64_t token = cuda::ptx::mbarrier_arrive_expect_tx(cuda::ptx::sem_release,
        cuda::ptx::scope_cluster, cuda::ptx::space_shared, &bar, 1, 0);
    while (!cuda::ptx::mbarrier_try_wait(&bar, token)) {}
}
In this example, the cuda::device::barrier_arrive_tx() operation constructs an arrival token object associated with the phase synchronization point for the current phase. It then decrements the arrival count by 1 and increments the expected transaction count by 0. Since the transaction count update is 0, the barrier is not tracking any transactions. The subsequent section on Using the Tensor Memory Accelerator (TMA) includes examples of tracking asynchronous memory operations.
| | 4.9.7. | Producer-Consumer | | Pattern | Using | Barriers | | | |
| | ------ | ----------------- | --- | ------- | ----- | -------- | --- | | |
| Athreadblockcanbespatiallypartitionedtoallowdifferentthreadstoperformindependentopera- | |
| tions. Thisismostcommonlydonebyassigningthreadsfromdifferentwarpswithinthethreadblock | |
| | tospecifictasks. | Thistechniqueisreferredtoaswarpspecialization. | | | | | | | |
| | ---------------- | ---------------------------------------------- | --- | --- | --- | --- | --- | | |
| Thissectionshowsanexampleofspatialpartitioninginaproducer-consumerpattern,whereonesub- | |
| setofthreadsproducesdatathatisconcurrentlyconsumedbytheother(disjoint)subsetofthreads. | |
| Aproducer-consumerspatialpartitioningpatternrequirestwoone-sidedsynchronizationstomanage | |
| adatabufferbetweentheproducerandconsumer. | |
| Producer                                  | Consumer                            |
| ----------------------------------------- | ----------------------------------- |
| wait for buffer to be ready to be filled  | signal buffer is ready to be filled |
| produce data and fill the buffer          |                                     |
| signal buffer is filled                   | wait for buffer to be filled        |
|                                           | consume data in filled buffer       |
Producer threads wait for consumer threads to signal that the buffer is ready to be filled; however, con-
sumer threads do not wait for this signal. Consumer threads wait for producer threads to signal that
the buffer is filled; however, producer threads do not wait for this signal. For full producer/consumer
concurrency this pattern has (at least) double buffering, where each buffer requires two barriers.
CUDA C++ cuda::barrier
#include <cuda/barrier>

using barrier_t = cuda::barrier<cuda::thread_scope_block>;

__device__ void produce(barrier_t ready[], barrier_t filled[], float *buffer,
                        int buffer_len, float *in, int N)
{
    for (int i = 0; i < N / buffer_len; ++i)
    {
        ready[i % 2].arrive_and_wait(); /* wait for buffer_(i%2) to be ready to be filled */
        /* produce, i.e., fill in, buffer_(i%2) */
        barrier_t::arrival_token token = filled[i % 2].arrive(); /* buffer_(i%2) is filled */
    }
}

__device__ void consume(barrier_t ready[], barrier_t filled[], float *buffer,
                        int buffer_len, float *out, int N)
{
    barrier_t::arrival_token token1 = ready[0].arrive(); /* buffer_0 is ready for initial fill */
    barrier_t::arrival_token token2 = ready[1].arrive(); /* buffer_1 is ready for initial fill */
    for (int i = 0; i < N / buffer_len; ++i)
    {
        filled[i % 2].arrive_and_wait(); /* wait for buffer_(i%2) to be filled */
        /* consume buffer_(i%2) */
        barrier_t::arrival_token token3 = ready[i % 2].arrive(); /* buffer_(i%2) is ready to be re-filled */
    }
}

__global__ void producer_consumer_pattern(int N, float *in, float *out, int buffer_len)
{
    constexpr int warpSize = 32;
    /* Shared memory buffer declared below is of size 2 * buffer_len
       so that we can alternatively work between two buffers.
       buffer_0 = buffer and buffer_1 = buffer + buffer_len */
    extern __shared__ float buffer[];
    /* bar[0] and bar[1] track if buffers buffer_0 and buffer_1 are ready to be filled,
       while bar[2] and bar[3] track if buffers buffer_0 and buffer_1 are filled-in respectively */
    #pragma nv_diag_suppress static_var_with_dynamic_init
    __shared__ barrier_t bar[4];
    if (threadIdx.x < 4)
    {
        init(bar + threadIdx.x, blockDim.x);
    }
    __syncthreads();
    if (threadIdx.x < warpSize)
    { produce(bar, bar + 2, buffer, buffer_len, in, N); }
    else
    { consume(bar, bar + 2, buffer, buffer_len, out, N); }
}
CUDA C++ cuda::ptx
#include <cuda/ptx>

__device__ void produce(uint64_t ready[], uint64_t filled[], float *buffer,
                        int buffer_len, float *in, int N)
{
    for (int i = 0; i < N / buffer_len; ++i)
    {
        uint64_t token1 = cuda::ptx::mbarrier_arrive(&ready[i % 2]);
        while (!cuda::ptx::mbarrier_try_wait(&ready[i % 2], token1)) {} /* wait for
            buffer_(i%2) to be ready to be filled */
        /* produce, i.e., fill in, buffer_(i%2) */
        uint64_t token2 = cuda::ptx::mbarrier_arrive(&filled[i % 2]); /* buffer_(i%2) is filled */
    }
}

__device__ void consume(uint64_t ready[], uint64_t filled[], float *buffer,
                        int buffer_len, float *out, int N)
{
    uint64_t token1 = cuda::ptx::mbarrier_arrive(&ready[0]); /* buffer_0 is ready for initial fill */
    uint64_t token2 = cuda::ptx::mbarrier_arrive(&ready[1]); /* buffer_1 is ready for initial fill */
    for (int i = 0; i < N / buffer_len; ++i)
    {
        uint64_t token3 = cuda::ptx::mbarrier_arrive(&filled[i % 2]);
        while (!cuda::ptx::mbarrier_try_wait(&filled[i % 2], token3)) {} /* wait for
            buffer_(i%2) to be filled */
        /* consume buffer_(i%2) */
        uint64_t token4 = cuda::ptx::mbarrier_arrive(&ready[i % 2]); /* buffer_(i%2) is
            ready to be re-filled */
    }
}

__global__ void producer_consumer_pattern(int N, float *in, float *out, int buffer_len)
{
    constexpr int warpSize = 32;
    /* Shared memory buffer declared below is of size 2 * buffer_len
       so that we can alternatively work between two buffers.
       buffer_0 = buffer and buffer_1 = buffer + buffer_len */
    extern __shared__ float buffer[];
    /* bar[0] and bar[1] track if buffers buffer_0 and buffer_1 are ready to be filled,
       while bar[2] and bar[3] track if buffers buffer_0 and buffer_1 are filled-in respectively */
    #pragma nv_diag_suppress static_var_with_dynamic_init
    __shared__ uint64_t bar[4];
    if (threadIdx.x < 4)
    {
        cuda::ptx::mbarrier_init(bar + threadIdx.x, blockDim.x);
    }
    __syncthreads();
    if (threadIdx.x < warpSize)
    { produce(bar, bar + 2, buffer, buffer_len, in, N); }
    else
    { consume(bar, bar + 2, buffer, buffer_len, out, N); }
}
CUDA C primitives
#include <cuda_awbarrier_primitives.h>

__device__ void produce(__mbarrier_t ready[], __mbarrier_t filled[], float *buffer,
                        int buffer_len, float *in, int N)
{
    for (int i = 0; i < N / buffer_len; ++i)
    {
        __mbarrier_token_t token1 = __mbarrier_arrive(&ready[i % 2]);
        while (!__mbarrier_try_wait(&ready[i % 2], token1, 1000)) {} /* wait for
            buffer_(i%2) to be ready to be filled */
        /* produce, i.e., fill in, buffer_(i%2) */
        __mbarrier_token_t token2 = __mbarrier_arrive(&filled[i % 2]); /* buffer_(i%2) is filled */
    }
}

__device__ void consume(__mbarrier_t ready[], __mbarrier_t filled[], float *buffer,
                        int buffer_len, float *out, int N)
{
    __mbarrier_token_t token1 = __mbarrier_arrive(&ready[0]); /* buffer_0 is ready for initial fill */
    __mbarrier_token_t token2 = __mbarrier_arrive(&ready[1]); /* buffer_1 is ready for initial fill */
    for (int i = 0; i < N / buffer_len; ++i)
    {
        __mbarrier_token_t token3 = __mbarrier_arrive(&filled[i % 2]);
        while (!__mbarrier_try_wait(&filled[i % 2], token3, 1000)) {} /* wait for
            buffer_(i%2) to be filled */
        /* consume buffer_(i%2) */
        __mbarrier_token_t token4 = __mbarrier_arrive(&ready[i % 2]); /* buffer_(i%2) is
            ready to be re-filled */
    }
}

__global__ void producer_consumer_pattern(int N, float *in, float *out, int buffer_len)
{
    constexpr int warpSize = 32;
    /* Shared memory buffer declared below is of size 2 * buffer_len
       so that we can alternatively work between two buffers.
       buffer_0 = buffer and buffer_1 = buffer + buffer_len */
    extern __shared__ float buffer[];
    /* bar[0] and bar[1] track if buffers buffer_0 and buffer_1 are ready to be filled,
       while bar[2] and bar[3] track if buffers buffer_0 and buffer_1 are filled-in respectively */
    #pragma nv_diag_suppress static_var_with_dynamic_init
    __shared__ __mbarrier_t bar[4];
    if (threadIdx.x < 4)
    {
        __mbarrier_init(bar + threadIdx.x, blockDim.x);
    }
    __syncthreads();
    if (threadIdx.x < warpSize)
    { produce(bar, bar + 2, buffer, buffer_len, in, N); }
    else
    { consume(bar, bar + 2, buffer, buffer_len, out, N); }
}
In this example, the first warp is specialized as the producer and the remaining warps are special-
ized as consumers. All producer and consumer threads participate (call bar.arrive() or bar.
arrive_and_wait()) in each of the four barriers, so the expected arrival counts are equal to block.
size().
A producer thread waits for the consumer threads to signal that the shared memory buffer can be
filled. In order to wait on a barrier, a producer thread must first arrive on it with ready[i%2].arrive()
to get a token and then wait with ready[i%2].wait(token) using that token. For simplicity, ready[i%2].
arrive_and_wait() combines these operations.
| bar.arrive_and_wait(); | |
/* is equivalent to */
| bar.wait(bar.arrive()); | |
Producer threads compute and fill the ready buffer; they then signal that the buffer is filled by arriving
on the filled barrier, filled[i%2].arrive(). A producer thread does not wait at this point; instead,
it waits until the next iteration's buffer (double buffering) is ready to be filled.
A consumer thread begins by signaling that both buffers are ready to be filled. A consumer thread
does not wait at this point; instead, it waits for this iteration's buffer to be filled, filled[i%2].
arrive_and_wait(). After the consumer threads consume the buffer, they signal that the buffer
is ready to be filled again, ready[i%2].arrive(), and then wait for the next iteration's buffer to be
filled.
| | 4.10. | Pipelines | | | | | |
| | ----- | --------- | --- | --- | --- | | |
Pipelines, introduced in Advanced Synchronization Primitives, are a mechanism for staging work and co-
ordinating multi-buffer producer-consumer patterns, commonly used to overlap compute with asyn-
chronous data copies.
This section provides details on how to use pipelines, mainly via the cuda::pipeline API (with pointers
to primitives where applicable).
| | 4.10.1. | Initialization | | | | | |
| | ------- | -------------- | --- | --- | --- | | |
A cuda::pipeline can be created at different thread scopes. For a scope other than
cuda::thread_scope_thread, a cuda::pipeline_shared_state<scope, count> object is re-
quired to coordinate the participating threads. This state encapsulates the finite resources that allow
a pipeline to process up to count concurrent stages.
// Create a pipeline at thread scope
constexpr auto scope = cuda::thread_scope_thread;
cuda::pipeline<scope> pipeline = cuda::make_pipeline();

// Create a pipeline at block scope
constexpr auto scope = cuda::thread_scope_block;
constexpr auto stages_count = 2;
__shared__ cuda::pipeline_shared_state<scope, stages_count> shared_state;
auto pipeline = cuda::make_pipeline(group, &shared_state);
Pipelines can be either unified or partitioned. In a unified pipeline, all the participating threads are both
producers and consumers. In a partitioned pipeline, each participating thread is either a producer
or a consumer, and its role cannot change during the lifetime of the pipeline object. A thread-local
pipeline cannot be partitioned. To create a partitioned pipeline, we need to provide either the number
of producers or the role of the thread to cuda::make_pipeline().
// Create a partitioned pipeline at block scope where only thread 0 is a producer
constexpr auto scope = cuda::thread_scope_block;
constexpr auto stages_count = 2;
__shared__ cuda::pipeline_shared_state<scope, stages_count> shared_state;
auto thread_role = (group.thread_rank() == 0) ? cuda::pipeline_role::producer
                                              : cuda::pipeline_role::consumer;
auto pipeline = cuda::make_pipeline(group, &shared_state, thread_role);
To support partitioning, a shared cuda::pipeline incurs additional overheads, including using a set
of shared memory barriers per stage for synchronization. These are used even when the pipeline is
unified and could use __syncthreads() instead. Thus, it is preferable to use thread-local pipelines,
which avoid these overheads, when possible.
| 4.10.2. Submitting Work | |
Committing work to a pipeline stage involves:
▶ Collectively acquiring the pipeline head from a set of producer threads using pipeline.
producer_acquire().
▶ Submitting asynchronous operations, e.g., memcpy_async, to the pipeline head.
▶ Collectively committing (advancing) the pipeline head using pipeline.producer_commit().
If all resources are in use, pipeline.producer_acquire() blocks producer threads until the re-
sources of the next pipeline stage are released by consumer threads.
| 4.10.3. Consuming Work | |
Consuming work from a previously committed stage involves:
▶ Collectively waiting for the stage to complete, e.g., using pipeline.consumer_wait() to wait
on the tail (oldest) stage, from a set of consumer threads.
▶ Collectively releasing the stage using pipeline.consumer_release().
With cuda::pipeline<cuda::thread_scope_thread>, one can also use the
cuda::pipeline_consumer_wait_prior<N>() friend function to wait for all except the last
N stages to complete, similar to __pipeline_wait_prior(N) in the primitives API.
| 4.10.4. Warp Entanglement | |
The pipeline mechanism is shared among CUDA threads in the same warp. This sharing causes se-
quences of submitted operations to be entangled within a warp, which can impact performance under
certain circumstances.
Commit. The commit operation is coalesced such that the pipeline's sequence is incremented once for
all converged threads that invoke the commit operation, and their submitted operations are batched
together. If the warp is fully converged, the sequence is incremented by one and all submitted oper-
ations are batched in the same stage of the pipeline; if the warp is fully diverged, the sequence is
incremented by 32 and the submitted operations are spread across different stages.
| ▶ | |
| LetPBbethewarp-sharedpipeline’sactualsequenceofoperations. | |
| | PB = {BP0, | BP1, BP2, | …, BPL} | | | | | |
| | ---------- | --------- | ------- | --- | --- | --- | | |
| ▶ LetTBbeathread’sperceivedsequenceofoperations,asifthesequencewereonlyincremented | |
| bythisthread’sinvocationofthecommitoperation. | |
| | TB = {BT0, | BT1, BT2, | …, BTL} | | | | | |
| | ---------- | --------- | ------- | --- | --- | --- | | |
| The pipeline::producer_commit() return value is from the thread’s perceived batch | |
| sequence. | |
| ▶ | |
| Anindexinathread’sperceivedsequencealwaysalignstoanequalorlargerindexintheactual | |
| warp-shared sequence. The sequences are equal only when all commit operations are invoked | |
| fromfullyconvergedthreads. | |
| | BTn BPmwheren | <= | m | | | | | |
| | --------------- | --- | --- | --- | --- | --- | | |
For example, when a warp is fully diverged:
▶ The warp-shared pipeline's actual sequence would be: PB = {0, 1, 2, 3, ..., 31} (PL=31).
▶ The perceived sequence for each thread of this warp would be:
  ▶ Thread 0: TB = {0} (TL=0)
  ▶ Thread 1: TB = {0} (TL=0)
  ▶ …
  ▶ Thread 31: TB = {0} (TL=0)
Wait. A CUDA thread invokes pipeline::consumer_wait() or
pipeline_consumer_wait_prior<N>() to wait for batches in the perceived se-
quence TB to complete. Note that pipeline::consumer_wait() is equivalent to
pipeline_consumer_wait_prior<N>(), where N = PL.
The wait-prior variants wait for batches in the actual sequence at least up to and including PL-N. Since
TL <= PL, waiting for batches up to and including PL-N includes waiting for batch TL-N. Thus, when
TL < PL, the thread will unintentionally wait for additional, more recent batches. In the extreme
fully-diverged warp example above, each thread could wait for all 32 batches.
Note
It is recommended that commit invocations are made by converged threads, to avoid over-waiting, by
keeping threads' perceived sequence of batches aligned with the actual sequence.
When code preceding these operations diverges threads, the warp should be re-converged,
via __syncwarp(), before invoking commit operations.
4.10.5. Early Exit
When a thread that is participating in a pipeline must exit early, that thread must explicitly drop out of
participation before exiting using cuda::pipeline::quit(). The remaining participating threads
can proceed normally with subsequent operations.
| | 4.10.6. | Tracking | Asynchronous | Memory | Operations | | | |
| | ------- | -------- | ------------ | ------ | ---------- | --- | | |
The following example demonstrates how to collectively copy data from global to shared memory with
asynchronous memory copies, using a pipeline to keep track of the copy operations. Each thread uses
its own pipeline to independently submit memory copies and then wait for them to complete and
consume the data. For more details on asynchronous data copies, see Section 3.2.5.
| | 278 | | | | Chapter4. | CUDAFeatures | | |
| | --- | --- | --- | --- | --------- | ------------ | | |
| CUDAProgrammingGuide,Release13.1 | |
CUDA C++ cuda::pipeline
#include <cuda/pipeline>

__global__ void example_kernel(const float *in)
{
    constexpr int block_size = 128;
    __shared__ __align__(sizeof(float)) float buffer[4 * block_size];

    // Create a unified pipeline per thread
    cuda::pipeline<cuda::thread_scope_thread> pipeline = cuda::make_pipeline();

    // First stage of memory copies
    pipeline.producer_acquire();
    // Every thread fetches one element of the first block
    cuda::memcpy_async(buffer, in, sizeof(float), pipeline);
    pipeline.producer_commit();

    // Second stage of memory copies
    pipeline.producer_acquire();
    // Every thread fetches one element of the second and third block
    cuda::memcpy_async(buffer + block_size, in + block_size, sizeof(float), pipeline);
    cuda::memcpy_async(buffer + 2 * block_size, in + 2 * block_size, sizeof(float), pipeline);
    pipeline.producer_commit();

    // Third stage of memory copies
    pipeline.producer_acquire();
    // Every thread fetches one element of the last block
    cuda::memcpy_async(buffer + 3 * block_size, in + 3 * block_size, sizeof(float), pipeline);
    pipeline.producer_commit();

    // Wait for the oldest stage (waits for first stage)
    pipeline.consumer_wait();
    pipeline.consumer_release();
    // __syncthreads();
    // Use data from the first stage

    // Wait for the oldest stage (waits for second stage)
    pipeline.consumer_wait();
    pipeline.consumer_release();
    // __syncthreads();
    // Use data from the second stage

    // Wait for the oldest stage (waits for third stage)
    pipeline.consumer_wait();
    pipeline.consumer_release();
    // __syncthreads();
    // Use data from the third stage
}
CUDA C primitives
#include <cuda_pipeline.h>

__global__ void example_kernel(const float *in)
{
    constexpr int block_size = 128;
    __shared__ __align__(sizeof(float)) float buffer[4 * block_size];

    // First batch of memory copies
    // Every thread fetches one element of the first block
    __pipeline_memcpy_async(buffer, in, sizeof(float));
    __pipeline_commit();

    // Second batch of memory copies
    // Every thread fetches one element of the second and third block
    __pipeline_memcpy_async(buffer + block_size, in + block_size, sizeof(float));
    __pipeline_memcpy_async(buffer + 2 * block_size, in + 2 * block_size, sizeof(float));
    __pipeline_commit();

    // Third batch of memory copies
    // Every thread fetches one element of the last block
    __pipeline_memcpy_async(buffer + 3 * block_size, in + 3 * block_size, sizeof(float));
    __pipeline_commit();

    // Wait for all except the last two batches of memory copies (waits for first batch)
    __pipeline_wait_prior(2);
    // __syncthreads();
    // Use data from the first batch

    // Wait for all except the last batch of memory copies (waits for second batch)
    __pipeline_wait_prior(1);
    // __syncthreads();
    // Use data from the second batch

    // Wait for all batches of memory copies (waits for third batch)
    __pipeline_wait_prior(0);
    // __syncthreads();
    // Use data from the last batch
}
| | 280 | | | | | Chapter4. | CUDAFeatures | | |
| | --- | --- | --- | --- | --- | --------- | ------------ | | |
| CUDAProgrammingGuide,Release13.1 | |
| 4.10.7. Producer-Consumer Pattern using Pipelines | |
In Section 4.9.7, we showed how a thread block can be spatially partitioned to implement a producer-
consumer pattern using asynchronous barriers. With cuda::pipeline, this can be simplified using
a single partitioned pipeline with one stage per data buffer instead of two asynchronous barriers per
buffer.
CUDA C++ cuda::pipeline
#include <cuda/pipeline>
#include <cooperative_groups.h>

#pragma nv_diag_suppress static_var_with_dynamic_init

using pipeline = cuda::pipeline<cuda::thread_scope_block>;

__device__ void produce(pipeline &pipe, int num_stages, int stage, int num_batches,
                        int batch, float *buffer, int buffer_len, float *in, int N)
{
    if (batch < num_batches)
    {
        pipe.producer_acquire();
        /* copy data from in(batch) to buffer(stage) using asynchronous memory copies */
        pipe.producer_commit();
    }
}

__device__ void consume(pipeline &pipe, int num_stages, int stage, int num_batches,
                        int batch, float *buffer, int buffer_len, float *out, int N)
{
    pipe.consumer_wait();
    /* consume buffer(stage) and update out(batch) */
    pipe.consumer_release();
}

__global__ void producer_consumer_pattern(float *in, float *out, int N, int buffer_len)
{
    auto block = cooperative_groups::this_thread_block();
    /* Shared memory buffer declared below is of size 2 * buffer_len
       so that we can alternatively work between two buffers.
       buffer_0 = buffer and buffer_1 = buffer + buffer_len */
    extern __shared__ float buffer[];
| | | const int | num_batches | | | = N ∕ | buffer_len; | | | | | | | |
| ∕∕ Create a partitioned pipeline with 2 stages where half the threads are | |
| | | ,→producers | and | the | other | half | are | consumers. | | | | | | |
| | --- | ----------------- | ---- | ---------- | -------------- | ------------------------- | --- | -------------- | --- | ---- | --- | --- | | |
| | | constexpr | auto | scope | = | cuda::thread_scope_block; | | | | | | | | |
| | | constexpr | int | num_stages | | = | 2; | | | | | | | |
| | | cuda::std::size_t | | | producer_count | | | = block.size() | | ∕ 2; | | | | |
| __shared__ cuda::pipeline_shared_state<scope, num_stages> shared_state; | |
| pipeline pipe = cuda::make_pipeline(block, &shared_state, producer_count); | |
| | | ∕∕ Fill | the | pipeline | | | | | | | | | | |
| | --- | ----------------------- | --- | -------- | --- | ----------------- | --- | --- | --- | --- | --- | --- | | |
| | | if (block.thread_rank() | | | | < producer_count) | | | | | | | | |
| { | |
| | | for (int | s | = 0; | s < | num_stages; | | ++s) | | | | | | |
| | --- | -------- | --- | ---- | --- | ----------- | --- | ---- | --- | --- | --- | --- | | |
| { | |
| | 282 | | | | | | | | | bufferC,habputefrfe4r. | | _ClUeDnA, Fiena,tures | | |
| | --- | ------------- | --- | --- | ----------- | --- | --- | ------------ | --- | ---------------------- | --- | ---------------------- | | |
| | | produce(pipe, | | | num_stages, | | s, | num_batches, | | s, | | | | |
| ,→N); | |
| } | |
| } | |
| | | ∕∕ Process | the | batches | | | | | | | | | | |
| | --- | ----------- | --- | ------- | --- | ------------ | --- | ---- | --- | --- | --- | --- | | |
| | | int stage | = | 0; | | | | | | | | | | |
| | | for (size_t | | b = 0; | b < | num_batches; | | ++b) | | | | | | |
| { | |
| | | if (block.thread_rank() | | | | < | producer_count) | | | | | | | |
| | --- | ----------------------- | --- | --- | --- | --- | --------------- | --- | --- | --- | --- | --- | | |
| { | |
| | | ∕∕ | Prefetch | the | next | batch | | | | | | | | |
| | --- | --- | -------- | --- | ---- | ----- | --- | --- | --- | --- | --- | --- | | |
| produce(pipe, num_stages, stage, num_batches, b + num_stages, buffer, | |
| | | ,→buffer_len, | | in, N); | | | | | | | | | | |
| | --- | ------------- | --- | ------- | --- | --- | --- | --- | --- | --- | --- | --- | | |
| } | |
| else | |
| { | |
| | | ∕∕ | Consume | the | oldest | batch | | | | | | | | |
| | --- | --- | ------- | --- | ------ | ----- | --- | --- | --- | --- | --- | --- | | |
| consume(pipe, num_stages, stage, num_batches, b, buffer, buffer_len, | |
| | | ,→out, N); | | | | | | | | | | | | |
| | --- | ---------- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | | |
| } | |
| | | stage | = (stage | | + 1) | % num_stages; | | | | | | | | |
| | --- | ----- | -------- | --- | ---- | ------------- | --- | --- | --- | --- | --- | --- | | |
| } | |
| } | |
In this example, we use half of the threads in the thread block as producers and the other half as consumers. As a first step, we need to create a cuda::pipeline object. Since we want some threads to be producers and some to be consumers, we need to use a partitioned pipeline with cuda::thread_scope_block. Partitioned pipelines require a cuda::pipeline_shared_state to coordinate the participating threads. We initialize the state for a 2-stage pipeline in thread-block scope and then call cuda::make_pipeline(). Next, producer threads fill the pipeline by submitting asynchronous copies from in to buffer. At this point all data copies are in-flight. Finally, in the main loop, we go over all of the batches of data and, depending on whether a thread is a producer or a consumer, we either submit another asynchronous copy for a future batch or consume the current batch.
4.11. Asynchronous Data Copies

Building on Section 3.2.5, this section provides detailed guidance and examples for asynchronous data movement within the GPU memory hierarchy. It covers LDGSTS for element-wise copies, the Tensor Memory Accelerator (TMA) for bulk (one-dimensional and multi-dimensional) transfers, and STAS for register to distributed shared memory copies, and shows how these mechanisms integrate with asynchronous barriers and pipelines.
4.11.1. Using LDGSTS

Many CUDA applications require frequent data movement between global and shared memory. Often, this involves copying smaller data elements or performing irregular memory access patterns. The primary goal of LDGSTS (CC 8.0+, see the PTX documentation) is to provide an efficient asynchronous data transfer mechanism from global memory to shared memory for smaller, element-wise data transfers while enabling better utilization of compute resources through overlapped execution.
Dimensions. LDGSTS supports copying 4, 8, or 16 bytes. Copying 4 or 8 bytes always happens in the so-called L1 ACCESS mode, in which case the data is also cached in the L1, while copying 16 bytes enables the L1 BYPASS mode, in which case the L1 is not polluted.
Source and destination. The only direction supported for asynchronous copy operations with LDGSTS is from global to shared memory. The pointers need to be aligned to 4, 8, or 16 bytes depending on the size of the data being copied. Best performance is achieved when the alignment of both shared memory and global memory is 128 bytes.
Asynchronicity. Data transfers using LDGSTS are asynchronous and are modeled as async thread operations (see Async Thread and Async Proxy). This allows the initiating thread to continue computing while the hardware asynchronously copies the data. Whether the data transfer occurs asynchronously in practice is up to the hardware implementation and may change in the future.
LDGSTS must provide a signal when the operation is complete. LDGSTS can use shared memory barriers or pipelines as mechanisms to provide completion signals. By default, each thread only waits for its own LDGSTS copies. Thus, if you use LDGSTS to prefetch data that will be shared with other threads, a __syncthreads() is necessary after synchronizing with the LDGSTS completion mechanism.
Table 18: Asynchronous copies with possible source and destination memory spaces and completion mechanisms using LDGSTS. An empty cell indicates that a source-destination pair is not supported.

| Source | Destination | Completion Mechanism | API |
| --- | --- | --- | --- |
| global | global | | |
| shared::cta | global | | |
| global | shared::cta | shared memory barrier, pipeline | cuda::memcpy_async, cooperative_groups::memcpy_async, __pipeline_memcpy_async |
| global | shared::cluster | | |
| shared::cluster | shared::cta | | |
| shared::cta | shared::cta | | |
In the following sections, we will demonstrate how to use LDGSTS through examples and explain the differences between the different APIs.
| | 4.11.1.1 | BatchingLoadsinConditionalCode | | | | | | | | | | | | | |
| | -------- | ------------------------------ | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | | |
In this stencil example, the first warp of the thread block is responsible for collectively loading all the required data from the center as well as the left and right halos. With synchronous copies, due to the conditional nature of the code, the compiler may choose to generate a sequence of load-from-global (LDG) store-to-shared (STS) instructions instead of 3 LDGs followed by 3 STSs, which would be the optimal way to load the data to hide the global memory latency.
__global__ void stencil_kernel(const float *left, const float *center, const float *right)
{
    // Left halo (8 elements) - center (32 elements) - right halo (8 elements)
    __shared__ float buffer[8 + 32 + 8];
    const int tid = threadIdx.x;
    if (tid < 8) {
        buffer[tid] = left[tid];            // Left halo
    } else if (tid >= 32 - 8) {
        buffer[tid + 16] = right[tid];      // Right halo
    }
    if (tid < 32) {
        buffer[tid + 8] = center[tid];      // Center
    }
    __syncthreads();
    // Compute stencil
}
To ensure that the data is loaded in the optimal way, we can replace the synchronous memory copies with asynchronous copies that load data directly from global memory to shared memory. This not only reduces register usage by copying the data directly to shared memory, but also ensures all loads from global memory are in-flight.
CUDA C++ cuda::memcpy_async

#include <cooperative_groups.h>
#include <cuda/barrier>

__global__ void stencil_kernel(const float *left, const float *center, const float *right)
{
    auto block = cooperative_groups::this_thread_block();