diff --git a/docs/how-to/hip_porting_driver_api.rst b/docs/how-to/hip_porting_driver_api.rst
index ccc7f21e93..9fdce8758c 100644
--- a/docs/how-to/hip_porting_driver_api.rst
+++ b/docs/how-to/hip_porting_driver_api.rst
@@ -128,7 +128,7 @@ Initialization and termination functions
 -----------------------------------------
 HIP-Clang generates initialization and termination functions for each translation unit for host code compilation. The initialization functions call ``__hipRegisterFatBinary`` to register the fat binary embedded in the ELF file. They also call ``__hipRegisterFunction`` and ``__hipRegisterVar`` to register kernel functions and device-side global variables. The termination functions call ``__hipUnregisterFatBinary``.
-HIP-Clang emits a global variable ``__hip_gpubin_handle`` of type ``void**`` with ``linkonce`` linkage and an initial value of 0 for each host translation unit. Each initialization function checks ``__hip_gpubin_handle`` and registers the fat binary only if ``__hip_gpubin_handle`` is 0. It saves the return value of ``__hip_gpubin_handle`` to ``__hip_gpubin_handle``. This is to guarantee that the fat binary is only registered once. A similar check is performed in the termination functions.
+HIP-Clang emits a global variable ``__hip_gpubin_handle`` of type ``void**`` with ``linkonce`` linkage and an initial value of 0 for each host translation unit. Each initialization function checks ``__hip_gpubin_handle`` and registers the fat binary only if ``__hip_gpubin_handle`` is 0. It saves the return value of ``__hipRegisterFatBinary`` to ``__hip_gpubin_handle``. This ensures that the fat binary is registered only once. A similar check is performed in the termination functions.
 
 Kernel launching
 ----------------
@@ -138,7 +138,7 @@ HIP-Clang supports kernel launching using either the CUDA ``<<<>>>`` syntax, ``h
 When the executable or shared library is loaded by the dynamic linker, the initialization functions are called. In the initialization functions, the code objects containing all kernels are loaded when ``__hipRegisterFatBinary`` is called. When ``__hipRegisterFunction`` is called, the stub functions are associated with the corresponding kernels in the code objects. HIP-Clang implements two sets of APIs for launching kernels.
-By default, when HIP-Clang encounters the ``<<<>>>`` statement in the host code, it first calls ``hipConfigureCall`` to set up the threads and grids. It then calls the stub function with the given arguments. The stub function calls ``hipSetupArgument`` for each kernel argument, then calls ``hipLaunchByPtr`` with a function pointer to the stub function. In ``hipLaunchByPtr``, the real kernel associated with the stub function is launched.
+By default, when HIP-Clang encounters the ``<<<>>>`` statement in the host code, it first calls ``hipConfigureCall`` to set up the threads and grids. It then calls the stub function with the given arguments. The stub function calls ``hipSetupArgument`` for each kernel argument, then calls ``hipLaunchByPtr`` with a function pointer to the stub function. In ``hipLaunchByPtr``, the actual kernel associated with the stub function is launched, as shown in the sketch below.
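+This sequence behaves roughly like the following sketch. It is illustrative only: ``my_kernel``, its argument, and the stub name are hypothetical stand-ins, and the code that HIP-Clang actually generates is more involved.
+
+.. code-block:: cpp
+
+   #include <hip/hip_runtime.h>
+
+   __global__ void my_kernel(int *data); // hypothetical kernel
+
+   // Stand-in for the compiler-generated stub associated with my_kernel.
+   void device_stub_my_kernel(int *data) {
+     // Register each kernel argument with its size and offset.
+     hipSetupArgument(&data, sizeof(data), 0);
+     // Launch the actual kernel associated with this stub.
+     hipLaunchByPtr((const void *)device_stub_my_kernel);
+   }
+
+   void launch_my_kernel(int *data, dim3 grid, dim3 block, hipStream_t stream) {
+     // my_kernel<<<grid, block, 0, stream>>>(data) lowers to:
+     hipConfigureCall(grid, block, 0, stream);
+     device_stub_my_kernel(data);
+   }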
 
 NVCC implementation notes
 =========================
 
@@ -181,7 +181,7 @@ Compilation options
 
 The ``hipModule_t`` interface does not support the ``cuModuleLoadDataEx`` function, which is used to control PTX compilation options. HIP-Clang does not use PTX, so it does not support these compilation options.
-In fact, HIP-Clang code objects always contain fully compiled code for a device-specific instruction set and do not require additional compilation as a part of the load step.
+In fact, HIP-Clang code objects contain fully compiled code for a device-specific instruction set and don't require additional compilation as part of the load step.
 The corresponding HIP function ``hipModuleLoadDataEx`` behaves like ``hipModuleLoadData`` on the HIP-Clang path (where compilation options are not used) and like ``cuModuleLoadDataEx`` on the NVCC path. For example:
diff --git a/docs/how-to/stream_ordered_allocator.rst b/docs/how-to/stream_ordered_allocator.rst
index 5cef65a2c9..3279fa639e 100644
--- a/docs/how-to/stream_ordered_allocator.rst
+++ b/docs/how-to/stream_ordered_allocator.rst
@@ -24,7 +24,11 @@ Disadvantages of SOMA:
 
 Using SOMA
 =====================================
 
-You can allocate memory using ``hipMallocAsync()`` with stream-ordered semantics. This restricts the asynchronous accesses to the memory to occur between the stream executions of the allocation and deallocation. Accessing memory outside this promised stream order can lead to undefined behavior such as use-before-allocation or use-after-free errors. The allocator might reallocate memory as long as the compliant memory accesses are guaranteed not to overlap temporally. ``hipFreeAsync()`` frees memory from the pool with stream-ordered semantics.
+You can allocate memory using ``hipMallocAsync()`` with stream-ordered semantics. This restricts
+asynchronous access to the memory to the window between the stream-ordered
+allocation and deallocation. Accessing memory outside this promised stream order can lead to
+undefined behavior, such as use-before-allocation or use-after-free errors. The allocator might
+reallocate memory as long as the compliant memory accesses won't overlap temporally. ``hipFreeAsync()`` frees memory from the pool with stream-ordered semantics.
 
 Here is how to use stream ordered memory allocation:
 
@@ -140,7 +144,7 @@ Set pools
 
 The ``hipMallocAsync()`` function uses the current memory pool and also provides the opportunity to create and access different pools using ``hipMemPoolCreate()`` and ``hipMallocFromPoolAsync()`` functions respectively.
 
-Unlike NVIDIA CUDA, where stream-ordered memory allocation can be implicit, in AMD HIP, it's always explicit. This requires you to manage memory allocation for each stream in HIP while ensuring precise control over memory usage and synchronization.
+Unlike NVIDIA CUDA, where stream-ordered memory allocation can be implicit, in ROCm HIP it is always explicit. This requires managing memory allocation for each stream in HIP while ensuring precise control over memory usage and synchronization.
 
 .. code-block:: cpp
 
@@ -348,7 +352,7 @@ To reset these attributes to the current value, use ``hipMemPoolSetAttribute()``
 Memory reuse policies
 ---------------------
 
-The allocator might reallocate memory as long as the compliant memory accesses are guaranteed not to overlap temporally. To optimize the memory usage, disable or enable the following memory pool reuse policy attribute flags:
+The allocator might reallocate memory as long as the compliant memory accesses won't overlap temporally. To optimize memory usage, disable or enable the following memory pool reuse policy attribute flags, as shown in the sketch after this list:
 
 - ``hipMemPoolReuseFollowEventDependencies``: Checks event dependencies before allocating additional GPU memory.
 - ``hipMemPoolReuseAllowOpportunistic``: Checks freed allocations to determine if the stream order semantic indicated by the free operation has been met.
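+
+For example, a minimal sketch of toggling these flags might look like the following (error checking omitted; the chosen flag values are illustrative, and the default pool of device 0 is queried with ``hipDeviceGetDefaultMemPool()``):
+
+.. code-block:: cpp
+
+   #include <hip/hip_runtime.h>
+
+   int main() {
+     // Query the default memory pool of device 0.
+     hipMemPool_t pool;
+     hipDeviceGetDefaultMemPool(&pool, 0);
+
+     // The reuse flags take an int value: 0 disables, nonzero enables.
+     int enable = 1;
+     hipMemPoolSetAttribute(pool, hipMemPoolReuseFollowEventDependencies, &enable);
+
+     int disable = 0;
+     hipMemPoolSetAttribute(pool, hipMemPoolReuseAllowOpportunistic, &disable);
+     return 0;
+   }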
@@ -357,7 +361,7 @@ The allocator might reallocate memory as long as the compliant memory accesses a
 Device accessibility for multi-GPU support
 ------------------------------------------
 
-Allocations are initially accessible only from the device where they reside.
+Allocations are initially accessible only from the device on which they reside.
 
 Interprocess memory handling
 =============================
@@ -466,6 +470,7 @@ Here is how to read the pool exported in the preceding example:
 
    }
 
 .. _shareable-handle:
+
 Shareable handle
 ----------------
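+
+For example, a pool that can be shared across processes might be created and exported as follows. This is a minimal sketch for a POSIX file descriptor handle; transferring ``fd`` to the importing process, for example over a Unix-domain socket, is not shown.
+
+.. code-block:: cpp
+
+   #include <hip/hip_runtime.h>
+   #include <cstring>
+
+   int main() {
+     // Create a pool whose allocations can be exported as POSIX file descriptors.
+     hipMemPoolProps props;
+     std::memset(&props, 0, sizeof(props));
+     props.allocType = hipMemAllocationTypePinned;
+     props.handleTypes = hipMemHandleTypePosixFileDescriptor;
+     props.location.type = hipMemLocationTypeDevice;
+     props.location.id = 0; // device 0
+
+     hipMemPool_t pool;
+     hipMemPoolCreate(&pool, &props);
+
+     // Export the pool. The importing process calls
+     // hipMemPoolImportFromShareableHandle() with the received descriptor.
+     int fd = -1;
+     hipMemPoolExportToShareableHandle(&fd, pool, hipMemHandleTypePosixFileDescriptor, 0);
+     return 0;
+   }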