Update hip_porting_driver_api.rst and stream_ordered_allocator.rst
Co-authored-by: Leo Paoletti <[email protected]>
neon60 and lpaoletti committed Oct 4, 2024
1 parent e03616c commit 4892711
Showing 2 changed files with 12 additions and 7 deletions.
6 changes: 3 additions & 3 deletions docs/how-to/hip_porting_driver_api.rst
@@ -128,7 +128,7 @@ Initialization and termination functions
-----------------------------------------

HIP-Clang generates initialization and termination functions for each translation unit for host code compilation. The initialization functions call ``__hipRegisterFatBinary`` to register the fat binary embedded in the ELF file. They also call ``__hipRegisterFunction`` and ``__hipRegisterVar`` to register kernel functions and device-side global variables. The termination functions call ``__hipUnregisterFatBinary``.
HIP-Clang emits a global variable ``__hip_gpubin_handle`` of type ``void**`` with ``linkonce`` linkage and an initial value of 0 for each host translation unit. Each initialization function checks ``__hip_gpubin_handle`` and registers the fat binary only if ``__hip_gpubin_handle`` is 0. It saves the return value of ``__hip_gpubin_handle`` to ``__hip_gpubin_handle``. This is to guarantee that the fat binary is only registered once. A similar check is performed in the termination functions.
HIP-Clang emits a global variable ``__hip_gpubin_handle`` of type ``void**`` with ``linkonce`` linkage and an initial value of 0 for each host translation unit. Each initialization function checks ``__hip_gpubin_handle`` and registers the fat binary only if ``__hip_gpubin_handle`` is 0. It saves the return value of ``__hipRegisterFatBinary`` to ``__hip_gpubin_handle``. This ensures that the fat binary is registered only once. A similar check is performed in the termination functions.
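For illustration, the register-once pattern described above can be sketched in pseudo-C++. This is not the code HIP-Clang actually emits (the registration logic is generated directly in the IR and the ``__hipRegister*`` entry points take additional arguments), but it shows the guard on ``__hip_gpubin_handle``:

.. code-block:: cpp

   // Sketch only: per-translation-unit registration logic.
   // linkonce linkage means every translation unit in the linked image shares
   // a single copy of __hip_gpubin_handle, so the fat binary is registered once.
   void **__hip_gpubin_handle = nullptr;

   static void __hip_module_ctor() {
     if (__hip_gpubin_handle == nullptr) {
       // Register the fat binary embedded in the ELF file and cache the handle.
       __hip_gpubin_handle = __hipRegisterFatBinary(/* fat binary wrapper */);
     }
     // Register this translation unit's kernels and device-side globals.
     __hipRegisterFunction(__hip_gpubin_handle, /* stub, kernel name, ... */);
     __hipRegisterVar(__hip_gpubin_handle, /* variable, name, ... */);
   }

   static void __hip_module_dtor() {
     if (__hip_gpubin_handle != nullptr) {
       __hipUnregisterFatBinary(__hip_gpubin_handle);
       __hip_gpubin_handle = nullptr;
     }
   }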

Kernel launching
----------------
@@ -138,7 +138,7 @@ HIP-Clang supports kernel launching using either the CUDA ``<<<>>>`` syntax, ``h
When the executable or shared library is loaded by the dynamic linker, the initialization functions are called. In the initialization functions, the code objects containing all kernels are loaded when ``__hipRegisterFatBinary`` is called. When ``__hipRegisterFunction`` is called, the stub functions are associated with the corresponding kernels in the code objects.

HIP-Clang implements two sets of APIs for launching kernels.
By default, when HIP-Clang encounters the ``<<<>>>`` statement in the host code, it first calls ``hipConfigureCall`` to set up the threads and grids. It then calls the stub function with the given arguments. The stub function calls ``hipSetupArgument`` for each kernel argument, then calls ``hipLaunchByPtr`` with a function pointer to the stub function. In ``hipLaunchByPtr``, the real kernel associated with the stub function is launched.
By default, when HIP-Clang encounters the ``<<<>>>`` statement in the host code, it first calls ``hipConfigureCall`` to set up the threads and grids. It then calls the stub function with the given arguments. The stub function calls ``hipSetupArgument`` for each kernel argument, then calls ``hipLaunchByPtr`` with a function pointer to the stub function. In ``hipLaunchByPtr``, the actual kernel associated with the stub function is launched.
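For illustration, the following sketch shows roughly how a ``<<<>>>`` launch is lowered on this default path for a hypothetical kernel ``vectorScale(int n, float *data)``. The stub name and argument offsets are invented for the example; the real stub is generated by the compiler:

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // Hypothetical kernel: __global__ void vectorScale(int n, float *data);

   // Compiler-generated host stub (sketch): records each argument, then
   // launches the kernel associated with a pointer to this stub.
   void __device_stub_vectorScale(int n, float *data) {
     hipSetupArgument(&n, sizeof(n), 0);
     hipSetupArgument(&data, sizeof(data), 8);
     hipLaunchByPtr((const void *)&__device_stub_vectorScale);
   }

   void launchOnHost(dim3 grid, dim3 block, int n, float *data) {
     // What vectorScale<<<grid, block>>>(n, data); roughly expands to:
     hipConfigureCall(grid, block, 0 /*sharedMem*/, nullptr /*stream*/);
     __device_stub_vectorScale(n, data);
   }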

NVCC implementation notes
=========================
@@ -181,7 +181,7 @@ Compilation options

The ``hipModule_t`` interface does not support the ``cuModuleLoadDataEx`` function, which is used to control PTX compilation options.
HIP-Clang does not use PTX, so it does not support these compilation options.
In fact, HIP-Clang code objects always contain fully compiled code for a device-specific instruction set and do not require additional compilation as a part of the load step.
In fact, HIP-Clang code objects contain fully compiled code for a device-specific instruction set and don't require additional compilation as a part of the load step.
The corresponding HIP function ``hipModuleLoadDataEx`` behaves like ``hipModuleLoadData`` on the HIP-Clang path (where compilation options are not used) and like ``cuModuleLoadDataEx`` on the NVCC path.

For example:
13 changes: 9 additions & 4 deletions docs/how-to/stream_ordered_allocator.rst
@@ -24,7 +24,11 @@ Disadvantages of SOMA:
Using SOMA
=====================================

You can allocate memory using ``hipMallocAsync()`` with stream-ordered semantics. This restricts the asynchronous accesses to the memory to occur between the stream executions of the allocation and deallocation. Accessing memory outside this promised stream order can lead to undefined behavior such as use-before-allocation or use-after-free errors. The allocator might reallocate memory as long as the compliant memory accesses are guaranteed not to overlap temporally. ``hipFreeAsync()`` frees memory from the pool with stream-ordered semantics.
You can allocate memory using ``hipMallocAsync()`` with stream-ordered
semantics. This restricts asynchronous access to the memory to occur between
the stream executions of the allocation and the deallocation. Accessing memory
outside this promised stream order can lead to undefined behavior such as
use-before-allocation or use-after-free errors. The allocator might reallocate
the memory if the compliant memory accesses won't overlap temporally.
``hipFreeAsync()`` frees memory from the pool with stream-ordered semantics.

Here is how to use stream ordered memory allocation:

@@ -140,7 +144,7 @@ Set pools

The ``hipMallocAsync()`` function uses the current memory pool. You can also create additional pools with ``hipMemPoolCreate()`` and allocate from a specific pool with ``hipMallocFromPoolAsync()``.

Unlike NVIDIA CUDA, where stream-ordered memory allocation can be implicit, in AMD HIP, it's always explicit. This requires you to manage memory allocation for each stream in HIP while ensuring precise control over memory usage and synchronization.
Unlike NVIDIA CUDA, where stream-ordered memory allocation can be implicit, in ROCm HIP it is always explicit. This requires managing memory allocation for each stream in HIP while ensuring precise control over memory usage and synchronization.

.. code-block:: cpp
@@ -348,7 +352,7 @@ To reset these attributes to the current value, use ``hipMemPoolSetAttribute()``
Memory reuse policies
---------------------

The allocator might reallocate memory as long as the compliant memory accesses are guaranteed not to overlap temporally. To optimize the memory usage, disable or enable the following memory pool reuse policy attribute flags:
The allocator might reallocate memory as long as the compliant memory accesses won't overlap temporally. To optimize the memory usage, disable or enable the following memory pool reuse policy attribute flags:

- ``hipMemPoolReuseFollowEventDependencies``: Checks event dependencies before allocating additional GPU memory.
- ``hipMemPoolReuseAllowOpportunistic``: Checks freed allocations to determine if the stream order semantic indicated by the free operation has been met.
@@ -357,7 +361,7 @@ The allocator might reallocate memory as long as the compliant memory accesses a
Device accessibility for multi-GPU support
------------------------------------------

Allocations are initially accessible only from the device where they reside.
Allocations are initially accessible from the device where they reside.
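Access from other devices must be granted explicitly. The following minimal sketch (assuming two visible devices, 0 and 1, and omitting error checking) uses ``hipMemPoolSetAccess()`` to let a peer device read and write allocations made from a pool:

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // Sketch: let device 1 access allocations from device 0's default memory
   // pool. Check the returned hipError_t values in real code.
   void enablePeerPoolAccess() {
     hipMemPool_t pool;
     hipDeviceGetDefaultMemPool(&pool, 0);  // pool owned by device 0

     hipMemAccessDesc accessDesc = {};
     accessDesc.location.type = hipMemLocationTypeDevice;
     accessDesc.location.id = 1;  // the accessing (peer) device
     accessDesc.flags = hipMemAccessFlagsProtReadWrite;

     hipMemPoolSetAccess(pool, &accessDesc, 1);
   }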

Interprocess memory handling
=============================
@@ -466,6 +470,7 @@ Here is how to read the pool exported in the preceding example:
}
.. _shareable-handle:

Shareable handle
----------------

