Commit

PR feedbacks
WIP

neon60 committed Oct 4, 2024
1 parent a17d057 commit 6aa9ba9
Showing 3 changed files with 109 additions and 50 deletions.
40 changes: 35 additions & 5 deletions docs/how-to/hip_runtime_api/memory_management.rst
@@ -12,16 +12,46 @@ high-performance applications. Both allocating and copying memory can result in
bottlenecks, which can significantly impact performance.

The programming model is based on a system with a host and a device, each having
its own distinct memory. Kernels operate on device memory, while host functions operate on host memory.
The runtime
offers functions for allocating, freeing, and copying device memory, along
with transferring data between host and device memory.
its own distinct memory. Kernels operate on device memory, while host functions
operate on host memory.

How to manage the different memory types is described in the following chapters:
The runtime offers functions for allocating, freeing, and copying device memory,
along with transferring data between host and device memory.

These memory types are described on the following pages:

* :ref:`device_memory`
* :ref:`host_memory`

The different memory management approaches are described on the following pages:

* :ref:`coherence_control`
* :ref:`unified_memory`
* :ref:`virtual_memory`
* :ref:`stream_ordered_memory_allocator_how-to`

Memory allocation
================================================================================

The following API calls result in these allocations:

.. list-table:: Memory allocation
    :widths: 20, 20, 20, 20, 20
    :header-rows: 1
    :align: center

    * - API
      - System allocated
      - :cpp:func:`hipMallocManaged`
      - :cpp:func:`hipHostMalloc`
      - :cpp:func:`hipMalloc`
    * - Data location
      - Host
      - Host
      - Host
      - Device
    * - Allocation
      - :ref:`Pageable <pageable_host_memory>`
      - :ref:`Managed <unified_memory>`
      - :ref:`Pinned <pinned_host_memory>`
      - Pinned
89 changes: 59 additions & 30 deletions docs/how-to/hip_runtime_api/memory_management/coherence_control.rst
@@ -9,21 +9,21 @@
Coherence control
*******************************************************************************

Memory coherence describes how different parts of a system see the memory of a specific part of the system, e.g. how the CPU sees the GPU's memory or vice versa.
In HIP, host and device memory can be allocated with two different types of coherence:

* **Coarse-grained coherence** means that memory is only considered up to date at
kernel boundaries, which can be enforced through :cpp:func:`hipDeviceSynchronize`,
hipStreamSynchronize, or any blocking operation that acts on the null
stream (e.g. :cpp:func:`hipMemcpy`). For example, cacheable memory is a type of
coarse-grained memory where an up-to-date copy of the data can be stored
elsewhere (e.g. in an L2 cache).
* **Fine-grained coherence** means the coherence is supported while a CPU/GPU
kernel is running. This can be useful if both host and device are operating on
the same dataspace using system-scope atomic operations (e.g. updating an
error code or flag to a buffer). Fine-grained memory implies that up-to-date
data may be made visible to others regardless of kernel boundaries as
discussed above.
Memory coherence describes how different parts of a system see the memory of a
specific part of the system, for example how the CPU sees the GPU's memory or
vice versa. In HIP, host and device memory can be allocated with two different
types of coherence:

* **Coarse-grained coherence** means that memory is only considered up to date
  after synchronization, which can be enforced through :cpp:func:`hipDeviceSynchronize`,
  :cpp:func:`hipStreamSynchronize`, or any blocking operation that acts on the
  null stream (e.g. :cpp:func:`hipMemcpy`). One reason is that writes can land
  in caches that the other part of the system cannot access, so they become
  visible only once the caches have been flushed.
* **Fine-grained coherence** means the memory is coherent even while it is being
  modified by one of the parts of the system. Fine-grained coherence implies
  that up-to-date data is visible to others regardless of kernel boundaries.
  This can be useful if both host and device are operating on the same data.

.. note::

@@ -33,12 +33,41 @@ In HIP, host and device memory can be allocated with two different types of cohe

.. TODO: Is this still valid? What about Mi300?
Developers should use coarse-grained coherence where they can to reduce
host-device interconnect communication and also Mi200 accelerators hardware
based floating point instructions are working on coarse grained memory regions.
Developers should use coarse-grained coherence where they can, to reduce
host-device interconnect communication. In addition, the hardware-based
floating-point instructions of MI200 accelerators work on coarse-grained memory
regions.

The availability of fine- and coarse-grained memory pools can be checked with
``rocminfo``.
``rocminfo``:

.. code-block:: sh

   $ rocminfo
   ...
   *******
   Agent 1
   *******
   Name: AMD EPYC 7742 64-Core Processor
   ...
   Pool Info:
   Pool 1
   Segment: GLOBAL; FLAGS: FINE GRAINED
   ...
   Pool 3
   Segment: GLOBAL; FLAGS: COARSE GRAINED
   ...
   *******
   Agent 9
   *******
   Name: gfx90a
   ...
   Pool Info:
   Pool 1
   Segment: GLOBAL; FLAGS: COARSE GRAINED
   ...

The memory coherence control is described in the following table.

.. list-table:: Memory coherence control
:widths: 25, 35, 20, 20
@@ -84,17 +113,17 @@ The availability of fine- and coarse-grained memory pools can be checked with

:sup:`1` The :cpp:func:`hipHostMalloc` memory allocation coherence mode can be
affected by the ``HIP_HOST_COHERENT`` environment variable, if the
``hipHostMallocCoherent=0``, ``hipHostMallocNonCoherent=0``,
``hipHostMallocMapped=0`` and one of the other flag is set to 1. At this case,
if the ``HIP_HOST_COHERENT`` is not defined, or defined as 0, the host memory
allocation is coarse-grained.
``hipHostMallocCoherent``, ``hipHostMallocNonCoherent``, and
``hipHostMallocMapped`` flags are unset while one of the other flags is set. In
this case, if the ``HIP_HOST_COHERENT`` environment variable is not defined, or
is defined as 0, the host memory allocation is coarse-grained.

.. note::

* At ``hipHostMallocMapped=1`` case the allocated host memory is
* When the ``hipHostMallocMapped`` flag is set, the allocated host memory is
fine-grained and the ``hipHostMallocNonCoherent`` flag is ignored.
* The ``hipHostMallocCoherent=1`` and ``hipHostMallocNonCoherent=1`` state is
illegal.
* Setting both the ``hipHostMallocCoherent`` and ``hipHostMallocNonCoherent``
  flags is an illegal state.
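
The flag rules in the footnote and note above can be read as a small decision
function. The following is a minimal sketch in plain C++, not HIP code:
``HostMallocFlags``, ``Coherence`` and ``resolve_coherence`` are hypothetical
names introduced only for illustration. The sketch assumes that
``hipHostMallocCoherent`` selects fine-grained and ``hipHostMallocNonCoherent``
selects coarse-grained memory, and that any value of ``HIP_HOST_COHERENT`` other
than 0 selects fine-grained memory. The case where no flag at all is set is not
covered by the text above and is reported as such.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical model of the flags passed to hipHostMalloc (not the real bitmask).
struct HostMallocFlags {
    bool coherent = false;      // models hipHostMallocCoherent
    bool non_coherent = false;  // models hipHostMallocNonCoherent
    bool mapped = false;        // models hipHostMallocMapped
    bool other = false;         // any other flag, e.g. hipHostMallocPortable
};

enum class Coherence { FineGrained, CoarseGrained, Illegal, NotCovered };

// `env` is the value of HIP_HOST_COHERENT as returned by std::getenv,
// or nullptr when the variable is not defined.
Coherence resolve_coherence(const HostMallocFlags& f, const char* env) {
    // Requesting both coherence flags at once is an illegal state.
    if (f.coherent && f.non_coherent) return Coherence::Illegal;
    // Mapped memory is fine-grained; hipHostMallocNonCoherent is ignored.
    if (f.mapped) return Coherence::FineGrained;
    // Assumption: the explicit coherence flags select the mode directly.
    if (f.coherent) return Coherence::FineGrained;
    if (f.non_coherent) return Coherence::CoarseGrained;
    if (f.other) {
        // Only here does the environment variable decide.
        const bool unset_or_zero = (env == nullptr) || (std::strcmp(env, "0") == 0);
        // Assumption: any value other than 0 selects fine-grained memory.
        return unset_or_zero ? Coherence::CoarseGrained : Coherence::FineGrained;
    }
    return Coherence::NotCovered;  // no flag set: not described in the text above
}

// Usage sketch for the cases spelled out in the footnote and the note.
inline void example_usage() {
    HostMallocFlags f;
    f.other = true;  // e.g. only hipHostMallocPortable requested
    assert(resolve_coherence(f, nullptr) == Coherence::CoarseGrained);
    f.mapped = true;  // mapping the allocation makes it fine-grained
    assert(resolve_coherence(f, nullptr) == Coherence::FineGrained);
}
```

In an application, the second argument would typically be
``std::getenv("HIP_HOST_COHERENT")``.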

Visibility of synchronization functions
================================================================================
@@ -104,7 +133,7 @@ at coarse-grained coherence, it depends on the used synchronization function.
The effect and visibility of the synchronization functions on memory with
different coherence types are collected in the following table.

.. list-table:: HIP API
.. list-table:: Effect and visibility of HIP synchronization functions

* - HIP API
- :cpp:func:`hipStreamSynchronize`
@@ -141,13 +170,13 @@ Developers can control the release scope for :cpp:func:`hipEvents`:
A stronger system-level fence can be specified when the event is created with
:cpp:func:`hipEventCreateWithFlags`:

* :cpp:func:`hipEventReleaseToSystem`: Perform a system-scope release operation
* ``hipEventReleaseToSystem``: Perform a system-scope release operation
when the event is recorded. This will make **both fine-grained and
coarse-grained host memory visible to other agents in the system**, but may
involve heavyweight operations such as cache flushing. Fine-grained memory
will typically use lighter-weight in-kernel synchronization mechanisms such as
an atomic operation and thus does not need to use
:cpp:func:`hipEventReleaseToSystem`.
* :cpp:func:`hipEventDisableTiming`: Events created with this flag will not
``hipEventReleaseToSystem``.
* ``hipEventDisableTiming``: Events created with this flag will not
record profiling data and provide the best performance if used for
synchronization.
30 changes: 15 additions & 15 deletions docs/how-to/hip_runtime_api/memory_management/host_memory.rst
@@ -32,6 +32,8 @@ The difference between memory transfers of explicit memory management and unifie

Unified memory management is described further in :doc:`/how-to/hip_runtime_api/memory_management/unified_memory`.

.. _pageable_host_memory:

Pageable memory
================================================================================

@@ -105,22 +107,17 @@ application. The following example shows the pageable host memory usage in HIP.
also provides non-blocking versions :cpp:func:`hipMallocAsync` and
:cpp:func:`hipFreeAsync` which take in a stream as an additional argument.

.. _pinned_host_memory:

Pinned memory
================================================================================

Pinned memory (or page-locked memory, or non-pageable memory) is host memory
that is mapped into the address space of all GPUs, meaning that the pointer can
be used on both host and device. Accessing host-resident pinned memory in device
kernels is generally not recommended for performance, as it can force the data
to traverse the host-device interconnect (e.g. PCIe), which is much slower than
the on-device bandwidth (>40x on MI200).

Much like how a process can be locked to a CPU core by setting affinity, a
pinned memory allocator does this with the memory storage system. On multi-socket
systems it is important to ensure that pinned memory is located on the same
socket as the owning process, or else each cache line will be moved through the
CPU-CPU interconnect, thereby increasing latency and potentially decreasing
bandwidth.
Pinned memory (or page-locked memory) is stored in pages that are locked to
specific sectors in RAM and cannot be migrated. The pointer can be used on both
host and device. Accessing host-resident pinned memory in device kernels is
generally not recommended for performance, as it can force the data to traverse
the host-device interconnect (e.g. PCIe), which is much slower than
the on-device bandwidth.

The advantage of pinned memory is the improved transfer times between host and
device. For transfer operations, such as :cpp:func:`hipMemcpy` or :cpp:func:`hipMemcpyAsync`,
@@ -220,12 +217,15 @@ host memory:
``HIP_HOST_COHERENT`` environment variable for specific allocation. For
further details, check :ref:`coherence_control`.

All allocation flags are independent and can be used in most of the combination
without restriction, for instance, :cpp:func:`hipHostMalloc` can be called with both
All allocation flags are independent and can be combined freely, with one
exception. For example, :cpp:func:`hipHostMalloc` can be called with both
``hipHostMallocPortable`` and ``hipHostMallocMapped`` flags set. Both usage
models described above use the same allocation flags, and the difference is in
how the surrounding code uses the host memory.

The one exception is setting both the ``hipHostMallocCoherent`` and
``hipHostMallocNonCoherent`` flags, which is an illegal state.

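Because the allocation flags are combined with a bitwise OR, the one illegal
combination can be rejected before :cpp:func:`hipHostMalloc` is called. A
minimal sketch in plain C++, assuming hypothetical bit values: ``kPortable``,
``kMapped``, ``kCoherent``, ``kNonCoherent`` and ``is_legal_combination`` are
placeholders for illustration, not the constants from the HIP headers.

```cpp
#include <cstdint>

// Hypothetical bit values for illustration only; real code should use the
// hipHostMalloc* constants provided by the HIP headers.
constexpr std::uint32_t kPortable    = 1u << 0;  // stands in for hipHostMallocPortable
constexpr std::uint32_t kMapped      = 1u << 1;  // stands in for hipHostMallocMapped
constexpr std::uint32_t kCoherent    = 1u << 2;  // stands in for hipHostMallocCoherent
constexpr std::uint32_t kNonCoherent = 1u << 3;  // stands in for hipHostMallocNonCoherent

// Returns true unless both coherence flags are requested at the same time.
constexpr bool is_legal_combination(std::uint32_t flags) {
    return (flags & (kCoherent | kNonCoherent)) != (kCoherent | kNonCoherent);
}

// Portable and mapped can be combined without restriction.
static_assert(is_legal_combination(kPortable | kMapped),
              "portable and mapped may be combined");
// Coherent together with non-coherent is the single illegal state.
static_assert(!is_legal_combination(kCoherent | kNonCoherent),
              "coherent and non-coherent must not be combined");
```

The check is ``constexpr``, so a combination that is fixed at compile time can
be validated with ``static_assert`` as shown.
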
.. note::
By default, each GPU selects a NUMA CPU node that has the least NUMA distance