From 6aa9ba94e1b898c4dff0f168b7eb6404b87fcc18 Mon Sep 17 00:00:00 2001 From: Istvan Kiss Date: Fri, 4 Oct 2024 14:28:35 +0200 Subject: [PATCH] PR feedbacks WIP WIP --- .../hip_runtime_api/memory_management.rst | 40 +++++++-- .../memory_management/coherence_control.rst | 89 ++++++++++++------- .../memory_management/host_memory.rst | 30 +++---- 3 files changed, 109 insertions(+), 50 deletions(-) diff --git a/docs/how-to/hip_runtime_api/memory_management.rst b/docs/how-to/hip_runtime_api/memory_management.rst index ca6f8f96e3..49e5704689 100644 --- a/docs/how-to/hip_runtime_api/memory_management.rst +++ b/docs/how-to/hip_runtime_api/memory_management.rst @@ -12,16 +12,46 @@ high-performance applications. Both allocating and copying memory can result in bottlenecks, which can significantly impact performance. The programming model is based on a system with a host and a device, each having -its own distinct memory. Kernels operate on device memory, while host functions operate on host memory. -The runtime -offers functions for allocating, freeing, and copying device memory, along -with transferring data between host and device memory. +its own distinct memory. Kernels operate on device memory, while host functions +operate on host memory. -How to manage the different memory types is described in the following chapters: +The runtime offers functions for allocating, freeing, and copying device memory, +along with transferring data between host and device memory. + +The memory types are described on the following pages: * :ref:`device_memory` * :ref:`host_memory` + +The different memory management methods are described on the following pages: + * :ref:`coherence_control` * :ref:`unified_memory` * :ref:`virtual_memory` * :ref:`stream_ordered_memory_allocator_how-to` + +Memory allocation +================================================================================ + +The following API calls result in these allocations: + +.. 
list-table:: Memory allocation + :widths: 20, 20, 20, 20, 20 + :header-rows: 1 + :align: center + + * - API + - System allocated + - :cpp:func:`hipMallocManaged` + - :cpp:func:`hipHostMalloc` + - :cpp:func:`hipMalloc` + * - Data location + - Host + - Host + - Host + - Device + * - Allocation + - :ref:`Pageable ` + - :ref:`Managed ` + - :ref:`Pinned ` + - Pinned diff --git a/docs/how-to/hip_runtime_api/memory_management/coherence_control.rst b/docs/how-to/hip_runtime_api/memory_management/coherence_control.rst index 9b32fc1c45..69084c51ee 100644 --- a/docs/how-to/hip_runtime_api/memory_management/coherence_control.rst +++ b/docs/how-to/hip_runtime_api/memory_management/coherence_control.rst @@ -9,21 +9,21 @@ Coherence control ******************************************************************************* -Memory coherence describes how different parts of a system see the memory of a specific part of the system, e.g. how the CPU sees the GPUs memory or vice versa. -In HIP, host and device memory can be allocated with two different types of coherence: - -* **Coarse-grained coherence** means that memory is only considered up to date at - kernel boundaries, which can be enforced through :cpp:func:`hipDeviceSynchronize`, - hipStreamSynchronize, or any blocking operation that acts on the null - stream (e.g. :cpp:func:`hipMemcpy`). For example, cacheable memory is a type of - coarse-grained memory where an up-to-date copy of the data can be stored - elsewhere (e.g. in an L2 cache). -* **Fine-grained coherence** means the coherence is supported while a CPU/GPU - kernel is running. This can be useful if both host and device are operating on - the same dataspace using system-scope atomic operations (e.g. updating an - error code or flag to a buffer). Fine-grained memory implies that up-to-date - data may be made visible to others regardless of kernel boundaries as - discussed above. 
+Memory coherence describes how different parts of a system see the memory of a +specific part of the system, e.g. how the CPU sees the GPU's memory or vice versa. +In HIP, host and device memory can be allocated with two different types of +coherence: + +* **Coarse-grained coherence** means that memory is only considered up to date + after synchronization, which can be enforced through :cpp:func:`hipDeviceSynchronize`, + :cpp:func:`hipStreamSynchronize`, or any blocking operation that acts on the + null stream (e.g. :cpp:func:`hipMemcpy`). One reason for this can be writes to + caches that the other parts of the system can't access, so the writes are only + visible once the caches have been flushed. +* **Fine-grained coherence** means the memory is coherent even while it is being + modified by one of the parts of the system. Fine-grained coherence implies + that up-to-date data is visible to others regardless of kernel boundaries. + This can be useful if both host and device are operating on the same data. .. note:: @@ -33,12 +33,41 @@ In HIP, host and device memory can be allocated with two different types of cohe .. TODO: Is this still valid? What about Mi300? -Developers should use coarse-grained coherence where they can to reduce -host-device interconnect communication and also Mi200 accelerators hardware -based floating point instructions are working on coarse grained memory regions. +Developers should use coarse-grained coherence where they can, to reduce host-device +interconnect communication. In addition, the hardware-based floating-point +instructions of MI200 accelerators work on coarse-grained memory regions. The availability of fine- and coarse-grained memory pools can be checked with -``rocminfo``. +``rocminfo``: + +.. code-block:: sh + + $ rocminfo + ... + ******* + Agent 1 + ******* + Name: AMD EPYC 7742 64-Core Processor + ... + Pool Info: + Pool 1 + Segment: GLOBAL; FLAGS: FINE GRAINED + ... + Pool 3 + Segment: GLOBAL; FLAGS: COARSE GRAINED + ... 
+ ******* + Agent 9 + ******* + Name: gfx90a + ... + Pool Info: + Pool 1 + Segment: GLOBAL; FLAGS: COARSE GRAINED + ... + + +The memory coherence control is described in the following table. .. list-table:: Memory coherence control :widths: 25, 35, 20, 20 :header-rows: 1 @@ -84,17 +113,17 @@ The availability of fine- and coarse-grained memory pools can be checked with :sup:`1` The :cpp:func:`hipHostMalloc` memory allocation coherence mode can be affected by the ``HIP_HOST_COHERENT`` environment variable, if the -``hipHostMallocCoherent=0``, ``hipHostMallocNonCoherent=0``, -``hipHostMallocMapped=0`` and one of the other flag is set to 1. At this case, -if the ``HIP_HOST_COHERENT`` is not defined, or defined as 0, the host memory -allocation is coarse-grained. +``hipHostMallocCoherent``, ``hipHostMallocNonCoherent``, and ``hipHostMallocMapped`` +flags are unset while one of the other flags is set. In this case, if the +``HIP_HOST_COHERENT`` environment variable is not defined, or defined as 0, the +host memory allocation is coarse-grained. .. note:: - * At ``hipHostMallocMapped=1`` case the allocated host memory is + * When the ``hipHostMallocMapped`` flag is set, the allocated host memory is fine-grained and the ``hipHostMallocNonCoherent`` flag is ignored. - * The ``hipHostMallocCoherent=1`` and ``hipHostMallocNonCoherent=1`` state is - illegal. + * Setting both the ``hipHostMallocCoherent`` and + ``hipHostMallocNonCoherent`` flags is illegal. Visibility of synchronization functions ================================================================================ @@ -104,7 +133,7 @@ at coarse-grained coherence, it depends on the used synchronization function. The synchronization functions effect and visibility on different coherence memory types collected in the following table. -.. list-table:: HIP API +.. 
list-table:: Effect and visibility of HIP synchronization functions * - HIP API - :cpp:func:`hipStreamSynchronize` @@ -141,13 +170,13 @@ Developers can control the release scope for :cpp:func:`hipEvents`: A stronger system-level fence can be specified when the event is created with :cpp:func:`hipEventCreateWithFlags`: -* :cpp:func:`hipEventReleaseToSystem`: Perform a system-scope release operation +* ``hipEventReleaseToSystem``: Perform a system-scope release operation when the event is recorded. This will make **both fine-grained and coarse-grained host memory visible to other agents in the system**, but may involve heavyweight operations such as cache flushing. Fine-grained memory will typically use lighter-weight in-kernel synchronization mechanisms such as an atomic operation and thus does not need to use. - :cpp:func:`hipEventReleaseToSystem`. -* :cpp:func:`hipEventDisableTiming`: Events created with this flag will not + ``hipEventReleaseToSystem``. +* ``hipEventDisableTiming``: Events created with this flag will not record profiling data and provide the best performance if used for synchronization. diff --git a/docs/how-to/hip_runtime_api/memory_management/host_memory.rst b/docs/how-to/hip_runtime_api/memory_management/host_memory.rst index ad8b0ef124..829abc6d65 100644 --- a/docs/how-to/hip_runtime_api/memory_management/host_memory.rst +++ b/docs/how-to/hip_runtime_api/memory_management/host_memory.rst @@ -32,6 +32,8 @@ The difference between memory transfers of explicit memory management and unifie Unified memory management is described further in :doc:`/how-to/hip_runtime_api/memory_management/unified_memory`. +.. _pageable_host_memory: + Pageable memory ================================================================================ @@ -105,22 +107,17 @@ application. The following example shows the pageable host memory usage in HIP. 
also provides non-blocking versions :cpp:func:`hipMallocAsync` and :cpp:func:`hipFreeAsync` which take in a stream as an additional argument. +.. _pinned_host_memory: + Pinned memory ================================================================================ -Pinned memory (or page-locked memory, or non-pageable memory) is host memory -that is mapped into the address space of all GPUs, meaning that the pointer can -be used on both host and device. Accessing host-resident pinned memory in device -kernels is generally not recommended for performance, as it can force the data -to traverse the host-device interconnect (e.g. PCIe), which is much slower than -the on-device bandwidth (>40x on MI200). - -Much like how a process can be locked to a CPU core by setting affinity, a -pinned memory allocator does this with the memory storage system. On multi-socket -systems it is important to ensure that pinned memory is located on the same -socket as the owning process, or else each cache line will be moved through the -CPU-CPU interconnect, thereby increasing latency and potentially decreasing -bandwidth. +Pinned memory (or page-locked memory) is stored in pages that are locked to +specific sectors in RAM and cannot be migrated. The pointer can be used on both +host and device. Accessing host-resident pinned memory in device kernels is +generally not recommended for performance, as it can force the data to traverse +the host-device interconnect (e.g. PCIe), which is much slower than +the on-device bandwidth. Advantage of pinned memory is the improved transfer times between host and device. For transfer operations, such as :cpp:func:`hipMemcpy` or :cpp:func:`hipMemcpyAsync`, @@ -220,12 +217,15 @@ host memory: ``HIP_HOST_COHERENT`` environment variable for specific allocation. For further details, check :ref:`coherence_control`. 
-All allocation flags are independent and can be used in most of the combination -without restriction, for instance, :cpp:func:`hipHostMalloc` can be called with both +All allocation flags are independent and can be used in any combination, +with one exception. For example, :cpp:func:`hipHostMalloc` can be called with both ``hipHostMallocPortable`` and ``hipHostMallocMapped`` flags set. Both usage models described above use the same allocation flags, and the difference is in how the surrounding code uses the host memory. +The one exception is setting both the ``hipHostMallocCoherent`` and +``hipHostMallocNonCoherent`` flags, which is an illegal state. + .. note:: By default, each GPU selects a Numa CPU node that has the least Numa distance