From 6aa9ba94e1b898c4dff0f168b7eb6404b87fcc18 Mon Sep 17 00:00:00 2001
From: Istvan Kiss <neon60@gmail.com>
Date: Fri, 4 Oct 2024 14:28:35 +0200
Subject: [PATCH] PR feedback

WIP

WIP
---
 .../hip_runtime_api/memory_management.rst     | 40 +++++++--
 .../memory_management/coherence_control.rst   | 89 ++++++++++++-------
 .../memory_management/host_memory.rst         | 30 +++----
 3 files changed, 109 insertions(+), 50 deletions(-)

diff --git a/docs/how-to/hip_runtime_api/memory_management.rst b/docs/how-to/hip_runtime_api/memory_management.rst
index ca6f8f96e3..49e5704689 100644
--- a/docs/how-to/hip_runtime_api/memory_management.rst
+++ b/docs/how-to/hip_runtime_api/memory_management.rst
@@ -12,16 +12,46 @@ high-performance applications. Both allocating and copying memory can result in
 bottlenecks, which can significantly impact performance.
 
 The programming model is based on a system with a host and a device, each having
-its own distinct memory. Kernels operate on device memory, while host functions operate on host memory.
-The runtime
-offers functions for allocating, freeing, and copying device memory, along
-with transferring data between host and device memory.
+its own distinct memory. Kernels operate on device memory, while host functions
+operate on host memory.
 
-How to manage the different memory types is described in the following chapters:
+The runtime offers functions for allocating, freeing, and copying device memory,
+along with transferring data between host and device memory.
+
+The following pages describe these memory types:
 
 * :ref:`device_memory`
 * :ref:`host_memory`
+
+The different memory management approaches are described in the following pages:
+
 * :ref:`coherence_control`
 * :ref:`unified_memory`
 * :ref:`virtual_memory`
 * :ref:`stream_ordered_memory_allocator_how-to`
+
+Memory allocation
+================================================================================
+
+The following API calls will result in these allocations:
+
+.. list-table:: Memory allocation
+    :widths: 20, 20, 20, 20, 20
+    :header-rows: 1
+    :align: center
+
+    * - API
+      - System allocated
+      - :cpp:func:`hipMallocManaged`
+      - :cpp:func:`hipHostMalloc`
+      - :cpp:func:`hipMalloc`
+    * - Data location
+      - Host
+      - Host
+      - Host
+      - Device
+    * - Allocation
+      - :ref:`Pageable <pageable_host_memory>`
+      - :ref:`Managed <unified_memory>`
+      - :ref:`Pinned <pinned_host_memory>`
+      - Pinned
diff --git a/docs/how-to/hip_runtime_api/memory_management/coherence_control.rst b/docs/how-to/hip_runtime_api/memory_management/coherence_control.rst
index 9b32fc1c45..69084c51ee 100644
--- a/docs/how-to/hip_runtime_api/memory_management/coherence_control.rst
+++ b/docs/how-to/hip_runtime_api/memory_management/coherence_control.rst
@@ -9,21 +9,21 @@
 Coherence control
 *******************************************************************************
 
-Memory coherence describes how different parts of a system see the memory of a specific part of the system, e.g. how the CPU sees the GPUs memory or vice versa.
-In HIP, host and device memory can be allocated with two different types of coherence:
-
-* **Coarse-grained coherence** means that memory is only considered up to date at 
-  kernel boundaries, which can be enforced through :cpp:func:`hipDeviceSynchronize`,
-  hipStreamSynchronize, or any blocking operation that acts on the null
-  stream (e.g. :cpp:func:`hipMemcpy`). For example, cacheable memory is a type of
-  coarse-grained memory where an up-to-date copy of the data can be stored
-  elsewhere (e.g. in an L2 cache).
-* **Fine-grained coherence** means the coherence is supported while a CPU/GPU 
-  kernel is running. This can be useful if both host and device are operating on
-  the same dataspace using system-scope atomic operations (e.g. updating an
-  error code or flag to a buffer). Fine-grained memory implies that up-to-date
-  data may be made visible to others regardless of kernel boundaries as
-  discussed above.
+Memory coherence describes how different parts of a system see the memory of a
+specific part of the system, e.g. how the CPU sees the GPU's memory or vice versa.
+In HIP, host and device memory can be allocated with two different types of
+coherence:
+
+* **Coarse-grained coherence** means that memory is only considered up to date
+  after synchronization, which can be enforced through :cpp:func:`hipDeviceSynchronize`,
+  :cpp:func:`hipStreamSynchronize`, or any blocking operation that acts on the
+  null stream (e.g. :cpp:func:`hipMemcpy`). One reason for this is that writes
+  can go to caches that other parts of the system can't access, so they only
+  become visible once the caches have been flushed.
+* **Fine-grained coherence** means the memory is coherent even while it is being
+  modified by one of the parts of the system. Fine-grained coherence implies
+  that up-to-date data is visible to others regardless of kernel boundaries.
+  This can be useful if both host and device are operating on the same data.
 
 .. note::
 
@@ -33,12 +33,41 @@ In HIP, host and device memory can be allocated with two different types of cohe
 
 .. TODO: Is this still valid? What about Mi300?
 
-Developers should use coarse-grained coherence where they can to reduce
-host-device interconnect communication and also Mi200 accelerators hardware
-based floating point instructions are working on coarse grained memory regions.
+Developers should use coarse-grained coherence where possible to reduce
+host-device interconnect communication. Also, on MI200 accelerators,
+hardware-based floating-point instructions work on coarse-grained memory regions.
 
 The availability of fine- and coarse-grained memory pools can be checked with
-``rocminfo``.
+``rocminfo``:
+
+.. code-block:: sh
+
+  $ rocminfo
+  ...
+  *******
+  Agent 1
+  *******
+  Name:                    AMD EPYC 7742 64-Core Processor
+  ...
+  Pool Info:
+  Pool 1
+  Segment:                 GLOBAL; FLAGS: FINE GRAINED
+  ...
+  Pool 3
+  Segment:                 GLOBAL; FLAGS: COARSE GRAINED
+  ...
+  *******
+  Agent 9
+  *******
+  Name:                    gfx90a
+  ...
+  Pool Info:
+  Pool 1
+  Segment:                 GLOBAL; FLAGS: COARSE GRAINED
+  ...
+
+
+The memory coherence control is described in the following table.
 
 .. list-table:: Memory coherence control
     :widths: 25, 35, 20, 20
@@ -84,17 +113,17 @@ The availability of fine- and coarse-grained memory pools can be checked with
 
 :sup:`1` The :cpp:func:`hipHostMalloc` memory allocation coherence mode can be
 affected by the ``HIP_HOST_COHERENT`` environment variable, if the 
-``hipHostMallocCoherent=0``, ``hipHostMallocNonCoherent=0``,
-``hipHostMallocMapped=0`` and one of the other flag is set to 1. At this case,
-if the ``HIP_HOST_COHERENT`` is not defined, or defined as 0, the host memory
-allocation is coarse-grained.
+``hipHostMallocCoherent``, ``hipHostMallocNonCoherent``, and
+``hipHostMallocMapped`` flags are unset while one of the other flags is set. In
+this case, if the ``HIP_HOST_COHERENT`` environment variable is not defined, or
+is defined as 0, the host memory allocation is coarse-grained.
 
 .. note::
 
-  * At ``hipHostMallocMapped=1`` case the allocated host memory is 
+  * When the ``hipHostMallocMapped`` flag is set, the allocated host memory is
     fine-grained and the ``hipHostMallocNonCoherent`` flag is ignored.
-  * The ``hipHostMallocCoherent=1`` and ``hipHostMallocNonCoherent=1`` state is
-    illegal. 
+  * Setting both the ``hipHostMallocCoherent`` and
+    ``hipHostMallocNonCoherent`` flags is illegal.
 
 Visibility of synchronization functions
 ================================================================================
@@ -104,7 +133,7 @@ at coarse-grained coherence, it depends on the used synchronization function.
 The synchronization functions effect and visibility on different coherence 
 memory types collected in the following table.
 
-.. list-table:: HIP API
+.. list-table:: Effect and visibility of HIP synchronization functions
 
     * - HIP API
       - :cpp:func:`hipStreamSynchronize`
@@ -141,13 +170,13 @@ Developers can control the release scope for :cpp:func:`hipEvents`:
 A stronger system-level fence can be specified when the event is created with 
 :cpp:func:`hipEventCreateWithFlags`:
 
-* :cpp:func:`hipEventReleaseToSystem`: Perform a system-scope release operation
+* ``hipEventReleaseToSystem``: Perform a system-scope release operation
   when the event is recorded. This will make **both fine-grained and
   coarse-grained host memory visible to other agents in the system**, but may
   involve heavyweight operations such as cache flushing. Fine-grained memory
   will typically use lighter-weight in-kernel synchronization mechanisms such as
   an atomic operation and thus does not need to use.
-  :cpp:func:`hipEventReleaseToSystem`.
-* :cpp:func:`hipEventDisableTiming`: Events created with this flag will not
+  ``hipEventReleaseToSystem``.
+* ``hipEventDisableTiming``: Events created with this flag will not
   record profiling data and provide the best performance if used for
   synchronization.
diff --git a/docs/how-to/hip_runtime_api/memory_management/host_memory.rst b/docs/how-to/hip_runtime_api/memory_management/host_memory.rst
index ad8b0ef124..829abc6d65 100644
--- a/docs/how-to/hip_runtime_api/memory_management/host_memory.rst
+++ b/docs/how-to/hip_runtime_api/memory_management/host_memory.rst
@@ -32,6 +32,8 @@ The difference between memory transfers of explicit memory management and unifie
 
 Unified memory management is described further in :doc:`/how-to/hip_runtime_api/memory_management/unified_memory`.
 
+.. _pageable_host_memory:
+
 Pageable memory
 ================================================================================
 
@@ -105,22 +107,17 @@ application. The following example shows the pageable host memory usage in HIP.
   also provides non-blocking versions :cpp:func:`hipMallocAsync` and
   :cpp:func:`hipFreeAsync` which take in a stream as an additional argument.
 
+.. _pinned_host_memory:
+
 Pinned memory
 ================================================================================
 
-Pinned memory (or page-locked memory, or non-pageable memory) is host memory
-that is mapped into the address space of all GPUs, meaning that the pointer can
-be used on both host and device. Accessing host-resident pinned memory in device
-kernels is generally not recommended for performance, as it can force the data
-to traverse the host-device interconnect (e.g. PCIe), which is much slower than
-the on-device bandwidth (>40x on MI200).
-
-Much like how a process can be locked to a CPU core by setting affinity, a
-pinned memory allocator does this with the memory storage system. On multi-socket
-systems it is important to ensure that pinned memory is located on the same
-socket as the owning process, or else each cache line will be moved through the
-CPU-CPU interconnect, thereby increasing latency and potentially decreasing
-bandwidth.
+Pinned memory (or page-locked memory) is stored in pages that are locked to
+specific sectors in RAM and cannot be migrated. The pointer can be used on both
+host and device. Accessing host-resident pinned memory in device kernels is
+generally not recommended for performance, as it can force the data to traverse
+the host-device interconnect (e.g. PCIe), which is much slower than
+the on-device bandwidth.
 
 Advantage of pinned memory is the improved transfer times between host and
 device. For transfer operations, such as :cpp:func:`hipMemcpy` or :cpp:func:`hipMemcpyAsync`,
@@ -220,12 +217,15 @@ host memory:
   ``HIP_HOST_COHERENT`` environment variable for specific allocation. For
   further details, check :ref:`coherence_control`.
 
-All allocation flags are independent and can be used in most of the combination
-without restriction, for instance, :cpp:func:`hipHostMalloc` can be called with both
+All allocation flags are independent and can be used in any combination,
+with one exception. For example, :cpp:func:`hipHostMalloc` can be called with both
 ``hipHostMallocPortable`` and ``hipHostMallocMapped`` flags set. Both usage
 models described above use the same allocation flags, and the difference is in
 how the surrounding code uses the host memory.
 
+The one exception is setting both the ``hipHostMallocCoherent`` and
+``hipHostMallocNonCoherent`` flags, which is an illegal state.
+
 .. note:: 
   
   By default, each GPU selects a Numa CPU node that has the least Numa distance