feat: initial AMD GPU Support (#223)

* feat: initial AMD GPU Support

After looking into flash attention support, it turns out only a few cards are supported. This check will prevent errors from appearing during the install. Everything still works without it.

style: fix

updates to amd_go_fast to fit with coding standards

* feat: `--amd` flag for AMD-specific optimizations

* tests: reqs.rocm.txt consistency with reqs.txt check

* style: fix

* chore: update pre-commit torch pin

* docs: improved readme; note improved amd support

---------

Co-authored-by: tazlin <[email protected]>
niales and tazlin authored Jul 8, 2024
1 parent 9f20abf commit 7d5c4f2
Showing 18 changed files with 315 additions and 34 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -38,7 +38,7 @@ repos:
- python-dotenv
- aiohttp
- horde_safety==0.2.3
- torch==2.2.2
- torch==2.3.1
- ruamel.yaml
- horde_engine==2.11.1
- horde_sdk==0.10.0
36 changes: 20 additions & 16 deletions README.md
@@ -8,23 +8,27 @@ If you want the latest information or have questions, come to [the #local-worker

This repo contains the latest implementation for the [AI Horde](https://aihorde.net) Worker. This will turn your graphics card(s) into a worker for the AI Horde, where you will create images for others. In turn, you will earn 'kudos', which will give you priority for your own generations.

## Important Info

Please note that **AMD cards are not currently well supported**, but may be in the future. If you are willing to try with your AMD card, join the [discord discussion](https://discord.com/channels/781145214752129095/1076124012305993768).
- **An SSD is strongly recommended** especially if you are offering more than one model.
  - If you only have an HDD available to you, you can only offer one model and will have to be able to load 3-8gb off disk within 60 seconds or the worker will not function.
- Do not set threads higher than 2 unless you have a data-center grade card (48gb+ VRAM).
- Your memory usage will increase in line with the number of queued jobs (`queue_size` in the config).
  - If you have **less than 32gb of system ram**, you should stick to `queue_size: 1`.
  - If you have **less than 16gb of system ram** or you experience frequent memory-related crashes:
    - Do not offer SDXL/SD21 models. You can do this by adding `ALL SDXL` and `ALL SD21` to your `models_to_skip` if you are using the `TOP N` model load option to automatically remove these heavier models from your offerings.
    - Set `allow_post_processing` and `allow_controlnet` to false.
    - Set `queue_size: 0`.
- If you plan on running SDXL, you will need to ensure at least 9 gb of system ram remains free while the worker is running.
  - If you have an 8 gb card, SDXL will only reliably work at max_power values close to 32. 42 was too high for tests on a 2080 in certain cases.

### AMD
~~Please note that **AMD cards are not currently well supported**, but may be in the future.~~

## Some important details you should know before you start
> Update: **AMD** has now been shown to have better support, but for **Linux machines only** - Linux must be installed on the bare-metal machine; Windows systems, WSL, and Linux containers still do not work. You can now follow this guide using `horde-bridge-rocm.sh` and `update-runtime-rocm.sh` where appropriate.
If you are willing to try with your AMD card, join the [discord discussion](https://discord.com/channels/781145214752129095/1076124012305993768).
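The snippet below is not part of the worker; it is a minimal sketch (assuming a ROCm build of PyTorch is already installed, e.g. via `update-runtime-rocm.sh`) for checking that your AMD card is actually visible to PyTorch before you start bridging:

```python
import torch

print(torch.version.hip)             # a HIP version string on ROCm builds of PyTorch, None on CUDA builds
print(torch.cuda.is_available())     # ROCm devices are exposed through the torch.cuda API
print(torch.cuda.get_device_name())  # should mention "AMD" or "Radeon"
```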

- If you are upgrading from `AI-Horde-Worker`, you will have to manually move your models folder to the `horde-worker-reGen` folder. This folder may be named `models` or `nataili` (depending on when you installed) and should contain a folder named `compvis`.
- We recommend you start with a fresh bridge data file (`bridgeData_template.yaml` -> `bridgeData.yaml`). See Configure section
- When submitting debug information **do not publish `.log` files in the discord server channels - send them to tazlin directly** as we cannot guarantee that your API key is not in them (though this warning should relax over time).
- Do not set threads higher than 2.
- Your memory usage will increase in line with the number of queued jobs. You should set your queue size to at least 1.
- If you have a low amount of **system** memory (16gb or under), do not attempt a queue size greater than 1 if you have more than one model set to load.
- If you plan on running SDXL, you will need to ensure at least 9 gb of system ram remains free.
- If you have an 8 gb card, SDXL will only reliably work at max_power values close to 32. 42 was too high for tests on a 2080 in certain cases.
- **An SSD is strongly recommended** especially if you are offering more than one model.
- If you only have an HDD available to you, you can only offer one model and will have to be able to load 3-8gb off disk within 60 seconds or the worker will not function.

# Installing

**Please see the prior section before proceeding.**
@@ -88,13 +92,13 @@ Continue with the [Basic Usage](#Basic-Usage) instructions
The instructions below refer to `horde-bridge` or `update-runtime`. Depending on your OS, append `.cmd` for Windows or `.sh` for Linux
- for example, `horde-bridge.cmd` and `update-runtime.cmd` for Windows

> Note: If you have an **AMD** card, you should use `horde-bridge-rocm.sh` and `update-runtime-rocm.sh` where appropriate

You can double-click the provided script files below from a file explorer or run them from a terminal like `bash` or `cmd`, depending on your OS. The latter option will allow you to **see errors in case of a crash**, so it is recommended.
### Configure
#### Manually
1. Make a copy of `bridgeData_template.yaml` to `bridgeData.yaml`
1. Edit `bridgeData.yaml` and follow the instructions within to fill in your details.
@@ -112,7 +116,7 @@ You can double click the provided script files below from a file explorer or run
#### Stopping the worker
* In the terminal in which it's running, simply press `Ctrl+C` together.
* In the terminal in which it's running, press `Ctrl+C` together.
* The worker will finish the current jobs before exiting.


9 changes: 9 additions & 0 deletions environment.rocm.yaml
@@ -0,0 +1,9 @@
name: ldm
channels:
- conda-forge
- defaults
# These should only contain the minimal essentials to get the binaries going, everything else is managed in requirements.txt to keep it universal.
dependencies:
- git
- pip
- python==3.11.6
49 changes: 49 additions & 0 deletions horde-bridge-rocm.sh
@@ -0,0 +1,49 @@
#!/bin/bash
# Get the directory of the current script
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"

# Build the absolute path to the Conda environment
CONDA_ENV_PATH="$SCRIPT_DIR/conda/envs/linux/lib"

# Add the Conda environment to LD_LIBRARY_PATH
export LD_LIBRARY_PATH="$CONDA_ENV_PATH:$LD_LIBRARY_PATH"

# Set torch garbage cleanup. AMD defaults cause problems.
export PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:2048

# List of directories to check
dirs=(
    "/usr/lib"
    "/usr/local/lib"
    "/lib"
    "/lib64"
    "/usr/lib/x86_64-linux-gnu"
)

# Check each directory
for dir in "${dirs[@]}"; do
    if [ -f "$dir/libjemalloc.so.2" ]; then
        export LD_PRELOAD="$dir/libjemalloc.so.2"
        printf "Using jemalloc from %s\n" "$dir"
        break
    fi
done

# If jemalloc was not found, print a warning
if [ -z "$LD_PRELOAD" ]; then
    printf "WARNING: jemalloc not found. You may run into memory issues! We recommend running 'sudo apt install libjemalloc2'\n"
    # Press q to quit or any other key to continue
    read -n 1 -s -r -p "Press q to quit or any other key to continue: " key
    if [ "$key" = "q" ]; then
        printf "\n"
        exit 1
    fi
fi


if ./runtime-rocm.sh python -s download_models.py; then
    echo "Model Download OK. Starting worker..."
    ./runtime-rocm.sh python -s run_worker.py --amd "$@"
else
    echo "download_models.py exited with error code. Aborting"
fi
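The script above preloads jemalloc via `LD_PRELOAD` when it can find it. A quick Linux-only sanity check (illustrative, not part of this commit) that the preload actually took effect for the worker's Python process:

```python
# Illustrative only: inspect this process's memory maps to see whether jemalloc was preloaded.
from pathlib import Path

maps = Path("/proc/self/maps").read_text()
print("jemalloc loaded" if "libjemalloc" in maps else "jemalloc NOT loaded")
```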
37 changes: 37 additions & 0 deletions horde_worker_regen/amd_go_fast/amd_go_fast.py
@@ -0,0 +1,37 @@
import torch
from loguru import logger

if "AMD" in torch.cuda.get_device_name() or "Radeon" in torch.cuda.get_device_name():
try: # this import is handled via script, skipping it in mypy. If this fails somehow the module will simply not run.
from flash_attn import flash_attn_func # type: ignore

sdpa = torch.nn.functional.scaled_dot_product_attention

def sdpa_hijack(query, key, value, attn_mask=None, dropout_p=0.0, is_causal=False, scale=None):
if query.shape[3] <= 128 and attn_mask is None and query.dtype != torch.float32:
hidden_states = flash_attn_func(
q=query.transpose(1, 2),
k=key.transpose(1, 2),
v=value.transpose(1, 2),
dropout_p=dropout_p,
causal=is_causal,
softmax_scale=scale,
).transpose(1, 2)
else:
hidden_states = sdpa(
query=query,
key=key,
value=value,
attn_mask=attn_mask,
dropout_p=dropout_p,
is_causal=is_causal,
scale=scale,
)
return hidden_states

torch.nn.functional.scaled_dot_product_attention = sdpa_hijack
logger.debug("# # # AMD GO FAST # # #")
except ImportError as e:
logger.debug(f"# # # AMD GO SLOW {e} # # #")
else:
logger.debug(f"# # # AMD GO SLOW Could not detect AMD GPU from: {torch.cuda.get_device_name()} # # #")
20 changes: 20 additions & 0 deletions horde_worker_regen/amd_go_fast/install_amd_go_fast.sh
@@ -0,0 +1,20 @@
#!/bin/bash

# Determine if the user has a flash attention supported card.
SUPPORTED_CARD=$(rocminfo | grep -c -e gfx1100 -e gfx1101 -e gfx1102)

if [ "$SUPPORTED_CARD" -gt 0 ]; then
if ! python -s -m pip install -U git+https://github.com/ROCm/flash-attention@howiejay/navi_support; then
echo "Tried to install flash attention and failed!"
else
echo "Installed flash attn."
PY_SITE_DIR=$(python -c "import sysconfig; print(sysconfig.get_path('purelib'))")
if ! cp horde_worker_regen/amd_go_fast/amd_go_fast.py "${PY_SITE_DIR}"/hordelib/nodes/; then
echo "Failed to install AMD GO FAST."
else
echo "Installed AMD GO FAST."
fi
fi
else
echo "Did not detect support for AMD GO FAST"
fi
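The same capability check expressed in Python, in case it helps to see what the `grep` above is looking for. The helper name is hypothetical and this is only a sketch; the commit itself relies on the shell check:

```python
import subprocess

def has_flash_attn_capable_amd_gpu() -> bool:
    """Return True if rocminfo reports a Navi 3x (gfx1100/gfx1101/gfx1102) GPU."""
    try:
        info = subprocess.run(["rocminfo"], capture_output=True, text=True, check=True).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return False
    return any(arch in info for arch in ("gfx1100", "gfx1101", "gfx1102"))
```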
3 changes: 3 additions & 0 deletions horde_worker_regen/process_management/main_entry_point.py
@@ -10,12 +10,15 @@ def start_working(
    ctx: BaseContext,
    bridge_data: reGenBridgeData,
    horde_model_reference_manager: ModelReferenceManager,
    *,
    amd_gpu: bool = False,
) -> None:
    """Create and start process manager."""
    process_manager = HordeWorkerProcessManager(
        ctx=ctx,
        bridge_data=bridge_data,
        horde_model_reference_manager=horde_model_reference_manager,
        amd_gpu=amd_gpu,
    )

    process_manager.start()
15 changes: 14 additions & 1 deletion horde_worker_regen/process_management/process_manager.py
@@ -1006,6 +1006,8 @@ def num_total_processes(self) -> int:

    _lru: LRUCache

    _amd_gpu: bool

    def __init__(
        self,
        *,
@@ -1016,6 +1018,7 @@ def __init__(
        target_vram_overhead_bytes_map: Mapping[int, int] | None = None,  # FIXME
        max_safety_processes: int = 1,
        max_download_processes: int = 1,
        amd_gpu: bool = False,
    ) -> None:
        """Initialise the process manager.
@@ -1031,6 +1034,7 @@ def __init__(
                Defaults to 1.
            max_download_processes (int, optional): The maximum number of download processes that can run at once. \
                Defaults to 1.
            amd_gpu (bool, optional): Whether or not the GPU is an AMD GPU. Defaults to False.
        """
        self.session_start_time = time.time()

@@ -1051,6 +1055,8 @@ def __init__(
        self.max_inference_processes = self.bridge_data.queue_size + self.bridge_data.max_threads
        self._lru = LRUCache(self.max_inference_processes)

        self._amd_gpu = amd_gpu

        # If there is only one model to load and only one inference process, then we can only run one job at a time
        # and there is no point in having more than one inference process
        if len(self.bridge_data.image_models_to_load) == 1 and self.max_concurrent_inference_processes == 1:
@@ -1268,6 +1274,10 @@ def start_safety_processes(self) -> None:
                self._disk_lock,
                cpu_only,
            ),
            kwargs={
                "high_memory_mode": self.bridge_data.high_memory_mode,
                "amd_gpu": self._amd_gpu,
            },
        )

        process.start()
@@ -1325,7 +1335,10 @@ def _start_inference_process(self, pid: int) -> HordeProcessInfo:
                self._disk_lock,
                self._aux_model_lock,
            ),
            kwargs={"high_memory_mode": self.bridge_data.high_memory_mode},
            kwargs={
                "high_memory_mode": self.bridge_data.high_memory_mode,
                "amd_gpu": self._amd_gpu,
            },
        )
        process.start()
        # Add the process to the process map
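A rough sketch (names are assumptions, and the real `HordeWorkerProcessManager` does considerably more) of the mechanism the hunks above rely on: the new `amd_gpu` flag is simply forwarded as a keyword argument when each child process is spawned.

```python
import multiprocessing

def child_entry_point(pid: int, *, high_memory_mode: bool = False, amd_gpu: bool = False) -> None:
    # Stand-in for start_inference_process / start_safety_process.
    print(f"process {pid}: high_memory_mode={high_memory_mode}, amd_gpu={amd_gpu}")

if __name__ == "__main__":
    ctx = multiprocessing.get_context("spawn")
    process = ctx.Process(
        target=child_entry_point,
        args=(0,),
        kwargs={"high_memory_mode": False, "amd_gpu": True},
    )
    process.start()
    process.join()
```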
38 changes: 28 additions & 10 deletions horde_worker_regen/process_management/worker_entry_points.py
@@ -21,6 +21,7 @@ def start_inference_process(
    aux_model_lock: Lock,
    *,
    high_memory_mode: bool = False,
    amd_gpu: bool = False,
) -> None:
    """Start an inference process.
@@ -32,6 +33,8 @@ def start_inference_process(
        disk_lock (Lock): The lock to use for disk access.
        aux_model_lock (Lock): The lock to use for auxiliary model downloading.
        high_memory_mode (bool, optional): If true, the process will attempt to use more memory. Defaults to False.
        amd_gpu (bool, optional): If true, the process will attempt to use AMD GPU-specific optimisations.
            Defaults to False.
    """
    with contextlib.nullcontext():  # contextlib.redirect_stdout(None), contextlib.redirect_stderr(None):
        logger.remove()
@@ -46,22 +49,26 @@ def start_inference_process(
                verbosity_count=5,  # FIXME
            )

            logger.debug(f"Initialising hordelib with process_id={process_id} and high_memory_mode={high_memory_mode}")
            logger.debug(
                f"Initialising hordelib with process_id={process_id}, high_memory_mode={high_memory_mode} "
                f"and amd_gpu={amd_gpu}",
            )

            extra_comfyui_args = ["--disable-smart-memory"]

            if amd_gpu:
                extra_comfyui_args.append("--use-pytorch-cross-attention")

            if high_memory_mode:
                extra_comfyui_args.append("--highvram")

            with logger.catch(reraise=True):
                hordelib.initialise(
                    setup_logging=None,
                    process_id=process_id,
                    logging_verbosity=0,
                    force_normal_vram_mode=not high_memory_mode,
                    extra_comfyui_args=(
                        ["--disable-smart-memory"]
                        if not high_memory_mode
                        else [
                            "--disable-smart-memory",
                            "--highvram",
                        ]
                    ),
                    extra_comfyui_args=extra_comfyui_args,
                )
        except Exception as e:
            logger.critical(f"Failed to initialise hordelib: {type(e).__name__} {e}")
@@ -89,6 +96,7 @@ def start_safety_process(
    cpu_only: bool = True,
    *,
    high_memory_mode: bool = False,
    amd_gpu: bool = False,
) -> None:
    """Start a safety process.
@@ -99,6 +107,8 @@ def start_safety_process(
        disk_lock (Lock): The lock to use for disk access.
        cpu_only (bool, optional): If true, the process will not use the GPU. Defaults to True.
        high_memory_mode (bool, optional): If true, the process will attempt to use more memory. Defaults to False.
        amd_gpu (bool, optional): If true, the process will attempt to use AMD GPU-specific optimisations.
            Defaults to False.
    """
    with contextlib.nullcontext():  # contextlib.redirect_stdout(), contextlib.redirect_stderr():
        logger.remove()
@@ -115,12 +125,20 @@ def start_safety_process(

logger.debug(f"Initialising hordelib with process_id={process_id} and high_memory_mode={high_memory_mode}")

extra_comfyui_args = ["--disable-smart-memory"]

if amd_gpu:
extra_comfyui_args.append("--use-pytorch-cross-attention")

if high_memory_mode:
extra_comfyui_args.append("--highvram")

with logger.catch(reraise=True):
hordelib.initialise(
setup_logging=None,
process_id=process_id,
logging_verbosity=0,
extra_comfyui_args=["--disable-smart-memory"],
extra_comfyui_args=extra_comfyui_args,
)
except Exception as e:
logger.critical(f"Failed to initialise hordelib: {type(e).__name__} {e}")
10 changes: 9 additions & 1 deletion horde_worker_regen/run_worker.py
@@ -12,7 +12,7 @@
from loguru import logger


def main(ctx: BaseContext, load_from_env_vars: bool = False) -> None:
def main(ctx: BaseContext, load_from_env_vars: bool = False, *, amd_gpu: bool = False) -> None:
"""Check for a valid config and start the driver ('main') process for the reGen worker."""
from horde_model_reference.model_reference_manager import ModelReferenceManager
from pydantic import ValidationError
@@ -91,6 +91,7 @@ def ensure_model_db_downloaded() -> ModelReferenceManager:
        ctx=ctx,
        bridge_data=bridge_data,
        horde_model_reference_manager=horde_model_reference_manager,
        amd_gpu=amd_gpu,
    )


@@ -136,6 +137,13 @@ def init() -> None:
        default=False,
        help="Load the config only from environment variables. This is useful for running the worker in a container.",
    )
    parser.add_argument(
        "--amd",
        "--amd-gpu",
        action="store_true",
        default=False,
        help="Enable AMD GPU-specific optimisations",
    )

    args = parser.parse_args()
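End to end, the flag flows from the CLI into `main()` and on through `start_working()` to each spawned process. A minimal sketch of the argparse behaviour (the surrounding wiring is simplified here):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--amd",
    "--amd-gpu",
    action="store_true",
    default=False,
    help="Enable AMD GPU-specific optimisations",
)

args = parser.parse_args(["--amd"])
print(args.amd)  # True -> passed along as main(..., amd_gpu=args.amd)
```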
