feat: initial AMD GPU Support (#223)

* feat: initial AMD GPU Support

After looking into flash attention support, it turns out only a few cards are supported. This check will prevent errors from appearing during the install. Everything still works without it.

style: fix

updates to amd_go_fast to fit with coding standards

* feat: `--amd` flag for AMD-specific optimizations

* tests: reqs.rocm.txt consistency with reqs.txt check

* style: fix

* chore: update pre-commit torch pin

* docs: improved readme; note improved amd support

---------

Co-authored-by: tazlin <[email protected]>
niales and tazlin authored Jul 8, 2024
1 parent 9f20abf commit 7d5c4f2
Showing 18 changed files with 315 additions and 34 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -38,7 +38,7 @@ repos:
- python-dotenv
- aiohttp
- horde_safety==0.2.3
- torch==2.2.2
- torch==2.3.1
- ruamel.yaml
- horde_engine==2.11.1
- horde_sdk==0.10.0
36 changes: 20 additions & 16 deletions README.md
@@ -8,23 +8,27 @@ If you want the latest information or have questions, come to [the #local-worker

This repo contains the latest implementation for the [AI Horde](https://aihorde.net) Worker. This will turn your graphics card(s) into a worker for the AI Horde, where you will create images for others. In turn, you will earn 'kudos', which will give you priority for your own generations.

## Important Info

Please note that **AMD cards are not currently well supported**, but may be in the future. If you are willing to try with your AMD card, join the [discord discussion](https://discord.com/channels/781145214752129095/1076124012305993768).
- **An SSD is strongly recommended** especially if you are offering more than one model.
  - If you only have an HDD available to you, you can only offer one model and will have to be able to load 3-8gb off disk within 60 seconds or the worker will not function.
- Do not set threads higher than 2 unless you have a data-center grade card (48gb+ VRAM).
- Your memory usage will increase in line with the number of queued jobs (`queue_size` in the config).
  - If you have **less than 32gb of system ram**, you should stick to `queue_size: 1`.
  - If you have **less than 16gb of system ram** or you experience frequent memory-related crashes:
    - Do not offer SDXL/SD21 models. You can do this by adding `ALL SDXL` and `ALL SD21` to your `models_to_skip` if you are using the `TOP N` model load option to automatically remove these heavier models from your offerings.
    - Set `allow_post_processing` and `allow_controlnet` to false.
    - Set `queue_size: 0`.
- If you plan on running SDXL, you will need to ensure at least 9 gb of system ram remains free while the worker is running.
  - If you have an 8 gb card, SDXL will only reliably work at max_power values close to 32. 42 was too high for tests on a 2080 in certain cases.

### AMD
~~Please note that **AMD cards are not currently well supported**, but may be in the future.~~

## Some important details you should know before you start
> Update: **AMD** has now been shown to have better support, but for **Linux machines only** - Linux must be installed on the bare-metal machine; Windows systems, WSL, and Linux containers still do not work. You can now follow this guide using `horde-bridge-rocm.sh` and `update-runtime-rocm.sh` where appropriate.
If you are willing to try with your AMD card, join the [discord discussion](https://discord.com/channels/781145214752129095/1076124012305993768).
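The snippet below is not part of the worker; it is a minimal sketch (assuming a ROCm build of PyTorch is already installed, e.g. via `update-runtime-rocm.sh`) for checking that your AMD card is actually visible to PyTorch before you start bridging:

```python
import torch

print(torch.version.hip)             # a HIP version string on ROCm builds of PyTorch, None on CUDA builds
print(torch.cuda.is_available())     # ROCm devices are exposed through the torch.cuda API
print(torch.cuda.get_device_name())  # should mention "AMD" or "Radeon"
```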

- If you are upgrading from `AI-Horde-Worker`, you will have to manually move your models folder to the `horde-worker-reGen` folder. This folder may be named `models` or `nataili` (depending on when you installed) and should contain a folder named `compvis`.
- We recommend you start with a fresh bridge data file (`bridgeData_template.yaml` -> `bridgeData.yaml`). See Configure section
- When submitting debug information **do not publish `.log` files in the discord server channels - send them to tazlin directly** as we cannot guarantee that your API key is not in them (though this warning should relax over time).
- Do not set threads higher than 2.
- Your memory usage will increase in line with the number of queued jobs. You should set your queue size to at least 1.
- If you have a low amount of **system** memory (16gb or under), do not attempt a queue size greater than 1 if you have more than one model set to load.
- If you plan on running SDXL, you will need to ensure at least 9 gb of system ram remains free.
- If you have an 8 gb card, SDXL will only reliably work at max_power values close to 32. 42 was too high for tests on a 2080 in certain cases.
- **An SSD is strongly recommended** especially if you are offering more than one model.
- If you only have an HDD available to you, you can only offer one model and will have to be able to load 3-8gb off disk within 60 seconds or the worker will not function.

# Installing

**Please see the prior section before proceeding.**
@@ -88,13 +92,13 @@ Continue with the [Basic Usage](#Basic-Usage) instructions
The instructions below refer to `horde-bridge` or `update-runtime`. Depending on your OS, append `.cmd` for Windows or `.sh` for Linux
- for example, `horde-bridge.cmd` and `update-runtime.cmd` for Windows

> Note: If you have an **AMD** card, you should use `horde-bridge-rocm.sh` and `update-runtime-rocm.sh` where appropriate

You can double-click the provided script files below from a file explorer or run them from a terminal like `bash` or `cmd`, depending on your OS. The latter option will allow you to **see errors in case of a crash**, so it is recommended.
### Configure
#### Manually
1. Make a copy of `bridgeData_template.yaml` to `bridgeData.yaml`
1. Edit `bridgeData.yaml` and follow the instructions within to fill in your details.
@@ -112,7 +116,7 @@ You can double click the provided script files below from a file explorer or run
#### Stopping the worker
* In the terminal in which it's running, simply press `Ctrl+C` together.
* In the terminal in which it's running, press `Ctrl+C` together.
* The worker will finish the current jobs before exiting.


9 changes: 9 additions & 0 deletions environment.rocm.yaml
@@ -0,0 +1,9 @@
name: ldm
channels:
- conda-forge
- defaults
# These should only contain the minimal essentials to get the binaries going, everything else is managed in requirements.txt to keep it universal.
dependencies:
- git
- pip
- python==3.11.6
49 changes: 49 additions & 0 deletions horde-bridge-rocm.sh
@@ -0,0 +1,49 @@
#!/bin/bash
# Get the directory of the current script
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"

# Build the absolute path to the Conda environment
CONDA_ENV_PATH="$SCRIPT_DIR/conda/envs/linux/lib"

# Add the Conda environment to LD_LIBRARY_PATH
export LD_LIBRARY_PATH="$CONDA_ENV_PATH:$LD_LIBRARY_PATH"

# Set torch garbage cleanup. AMD defaults cause problems.
export PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:2048

# List of directories to check
dirs=(
    "/usr/lib"
    "/usr/local/lib"
    "/lib"
    "/lib64"
    "/usr/lib/x86_64-linux-gnu"
)

# Check each directory
for dir in "${dirs[@]}"; do
    if [ -f "$dir/libjemalloc.so.2" ]; then
        export LD_PRELOAD="$dir/libjemalloc.so.2"
        printf "Using jemalloc from %s\n" "$dir"
        break
    fi
done

# If jemalloc was not found, print a warning
if [ -z "$LD_PRELOAD" ]; then
    printf "WARNING: jemalloc not found. You may run into memory issues! We recommend running 'sudo apt install libjemalloc2'\n"
    # Press q to quit or any other key to continue
    read -n 1 -s -r -p "Press q to quit or any other key to continue: " key
    if [ "$key" = "q" ]; then
        printf "\n"
        exit 1
    fi
fi


if ./runtime-rocm.sh python -s download_models.py; then
    echo "Model Download OK. Starting worker..."
    ./runtime-rocm.sh python -s run_worker.py --amd "$@"
else
    echo "download_models.py exited with error code. Aborting"
fi
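The script above preloads jemalloc via `LD_PRELOAD` when it can find it. A quick Linux-only sanity check (illustrative, not part of this commit) that the preload actually took effect for the worker's Python process:

```python
# Illustrative only: inspect this process's memory maps to see whether jemalloc was preloaded.
from pathlib import Path

maps = Path("/proc/self/maps").read_text()
print("jemalloc loaded" if "libjemalloc" in maps else "jemalloc NOT loaded")
```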
37 changes: 37 additions & 0 deletions horde_worker_regen/amd_go_fast/amd_go_fast.py
@@ -0,0 +1,37 @@
import torch
from loguru import logger

if "AMD" in torch.cuda.get_device_name() or "Radeon" in torch.cuda.get_device_name():
try: # this import is handled via script, skipping it in mypy. If this fails somehow the module will simply not run.
from flash_attn import flash_attn_func # type: ignore

sdpa = torch.nn.functional.scaled_dot_product_attention

def sdpa_hijack(query, key, value, attn_mask=None, dropout_p=0.0, is_causal=False, scale=None):
if query.shape[3] <= 128 and attn_mask is None and query.dtype != torch.float32:
hidden_states = flash_attn_func(
q=query.transpose(1, 2),
k=key.transpose(1, 2),
v=value.transpose(1, 2),
dropout_p=dropout_p,
causal=is_causal,
softmax_scale=scale,
).transpose(1, 2)
else:
hidden_states = sdpa(
query=query,
key=key,
value=value,
attn_mask=attn_mask,
dropout_p=dropout_p,
is_causal=is_causal,
scale=scale,
)
return hidden_states

torch.nn.functional.scaled_dot_product_attention = sdpa_hijack
logger.debug("# # # AMD GO FAST # # #")
except ImportError as e:
logger.debug(f"# # # AMD GO SLOW {e} # # #")
else:
logger.debug(f"# # # AMD GO SLOW Could not detect AMD GPU from: {torch.cuda.get_device_name()} # # #")
20 changes: 20 additions & 0 deletions horde_worker_regen/amd_go_fast/install_amd_go_fast.sh
@@ -0,0 +1,20 @@
#!/bin/bash

# Determine if the user has a flash attention supported card.
SUPPORTED_CARD=$(rocminfo | grep -c -e gfx1100 -e gfx1101 -e gfx1102)

if [ "$SUPPORTED_CARD" -gt 0 ]; then
if ! python -s -m pip install -U git+https://github.com/ROCm/flash-attention@howiejay/navi_support; then
echo "Tried to install flash attention and failed!"
else
echo "Installed flash attn."
PY_SITE_DIR=$(python -c "import sysconfig; print(sysconfig.get_path('purelib'))")
if ! cp horde_worker_regen/amd_go_fast/amd_go_fast.py "${PY_SITE_DIR}"/hordelib/nodes/; then
echo "Failed to install AMD GO FAST."
else
echo "Installed AMD GO FAST."
fi
fi
else
echo "Did not detect support for AMD GO FAST"
fi
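The same capability check expressed in Python, in case it helps to see what the `grep` above is looking for. The helper name is hypothetical and this is only a sketch; the commit itself relies on the shell check:

```python
import subprocess

def has_flash_attn_capable_amd_gpu() -> bool:
    """Return True if rocminfo reports a Navi 3x (gfx1100/gfx1101/gfx1102) GPU."""
    try:
        info = subprocess.run(["rocminfo"], capture_output=True, text=True, check=True).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return False
    return any(arch in info for arch in ("gfx1100", "gfx1101", "gfx1102"))
```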
3 changes: 3 additions & 0 deletions horde_worker_regen/process_management/main_entry_point.py
@@ -10,12 +10,15 @@ def start_working(
    ctx: BaseContext,
    bridge_data: reGenBridgeData,
    horde_model_reference_manager: ModelReferenceManager,
    *,
    amd_gpu: bool = False,
) -> None:
    """Create and start process manager."""
    process_manager = HordeWorkerProcessManager(
        ctx=ctx,
        bridge_data=bridge_data,
        horde_model_reference_manager=horde_model_reference_manager,
        amd_gpu=amd_gpu,
    )

    process_manager.start()
15 changes: 14 additions & 1 deletion horde_worker_regen/process_management/process_manager.py
@@ -1006,6 +1006,8 @@ def num_total_processes(self) -> int:

    _lru: LRUCache

    _amd_gpu: bool

    def __init__(
        self,
        *,
@@ -1016,6 +1018,7 @@ def __init__(
        target_vram_overhead_bytes_map: Mapping[int, int] | None = None,  # FIXME
        max_safety_processes: int = 1,
        max_download_processes: int = 1,
        amd_gpu: bool = False,
    ) -> None:
        """Initialise the process manager.
@@ -1031,6 +1034,7 @@ def __init__(
                Defaults to 1.
            max_download_processes (int, optional): The maximum number of download processes that can run at once. \
                Defaults to 1.
            amd_gpu (bool, optional): Whether or not the GPU is an AMD GPU. Defaults to False.
        """
        self.session_start_time = time.time()

@@ -1051,6 +1055,8 @@ def __init__(
        self.max_inference_processes = self.bridge_data.queue_size + self.bridge_data.max_threads
        self._lru = LRUCache(self.max_inference_processes)

        self._amd_gpu = amd_gpu

        # If there is only one model to load and only one inference process, then we can only run one job at a time
        # and there is no point in having more than one inference process
        if len(self.bridge_data.image_models_to_load) == 1 and self.max_concurrent_inference_processes == 1:
@@ -1268,6 +1274,10 @@ def start_safety_processes(self) -> None:
                self._disk_lock,
                cpu_only,
            ),
            kwargs={
                "high_memory_mode": self.bridge_data.high_memory_mode,
                "amd_gpu": self._amd_gpu,
            },
        )

        process.start()
@@ -1325,7 +1335,10 @@ def _start_inference_process(self, pid: int) -> HordeProcessInfo:
                self._disk_lock,
                self._aux_model_lock,
            ),
            kwargs={"high_memory_mode": self.bridge_data.high_memory_mode},
            kwargs={
                "high_memory_mode": self.bridge_data.high_memory_mode,
                "amd_gpu": self._amd_gpu,
            },
        )
        process.start()
        # Add the process to the process map
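A rough sketch (names are assumptions, and the real `HordeWorkerProcessManager` does considerably more) of the mechanism the hunks above rely on: the new `amd_gpu` flag is simply forwarded as a keyword argument when each child process is spawned.

```python
import multiprocessing

def child_entry_point(pid: int, *, high_memory_mode: bool = False, amd_gpu: bool = False) -> None:
    # Stand-in for start_inference_process / start_safety_process.
    print(f"process {pid}: high_memory_mode={high_memory_mode}, amd_gpu={amd_gpu}")

if __name__ == "__main__":
    ctx = multiprocessing.get_context("spawn")
    process = ctx.Process(
        target=child_entry_point,
        args=(0,),
        kwargs={"high_memory_mode": False, "amd_gpu": True},
    )
    process.start()
    process.join()
```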
38 changes: 28 additions & 10 deletions horde_worker_regen/process_management/worker_entry_points.py
@@ -21,6 +21,7 @@ def start_inference_process(
    aux_model_lock: Lock,
    *,
    high_memory_mode: bool = False,
    amd_gpu: bool = False,
) -> None:
    """Start an inference process.
@@ -32,6 +33,8 @@ def start_inference_process(
        disk_lock (Lock): The lock to use for disk access.
        aux_model_lock (Lock): The lock to use for auxiliary model downloading.
        high_memory_mode (bool, optional): If true, the process will attempt to use more memory. Defaults to False.
        amd_gpu (bool, optional): If true, the process will attempt to use AMD GPU-specific optimisations.
            Defaults to False.
    """
    with contextlib.nullcontext():  # contextlib.redirect_stdout(None), contextlib.redirect_stderr(None):
        logger.remove()
@@ -46,22 +49,26 @@ def start_inference_process(
                verbosity_count=5,  # FIXME
            )

            logger.debug(f"Initialising hordelib with process_id={process_id} and high_memory_mode={high_memory_mode}")
            logger.debug(
                f"Initialising hordelib with process_id={process_id}, high_memory_mode={high_memory_mode} "
                f"and amd_gpu={amd_gpu}",
            )

            extra_comfyui_args = ["--disable-smart-memory"]

            if amd_gpu:
                extra_comfyui_args.append("--use-pytorch-cross-attention")

            if high_memory_mode:
                extra_comfyui_args.append("--highvram")

            with logger.catch(reraise=True):
                hordelib.initialise(
                    setup_logging=None,
                    process_id=process_id,
                    logging_verbosity=0,
                    force_normal_vram_mode=not high_memory_mode,
                    extra_comfyui_args=(
                        ["--disable-smart-memory"]
                        if not high_memory_mode
                        else [
                            "--disable-smart-memory",
                            "--highvram",
                        ]
                    ),
                    extra_comfyui_args=extra_comfyui_args,
                )
        except Exception as e:
            logger.critical(f"Failed to initialise hordelib: {type(e).__name__} {e}")
@@ -89,6 +96,7 @@ def start_safety_process(
    cpu_only: bool = True,
    *,
    high_memory_mode: bool = False,
    amd_gpu: bool = False,
) -> None:
    """Start a safety process.
@@ -99,6 +107,8 @@ def start_safety_process(
        disk_lock (Lock): The lock to use for disk access.
        cpu_only (bool, optional): If true, the process will not use the GPU. Defaults to True.
        high_memory_mode (bool, optional): If true, the process will attempt to use more memory. Defaults to False.
        amd_gpu (bool, optional): If true, the process will attempt to use AMD GPU-specific optimisations.
            Defaults to False.
    """
    with contextlib.nullcontext():  # contextlib.redirect_stdout(), contextlib.redirect_stderr():
        logger.remove()
@@ -115,12 +125,20 @@ def start_safety_process(

logger.debug(f"Initialising hordelib with process_id={process_id} and high_memory_mode={high_memory_mode}")

extra_comfyui_args = ["--disable-smart-memory"]

if amd_gpu:
extra_comfyui_args.append("--use-pytorch-cross-attention")

if high_memory_mode:
extra_comfyui_args.append("--highvram")

with logger.catch(reraise=True):
hordelib.initialise(
setup_logging=None,
process_id=process_id,
logging_verbosity=0,
extra_comfyui_args=["--disable-smart-memory"],
extra_comfyui_args=extra_comfyui_args,
)
except Exception as e:
logger.critical(f"Failed to initialise hordelib: {type(e).__name__} {e}")
10 changes: 9 additions & 1 deletion horde_worker_regen/run_worker.py
@@ -12,7 +12,7 @@
from loguru import logger


def main(ctx: BaseContext, load_from_env_vars: bool = False) -> None:
def main(ctx: BaseContext, load_from_env_vars: bool = False, *, amd_gpu: bool = False) -> None:
"""Check for a valid config and start the driver ('main') process for the reGen worker."""
from horde_model_reference.model_reference_manager import ModelReferenceManager
from pydantic import ValidationError
@@ -91,6 +91,7 @@ def ensure_model_db_downloaded() -> ModelReferenceManager:
        ctx=ctx,
        bridge_data=bridge_data,
        horde_model_reference_manager=horde_model_reference_manager,
        amd_gpu=amd_gpu,
    )


@@ -136,6 +137,13 @@ def init() -> None:
        default=False,
        help="Load the config only from environment variables. This is useful for running the worker in a container.",
    )
    parser.add_argument(
        "--amd",
        "--amd-gpu",
        action="store_true",
        default=False,
        help="Enable AMD GPU-specific optimisations",
    )

    args = parser.parse_args()
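End to end, the flag flows from the CLI into `main()` and on through `start_working()` to each spawned process. A minimal sketch of the argparse behaviour (the surrounding wiring is simplified here):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--amd",
    "--amd-gpu",
    action="store_true",
    default=False,
    help="Enable AMD GPU-specific optimisations",
)

args = parser.parse_args(["--amd"])
print(args.amd)  # True -> passed along as main(..., amd_gpu=args.amd)
```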
