feat: latest torch/comfyui; perf improvements; fix: SSL cert issues #309

tazlin · 2024-10-04T13:48:55Z

New Features/Updates

- Updated PyTorch version to 2.4.1 and the CUDA default to cu124.
- Added the `very_fast_disk_mode` configuration option for concurrent model loading.
  - The default is `false`.
  - With `very_fast_disk_mode: false`, a worker loads only one model at a time when a model is being explicitly preloaded. In some cases it may still attempt to load more than one, but this should happen far less often.
- Updated horde dependencies to the latest versions.
  - fix: ignore numba bytecode dumps; config for ignored messages hordelib#342
  - feat: comfyui 3bb4dec; torch >=2.4.1 and cu124 by default hordelib#343
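As a rough illustration of the `very_fast_disk_mode` behavior described above, here is a minimal sketch of a gate that allows only one preload at a time unless the option is enabled. The `PreloadGate` class and its method names are hypothetical and are not taken from the worker's code.

```python
from dataclasses import dataclass, field


@dataclass
class PreloadGate:
    """Hypothetical helper: decides whether another model preload may start."""

    # Mirrors the config option; False means "one preload at a time".
    very_fast_disk_mode: bool = False
    # Names of models that are currently being preloaded.
    loading: set[str] = field(default_factory=set)

    def may_start(self, model_name: str) -> bool:
        # Never start the same model twice.
        if model_name in self.loading:
            return False
        # With very_fast_disk_mode, concurrent preloads are allowed.
        # Otherwise a preload may start only when nothing else is loading.
        return self.very_fast_disk_mode or not self.loading

    def start(self, model_name: str) -> bool:
        if not self.may_start(model_name):
            return False
        self.loading.add(model_name)
        return True

    def finish(self, model_name: str) -> None:
        self.loading.discard(model_name)
```

With the default setting, a second `start(...)` call returns `False` until the first model calls `finish(...)`; with `very_fast_disk_mode=True`, both calls succeed.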

Fixes and Improvements

- Improved the stability and performance of `high_performance_mode`.
  - Jobs that are expected to be brief no longer block job pops. Additionally, less time is spent waiting in general when this mode is on.
- Improved the stability and performance of `max_threads` values greater than one.
  - xx90 series cards will likely see a large improvement with `max_threads: 2` and a bit of tuning.
    - Important: You will almost certainly want `high_performance_mode` if you have an xx90 card.
  - Note that cascade and flux, as well as `high_memory_mode`, can still lead to additional instability with threads at 2.
  - xx80 series cards may benefit from `max_threads: 2` in SD1.5-only setups without controlnets/post-processing or in other conservative configurations.
- Improved process management with enhanced deadlock detection and handling.
  - In particular, hang-ups where all of the processes were available and waiting should be more readily detected and corrected.
- Optimized image processing by using `rawpng` directly, reducing redundant operations.
  - The repeated call to `PIL.Image.open(...)` was highly inefficient, especially for very large images.
  - The already-encoded PNG sent from ComfyUI is used instead.
- Added an SSL context using `certifi` to resolve certificate resolution issues.
- Updated documentation to reflect changes in CUDA version and new configuration options.
- Fixed a bug where `download_models.py` would not exit if the compvis models failed to download. This would cause the worker to crash unexpectedly, as it expects the models to be available on worker start.
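The download fix above comes down to returning a non-zero exit code so that a calling script can stop before starting the worker. A minimal sketch of that pattern follows; `run_download_step` and the `download` callable are hypothetical stand-ins, not the actual contents of `download_models.py`.

```python
import sys
from typing import Callable


def run_download_step(download: Callable[[], bool]) -> int:
    """Return a process exit code: 0 on success, 1 on any failure."""
    try:
        ok = download()
    except Exception as err:  # treat any error as a failed download
        print(f"Model download raised an error: {err}", file=sys.stderr)
        return 1
    if not ok:
        print("Model download failed; not starting the worker.", file=sys.stderr)
        return 1
    return 0
```

A script entry point could then call `sys.exit(run_download_step(my_download_function))`, where `my_download_function` is whatever performs the download, so that a wrapper script can check the exit status.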

To see the specific tasks where the Asana app for GitHub is being used, see below:
- https://app.asana.com/0/0/1208486016776942

This is tied to a bug with an unclear root cause, but its practical effect is that the root signing certificate was not being found on a relatively fresh Windows 10 Pro machine. `certifi` should already be pulled in as a dependency, but I've marked it as an explicit requirement, and I anticipate no side effects on machines which were previously running fine.
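For reference, a minimal sketch of building an SSL context from the `certifi` CA bundle is shown below. The `make_ssl_context` name is hypothetical, and the `ImportError` fallback exists only so the sketch runs where `certifi` is not installed; the PR itself lists `certifi` as an explicit requirement.

```python
import ssl

try:
    import certifi
except ImportError:  # fallback only so this sketch runs without certifi
    certifi = None


def make_ssl_context() -> ssl.SSLContext:
    """Build a verifying SSL context, preferring certifi's CA bundle."""
    if certifi is not None:
        # certifi.where() returns the path to its bundled CA certificates.
        return ssl.create_default_context(cafile=certifi.where())
    # Without certifi, fall back to the system's default trust store.
    return ssl.create_default_context()
```

The resulting context can be passed to HTTP clients that accept an `ssl.SSLContext`, so certificate verification does not depend on the operating system's certificate store.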

tazlin · 2024-10-04T17:52:36Z

@CodiumAI-Agent /describe

CodiumAI-Agent · 2024-10-04T17:53:35Z

Title

feat: latest torch/comfyui; perf improvements; fix: SSL cert issues

User description

todo

PR Type

enhancement, bug fix, documentation

Description

Updated PyTorch version to 2.4.1 and CUDA to cu124 across multiple files, including Dockerfiles and runtime scripts.
Added very_fast_disk_mode configuration option for concurrent model loading.
Improved process management with enhanced deadlock detection and handling.
Optimized image processing by using rawpng directly, reducing redundant operations.
Updated horde dependencies to the latest versions.
Added SSL context using certifi to resolve certificate issues.
Updated documentation to reflect changes in CUDA version and new configuration options.

Changes walkthrough 📝

Relevant files

Miscellaneous

3 files

| File | Change | Path | Summary | Diff |
|---|---|---|---|---|
| `__init__.py` | Version bump to 9.1.0 | `horde_worker_regen/__init__.py` | Updated version from `9.0.7` to `9.1.0`. | +1/-1 |
| `_version_meta.json` | Update recommended version to 9.1.0 | `horde_worker_regen/_version_meta.json` | Updated recommended version to `9.1.0`. | +1/-1 |
| `pyproject.toml` | Update project version to 9.1.0 | `pyproject.toml` | Updated project version to `9.1.0`. | +1/-1 |

Enhancement

3 files

| File | Change | Path | Summary | Diff |
|---|---|---|---|---|
| `data_model.py` | Add very_fast_disk_mode configuration option | `horde_worker_regen/bridge_data/data_model.py` | Added `very_fast_disk_mode` configuration option. | +3/-0 |
| `inference_process.py` | Optimize image processing and improve logging | `horde_worker_regen/process_management/inference_process.py` | Improved logging for model preloading. Optimized image processing by using `rawpng` directly. | +4/-6 |
| `process_manager.py` | Enhance process management and SSL handling | `horde_worker_regen/process_management/process_manager.py` | Added SSL context using `certifi`. Enhanced process management with new methods and properties. Improved deadlock detection and handling. | +165/-14 |

Dependencies

11 files

| File | Change | Path | Summary | Diff |
|---|---|---|---|---|
| `update-runtime.sh` | Update PyTorch and CUDA versions in runtime script | `update-runtime.sh` | Updated PyTorch version to 2.4.1 and CUDA to cu124. | +2/-2 |
| `horde-bridge.cmd` | Update horde dependencies in Windows script | `horde-bridge.cmd` | Updated horde dependencies to latest versions. | +1/-1 |
| `update-runtime.cmd` | Update PyTorch and CUDA versions in Windows runtime script | `update-runtime.cmd` | Updated PyTorch version to 2.4.1 and CUDA to cu124. | +3/-3 |
| `.pre-commit-config.yaml` | Update dependencies in pre-commit configuration | `.pre-commit-config.yaml` | Updated dependencies to latest versions. | +4/-4 |
| `Dockerfile.12.1.1-22.04` | Update Dockerfile with new CUDA version and dependencies | `Dockerfiles/Dockerfile.12.1.1-22.04` | Updated CUDA version to cu124. Added installation of `opencv-python-headless`. | +3/-3 |
| `Dockerfile.12.2.2-22.04` | Update Dockerfile with new CUDA version and dependencies | `Dockerfiles/Dockerfile.12.2.2-22.04` | Updated CUDA version to cu124. Added installation of `opencv-python-headless`. | +3/-3 |
| `Dockerfile.12.3.2-22.04` | Update Dockerfile with new CUDA version and dependencies | `Dockerfiles/Dockerfile.12.3.2-22.04` | Updated CUDA version to cu124. Added installation of `opencv-python-headless`. | +3/-3 |
| `Dockerfile.12.4-22.04` | Add Dockerfile for CUDA 12.4 | `Dockerfiles/Dockerfile.12.4-22.04` | Added new Dockerfile for CUDA 12.4. | +30/-0 |
| `requirements.rocm.txt` | Update ROCm requirements with latest dependencies | `requirements.rocm.txt` | Updated torch and horde dependencies to latest versions. | +4/-4 |
| `requirements.txt` | Update requirements with latest dependencies and SSL fix | `requirements.txt` | Updated torch and horde dependencies to latest versions. Added `certifi` for SSL certificate resolution. | +7/-5 |
| `tox.ini` | Update test environment with new PyTorch version | `tox.ini` | Updated PyTorch version in test environment to cu124. | +1/-1 |

Documentation

2 files

| File | Change | Path | Summary | Diff |
|---|---|---|---|---|
| `README_advanced.md` | Update README with new CUDA version instructions | `README_advanced.md` | Updated PyTorch installation instructions to use CUDA cu124. | +1/-1 |
| `bridgeData_template.yaml` | Add very_fast_disk_mode to configuration template | `bridgeData_template.yaml` | Added `very_fast_disk_mode` option to configuration template. | +4/-0 |

This was incomplete and masked by 22.04 also using 3.11 as its Python version. `python3.11-dev` is technically optional, but it is included because it's needed on the AMD side. venv creation HAS to be called with the full version; otherwise the dist default is used. pip only needs to be updated inside the venv.

tazlin added 21 commits October 3, 2024 13:46

fix: don't run image.save() twice

66f2cd3

This already functionally happens in hordelib. We can use the bytestream passed from it directly.
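A minimal sketch of the idea, assuming the upstream library hands over already-encoded PNG bytes: the bytes are validated and passed along as they are instead of being decoded and saved again. The function name and the base64 step are assumptions for illustration and are not taken from the worker's code.

```python
import base64

# Every valid PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def encode_png_for_submit(rawpng: bytes) -> str:
    """Base64-encode already-encoded PNG bytes without re-encoding the image."""
    if not rawpng.startswith(PNG_SIGNATURE):
        raise ValueError("expected PNG-encoded bytes")
    # No image decode or second save: the original bytes are used directly.
    return base64.b64encode(rawpng).decode("ascii")
```

Skipping the decode and re-save avoids work proportional to the image size, which matters most for very large images.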

feat: use torch 2.4.1 and cu124 by default

10386d9

feat: use latest horde deps w/ latest comfyui+fixes

42e2bf4

build/fix: condense and update dockerfiles

f1b80da

chore: version bump

de071ce

fix: pop more often with threads>1

ddc4d78

fix: wait less time w/ high perf. mode

624282c

fix: dont pause at all for short jobs on high perf mode

1befebb

fix: wait even less w/ high perf mode

6be6715

docs/fix: clarify certain stats/config in logs and docstrings

925e111

fix: use sqrt as intended

15682bc

fix: exit(1) on compvis model dl failure

d662404

fix: don't concurrently preload more than 1 model

7fe8794

The new config option `very_fast_disk_mode` overrides this.

fix: don't spam preload delay messages

683b1ca

fix: include conditional to not spam delay messages

e73fcfb

fix: give models a chance to load before failing

bbc99c7

The recent change to preloading being single-model only causes the logic here to think that the model isn't loaded when it should be. If there is one holding up the line, we'll wait to see what happens.

fix: correct version pins across dep files

9aaf862

fix: use latest horde model reference

155820c

style: fix

17f5099

fix: better deadlock detection when all procs. aren't busy

cfc62eb

tazlin mentioned this pull request Oct 4, 2024

fix: use a certifi ssl context for r2 uploads #306

Closed

tazlin added 4 commits October 4, 2024 11:18

fix: be slightly less aggressive w/ pops w/ high perf/threads

0c54357

fix: don't give conflicting advice about high_memory_mode and threads

d2f839e

chore: log a message to see if inf. proc. preload_models is called

6d397e6

fix: don't suggest high_memory_mode with <=32 sys ram

ced9ced

tazlin added 2 commits October 4, 2024 15:00

fix: avoid killing all processes before jobs are finished

a9a06be

This is in response to observed behavior where exiting with Control+C would kill all processes before all jobs had finished (with threads=2).
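One way to picture this fix is a shutdown routine that waits for in-flight jobs before stopping processes. The sketch below uses hypothetical names (`wait_for_jobs_then_stop`, `jobs_in_progress`, `stop_processes`) and an assumed timeout; it is not the worker's actual shutdown logic.

```python
import time
from typing import Callable


def wait_for_jobs_then_stop(
    jobs_in_progress: Callable[[], int],
    stop_processes: Callable[[], None],
    timeout_seconds: float = 60.0,
    poll_interval: float = 0.05,
) -> bool:
    """Wait for running jobs to finish (up to a timeout), then stop processes.

    Returns True if all jobs finished before the processes were stopped.
    """
    deadline = time.monotonic() + timeout_seconds
    # Poll until no jobs remain or the timeout expires.
    while jobs_in_progress() > 0 and time.monotonic() < deadline:
        time.sleep(poll_interval)
    finished = jobs_in_progress() == 0
    # Stop processes only after the wait, never before.
    stop_processes()
    return finished
```

The timeout keeps a stuck job from blocking shutdown forever, while the return value lets the caller log whether any work was abandoned.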

chore: version bump

6c106ca

tazlin and others added 5 commits October 4, 2024 16:00

fix: micromamba updated cli syntax for update-runtime

7bef6fe

fix: conflicting torchvision dep in update runtime

c01927f

fix: flag ending processes correctly

e52b354

fix: correctly download via load_large_models

107aa96

tazlin force-pushed the raw-png branch from 86a994c to 5e203d9 Compare October 6, 2024 08:15

tazlin linked an issue Oct 6, 2024 that may be closed by this pull request

horde-bridge script proceeds to start worker even if downloads fail #92

Open
