feat: latest torch/comfyui; perf improvements; fix: SSL cert issues #309

Draft
wants to merge 32 commits into
base: main

Conversation

tazlin
Member

@tazlin tazlin commented Oct 4, 2024

New Features/Updates

Fixes and Improvements

  • Improved the stability and performance of high_performance_mode.
    • Jobs which are expected to be brief now do not block job pops. Additionally, less time is spent in general waiting if this mode is on.
  • Improved the stability and performance of max_threads values greater than one.
    • xx90 series cards will likely see a large improvement with max_threads: 2 and a bit of tuning.
      • Important: You almost certainly will want high_performance_mode if you have a xx90 card.
    • Note that cascade and flux, as well as high_memory_mode can still lead to additional instability with threads at 2.
    • xx80 series cards may benefit from max_threads: 2 in SD1.5-only setups without controlnets/post-processing or in other conservative configurations.
  • Improved process management with enhanced deadlock detection and handling.
    • In particular, hangs where all of the processes were available and waiting should be detected and corrected more readily.
  • Optimized image processing by using rawpng directly, reducing redundant operations.
    • The repeated call to PIL.Image.open(...) was highly inefficient, especially for very large images.
    • The already-encoded PNG sent from ComfyUI is used instead.
  • Added SSL context using certifi to resolve certificate resolution issues.
  • Updated documentation to reflect changes in CUDA version and new configuration options.
  • Fixed a bug where the download_models.py would not exit if the compvis models failed to download. This would cause the worker to crash unexpectedly as it expects the models to be available on worker start.
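
The `download_models.py` fix in the last bullet can be illustrated with a minimal sketch. It is not the script's actual code: `run_downloads` and the idea of passing in a list of downloader callables are hypothetical names used only for illustration. The point is that a failed download becomes a non-zero exit status, so a launcher script can refuse to start the worker.

```python
import sys
from typing import Callable, Iterable


def run_downloads(downloaders: Iterable[Callable[[], bool]]) -> int:
    """Run each downloader in order; return 0 if all succeed, 1 otherwise.

    `downloaders` is a hypothetical collection of callables that return
    True on success. A raised exception is treated as a failed download.
    """
    for download in downloaders:
        try:
            ok = download()
        except Exception as exc:
            print(f"download failed: {exc}", file=sys.stderr)
            return 1
        if not ok:
            return 1
    return 0


def main(downloaders: Iterable[Callable[[], bool]]) -> None:
    # Propagate the result as the process exit status so that a calling
    # shell script can check it before starting the worker.
    raise SystemExit(run_downloads(downloaders))
```

A calling shell script would then check the exit status of the download step before launching the worker.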

This already functionally happens in hordelib. We can use the bytestream passed from it directly.
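
As a rough sketch of the idea, assuming the result is ultimately shipped as a base64 string (an assumption, not confirmed by this PR), the already-encoded PNG bytes can be passed straight through without decoding the image with `PIL.Image.open(...)` and re-encoding it. The function name `encode_rawpng` is hypothetical.

```python
import base64

# The first eight bytes of every PNG file, per the PNG specification.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def encode_rawpng(rawpng: bytes) -> str:
    """Base64-encode already-encoded PNG bytes without decoding the image.

    Only the signature is checked; the pixel data is never decoded, so the
    cost does not grow with a decode/re-encode round trip on large images.
    """
    if not rawpng.startswith(PNG_SIGNATURE):
        raise ValueError("expected PNG-encoded bytes")
    return base64.b64encode(rawpng).decode("ascii")
```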
This is tied to a bug whose root cause is unclear, but whose practical effect was that the root signing certificate was not being found on a relatively fresh Windows 10 Pro machine. `certifi` should already be pulled in as a transitive dependency, but I've marked it as an explicit requirement, and I don't anticipate side effects on machines which were previously running fine.
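
A minimal sketch of building an SSL context from the `certifi` CA bundle follows. It is not the code from `process_manager.py`; the helper name `build_ssl_context` and the fallback to the system trust store when `certifi` is not installed are assumptions made so the example runs anywhere.

```python
import ssl

try:
    import certifi  # third-party package that ships a CA certificate bundle
except ImportError:
    certifi = None  # assumption for this sketch: fall back to system certs


def build_ssl_context() -> ssl.SSLContext:
    """Create a verifying SSL context, preferring certifi's CA bundle."""
    cafile = certifi.where() if certifi is not None else None
    # With cafile=None, create_default_context loads the system's default
    # CA certificates; otherwise it trusts the certificates in the file.
    return ssl.create_default_context(cafile=cafile)
```

The resulting context can be handed to any client that accepts one, for example `urllib.request.urlopen(url, context=build_ssl_context())`.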
The new config option `very_fast_disk_mode` overrides this.
The recent change to preloading being single-model only causes the logic here to think that the model isn't loaded when it should be. If there is one holding up the line, we'll wait to see what happens.
@tazlin
Member Author

tazlin commented Oct 4, 2024

@CodiumAI-Agent /describe

@CodiumAI-Agent

Title

feat: latest torch/comfyui; perf improvements; fix: SSL cert issues


User description

todo


PR Type

enhancement, bug fix, documentation


Description

  • Updated PyTorch version to 2.4.1 and CUDA to cu124 across multiple files, including Dockerfiles and runtime scripts.
  • Added very_fast_disk_mode configuration option for concurrent model loading.
  • Improved process management with enhanced deadlock detection and handling.
  • Optimized image processing by using rawpng directly, reducing redundant operations.
  • Updated horde dependencies to the latest versions.
  • Added SSL context using certifi to resolve certificate issues.
  • Updated documentation to reflect changes in CUDA version and new configuration options.
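
The deadlock detection mentioned in the description can be pictured with the sketch below. Everything in it is hypothetical (the `ProcessState` values, the `DeadlockDetector` class, and the timeout); it only illustrates the general idea of flagging the case where every process is available and waiting while jobs are still queued, and it is not the logic in `process_manager.py`.

```python
import time
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List, Optional


class ProcessState(Enum):
    WAITING_FOR_JOB = auto()
    BUSY = auto()


@dataclass
class DeadlockDetector:
    """Flags a stall: all processes idle while jobs remain pending."""

    timeout_seconds: float = 30.0
    _stalled_since: Optional[float] = field(default=None, init=False)

    def check(
        self,
        states: List[ProcessState],
        pending_jobs: int,
        now: Optional[float] = None,
    ) -> bool:
        """Return True once the stall has lasted at least timeout_seconds."""
        now = time.monotonic() if now is None else now
        all_idle = bool(states) and all(
            s is ProcessState.WAITING_FOR_JOB for s in states
        )
        if not (all_idle and pending_jobs > 0):
            self._stalled_since = None  # work is underway or nothing queued
            return False
        if self._stalled_since is None:
            self._stalled_since = now  # first observation of the stall
            return False
        return now - self._stalled_since >= self.timeout_seconds
```

A manager loop would call `check(...)` periodically and trigger its recovery path once it returns `True`.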

Changes walkthrough 📝

Relevant files
Miscellaneous
3 files
__init__.py
Version bump to 9.1.0                                                                       

horde_worker_regen/__init__.py

  • Updated version from 9.0.7 to 9.1.0.
+1/-1     
_version_meta.json
Update recommended version to 9.1.0                                           

horde_worker_regen/_version_meta.json

  • Updated recommended version to 9.1.0.
+1/-1     
pyproject.toml
Update project version to 9.1.0                                                   

pyproject.toml

  • Updated project version to 9.1.0.
+1/-1     
Enhancement
3 files
data_model.py
Add very_fast_disk_mode configuration option                         

horde_worker_regen/bridge_data/data_model.py

  • Added very_fast_disk_mode configuration option.
+3/-0     
inference_process.py
Optimize image processing and improve logging                       

horde_worker_regen/process_management/inference_process.py

  • Improved logging for model preloading.
  • Optimized image processing by using rawpng directly.
+4/-6
    process_manager.py
    Enhance process management and SSL handling                           

    horde_worker_regen/process_management/process_manager.py

  • Added SSL context using certifi.
  • Enhanced process management with new methods and properties.
  • Improved deadlock detection and handling.
+165/-14
    Dependencies
    11 files
    update-runtime.sh
    Update PyTorch and CUDA versions in runtime script             

    update-runtime.sh

    • Updated PyTorch version to 2.4.1 and CUDA to cu124.
    +2/-2     
    horde-bridge.cmd
    Update horde dependencies in Windows script                           

    horde-bridge.cmd

    • Updated horde dependencies to latest versions.
    +1/-1     
    update-runtime.cmd
    Update PyTorch and CUDA versions in Windows runtime script

    update-runtime.cmd

    • Updated PyTorch version to 2.4.1 and CUDA to cu124.
    +3/-3     
    .pre-commit-config.yaml
    Update dependencies in pre-commit configuration                   

    .pre-commit-config.yaml

    • Updated dependencies to latest versions.
    +4/-4     
    Dockerfile.12.1.1-22.04
    Update Dockerfile with new CUDA version and dependencies 

    Dockerfiles/Dockerfile.12.1.1-22.04

  • Updated CUDA version to cu124.
  • Added installation of opencv-python-headless.
    +3/-3
    Dockerfile.12.2.2-22.04
    Update Dockerfile with new CUDA version and dependencies 

    Dockerfiles/Dockerfile.12.2.2-22.04

  • Updated CUDA version to cu124.
  • Added installation of opencv-python-headless.
    +3/-3
    Dockerfile.12.3.2-22.04
    Update Dockerfile with new CUDA version and dependencies 

    Dockerfiles/Dockerfile.12.3.2-22.04

  • Updated CUDA version to cu124.
  • Added installation of opencv-python-headless.
    +3/-3
    Dockerfile.12.4-22.04
    Add Dockerfile for CUDA 12.4                                                         

    Dockerfiles/Dockerfile.12.4-22.04

    • Added new Dockerfile for CUDA 12.4.
    +30/-0   
    requirements.rocm.txt
    Update ROCm requirements with latest dependencies               

    requirements.rocm.txt

    • Updated torch and horde dependencies to latest versions.
    +4/-4     
    requirements.txt
    Update requirements with latest dependencies and SSL fix 

    requirements.txt

  • Updated torch and horde dependencies to latest versions.
  • Added certifi for SSL certificate resolution.
    +7/-5
    tox.ini
    Update test environment with new PyTorch version                 

    tox.ini

    • Updated PyTorch version in test environment to cu124.
    +1/-1     
    Documentation
    2 files
    README_advanced.md
    Update README with new CUDA version instructions                 

    README_advanced.md

    • Updated PyTorch installation instructions to use CUDA cu124.
    +1/-1     
    bridgeData_template.yaml
    Add very_fast_disk_mode to configuration template               

    bridgeData_template.yaml

    • Added very_fast_disk_mode option to configuration template.
    +4/-0     


    This is in response to observed behavior where exiting with control+c would lead to all processes getting killed before all jobs were finished (with threads=2).
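
One way to picture the intended behavior is the sketch below: on control+c, stop popping new jobs but keep the processes alive until in-flight jobs are done. The `ShutdownCoordinator` class and its members are hypothetical names for illustration and do not reflect the worker's actual implementation.

```python
import signal
import threading


class ShutdownCoordinator:
    """Stops accepting new jobs on Ctrl+C but lets in-flight jobs finish."""

    def __init__(self) -> None:
        self.shutdown_requested = threading.Event()
        self.jobs_in_progress = 0  # updated by the (hypothetical) job loop

    def install(self) -> None:
        # Must be called from the main thread; replaces the default SIGINT
        # handler, which would otherwise raise KeyboardInterrupt.
        signal.signal(signal.SIGINT, self._on_sigint)

    def _on_sigint(self, signum, frame) -> None:
        # Only record the request; the main loop decides when to exit.
        self.shutdown_requested.set()

    def may_pop_job(self) -> bool:
        return not self.shutdown_requested.is_set()

    def may_end_processes(self) -> bool:
        # Child processes are ended only after every running job finishes.
        return self.shutdown_requested.is_set() and self.jobs_in_progress == 0
```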
    tazlin and others added 5 commits October 4, 2024 16:00
    This was incomplete and masked by 22.04 also using 3.11 as its Python version.
    python3.11-dev is technically optional, but included because it's needed on the AMD side.
    venv creation HAS to be called with the full version, otherwise the dist default is used.
    pip only needs to be updated inside the venv.
    Development

    Successfully merging this pull request may close these issues.

    horde-bridge script proceeds to start worker even if downloads fail
    3 participants