Replace processes when a model is unloaded #36

Merged
tazlin merged 8 commits into Haidra-Org:main from lru-cache on Dec 14, 2023

Conversation

@zten zten commented Nov 27, 2023

On Linux, there seems to be a tremendous amount of memory allocated outside Python when you allow one worker process to handle many jobs across many different models. To limit the damage from that behavior, we'll try to replace a process when its model is scheduled to be unloaded.

I realize this probably makes it a little harder to decouple a process from a model, but it's a huge stability improvement.

This also changes the model management strategy slightly by allocating a model to every open worker before trying to unload one. Previously the worker ran N processes = `threads` + `queue`, but jobs were very likely to be scheduled on only `threads` of those workers.
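
As a rough sketch of the idea (the names and structure below are hypothetical, not the actual horde-worker-reGen code): when a model is scheduled to be unloaded, the process that loaded it is asked to exit and a fresh one is started in its place, so memory allocated outside Python is returned to the OS instead of accumulating in a long-lived worker.

```python
import multiprocessing as mp


def _worker_main(model_name, job_queue):
    # Hypothetical worker loop: serve jobs for one model until told to stop.
    while True:
        job = job_queue.get()
        if job is None:  # sentinel: the model is being unloaded
            return
        # run inference for `job` with `model_name` here (omitted)


def replace_process_for_model(model_name, procs, queues):
    """Terminate the process holding `model_name` and start a fresh one."""
    old = procs.pop(model_name, None)
    if old is not None:
        queues[model_name].put(None)   # ask the worker to exit cleanly
        old.join(timeout=30)
        if old.is_alive():             # hung worker: force the replacement
            old.kill()
            old.join()
    queue = mp.Queue()
    proc = mp.Process(target=_worker_main, args=(model_name, queue), daemon=True)
    proc.start()
    procs[model_name], queues[model_name] = proc, queue
```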

@zten zten force-pushed the lru-cache branch 3 times, most recently from 4badeec to 79284b5 on November 27, 2023 03:28
zten commented Nov 27, 2023

I have some bugs to solve with this that I discovered from an overnight run of my workers; the hung-worker replacement isn't working correctly.

@db0 db0 requested a review from tazlin December 8, 2023 13:14

In order to clean up jobs in progress, we need to know where each job went in the event that we kill a process.
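
As a rough illustration of that bookkeeping (again with hypothetical names, not this PR's actual data structures), the parent can record which process each job was dispatched to, so that jobs running on a killed process can be requeued or reported as faulted:

```python
from collections import defaultdict

# pid -> ids of jobs currently running in that process
jobs_in_flight: dict[int, set[str]] = defaultdict(set)


def record_dispatch(pid: int, job_id: str) -> None:
    jobs_in_flight[pid].add(job_id)


def record_completion(pid: int, job_id: str) -> None:
    jobs_in_flight[pid].discard(job_id)


def cleanup_after_kill(pid: int, requeue) -> None:
    # Called after a worker process is killed; `requeue` is whatever the
    # caller uses to put a job back on the queue or report it as faulted.
    for job_id in jobs_in_flight.pop(pid, set()):
        requeue(job_id)
```
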
zten commented Dec 11, 2023

This still isn't quite right, because I've seen `BrokenPipeError` get thrown in other spots, such as when starting inference normally.
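
One way to contain that (illustrative only; `safe_send` is not a function from this PR) is to guard the parent's pipe writes so a dead or replaced child surfaces as a failed dispatch rather than an unhandled exception:

```python
from multiprocessing.connection import Connection


def safe_send(conn: Connection, message) -> bool:
    """Return False instead of raising if the child on the other end is gone."""
    try:
        conn.send(message)
        return True
    except (BrokenPipeError, OSError):
        # The worker process was killed or replaced; the caller should
        # treat the job as faulted and re-dispatch it.
        return False
```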

tazlin commented Dec 12, 2023

Let me know when you feel this branch is feature-stable, and I'll work on promoting it to main.

tazlin commented Dec 14, 2023

Given that we are still in a somewhat limited beta, and that I have seen these changes work locally for extended periods of time, I am going to merge this into main: it is generally more efficient and overall a net improvement. Further, it may address some problems that people are encountering fairly frequently.

@tazlin tazlin left a comment

The code is in the general spirit of this project's contribution requirements and furthers the goal of worker stability. I have been able to run this branch (rebased on the current changes on main) for an extended period of time. Considering we are in a beta phase, I feel this is stable enough to be on main.

@tazlin tazlin merged commit fcb6a50 into Haidra-Org:main Dec 14, 2023
1 check passed