Low Throughput and High Latency with TorchServe Deployment on AWS #3359

dummyuser-123 opened this issue Nov 6, 2024 · 0 comments

📚 The doc issue

I have created a custom handler for my image-to-image translation Stable Diffusion project, containerized it using Docker, and deployed it on a g6.xlarge instance on AWS. Currently, I am experiencing low throughput and high latency. I am testing the TorchServe API by sending 20 requests per minute from each of two devices using threading (40 requests per minute in total).
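
For reference, my client side looks roughly like the sketch below (this is not my exact script; the endpoint host and image paths are placeholders): each device runs one thread that posts an image every ~3 seconds to the prediction endpoint and records the round-trip time.

```python
import threading
import time
import requests

# Placeholder endpoint: replace localhost with the instance's address.
INFERENCE_URL = "http://localhost:8080/predictions/stable-diffusion"

def send_requests(device_name, image_paths):
    """Send one request every ~3 seconds (about 20 requests per minute)."""
    for i, path in enumerate(image_paths):
        with open(path, "rb") as f:
            start = time.time()
            resp = requests.post(INFERENCE_URL, data=f.read(), timeout=300)
        print(f"{device_name} request {i}: {resp.status_code}, {time.time() - start:.1f} s")
        time.sleep(3)

if __name__ == "__main__":
    # Two "devices" simulated with two threads, 20 requests each.
    device1 = threading.Thread(target=send_requests, args=("device-1", ["img1.png"] * 20))
    device2 = threading.Thread(target=send_requests, args=("device-2", ["img2.png"] * 20))
    device1.start(); device2.start()
    device1.join(); device2.join()
```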

Device 1:

| Sr. No | Image Resolution | Time (sec) |
|-------:|------------------|-----------:|
| 0 | 564x705 | 42 |
| 1 | 465x710 | 47 |
| 2 | 848x484 | 48 |
| 3 | 564x705 | 60 |
| 4 | 848x484 | 75 |
| 5 | 848x484 | 76 |
| 6 | 465x710 | 83 |
| 7 | 465x710 | 96 |
| 9 | 565x848 | 100 |
| 8 | 848x484 | 116 |
| 10 | 848x484 | 120 |
| 11 | 465x710 | 127 |
| 12 | 465x710 | 137 |
| 13 | 848x484 | 145 |
| 14 | 563x788 | 149 |
| 15 | 465x710 | 160 |
| 16 | 465x710 | 173 |
| 17 | 564x705 | 173 |
| 18 | 563x788 | 178 |
| 19 | 565x848 | 181 |

Device 2:

| Sr. No | Image Resolution | Time (sec) |
|-------:|------------------|-----------:|
| 0 | 563x788 | 41 |
| 3 | 563x788 | 45 |
| 1 | 563x788 | 59 |
| 2 | 848x484 | 56 |
| 4 | 465x710 | 96 |
| 5 | 465x710 | 93 |
| 7 | 564x705 | 94 |
| 6 | 564x705 | 98 |
| 8 | 848x484 | 91 |
| 9 | 848x484 | 115 |
| 10 | 564x705 | 112 |
| 15 | 565x848 | 140 |
| 14 | 563x788 | 143 |
| 12 | 564x705 | 152 |
| 13 | 565x848 | 149 |
| 11 | 848x484 | 155 |
| 17 | 564x705 | 158 |
| 18 | 848x484 | 155 |
| 16 | 564x705 | 161 |
| 19 | 465x710 | 156 |

Config.properties:

inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8083
enable_envvars_config=true
install_py_dep_per_model=true
load_models=stable-diffusion.mar
model_store=/home/model-server/model-store

models={
    "stable-diffusion": {
        "1.0": {
            "defaultVersion": true,
            "marName": "stable-diffusion.mar",
            "minWorkers": 3,
            "maxWorkers": 4,
            "batchSize": 5,
            "maxBatchDelay": 3000,
            "responseTimeout": 180
        }
    }
}
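
For completeness, here is a rough way to check whether these settings were actually applied; this is not from my deployment, just a sketch that queries the TorchServe management API on port 8081 (per the config above) to describe the registered model.

```python
import json
import requests

# Describe the registered model: the response includes batchSize,
# maxBatchDelay, minWorkers/maxWorkers, and the current worker list.
resp = requests.get("http://localhost:8081/models/stable-diffusion")
print(json.dumps(resp.json(), indent=2))
```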

  1. If there are 3 workers and the batch size is 5, how will the workers handle the batch? Will the 5 requests (1 batch) be divided among the 3 workers, or will each of the 3 workers handle 5 requests (1 batch) individually? (See the handler sketch after these questions for what I mean.)

  2. Will including number_of_netty_threads and netty_client_threads in the config.properties significantly improve throughput and latency?
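
To make question 1 concrete, here is a minimal sketch (not my actual handler; class and field names are placeholders) of the list-in/list-out contract that batching implies for a custom handler.

```python
from ts.torch_handler.base_handler import BaseHandler

class BatchAwareHandler(BaseHandler):
    """Sketch only, to illustrate how a handler sees a batch of requests."""

    def handle(self, data, context):
        # `data` is a list with between 1 and batchSize entries
        # (up to 5 with the config above).
        responses = []
        for row in data:
            payload = row.get("data") or row.get("body")
            # ... run the diffusion pipeline on `payload` here ...
            responses.append({"size_bytes": len(payload)})
        # TorchServe expects one response per request, in the same order.
        return responses
```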

Due to budget constraints, I can only afford one GPU instance. How can I optimize this setup to achieve better latency and throughput? Any suggestions would be a great help.

Suggest a potential alternative/fix

No response
