Describe the bug
According to the SageMaker Multi Model Server documentation, the server caches 'frequently' used models in memory (to my understanding, in RAM) to reduce response time by avoiding reloading the model on every request.
The first question would be: what does 'frequently' mean?
If I query the same model repeatedly with a delay of 30s between the invoke_endpoint calls, the server seems to reload the model into memory on every call, leading to response times of ~3s instead of the ~0.5s I get when the calls are less than 30s apart.
To reproduce
Deploy a SageMaker Multi Model Server endpoint using boto3, e.g. along the lines of the sketch below.
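For context, a minimal deployment sketch using the boto3 SageMaker client (the image URI, role ARN, bucket and resource names below are placeholders, not the values actually used here):

import boto3

sm_client = boto3.client('sagemaker')

# Placeholder names for illustration only
model_name = 'custom-multi-model'
endpoint_config_name = f'{model_name}-config'
endpoint_name = f'{model_name}-endpoint'

# A multi-model container points ModelDataUrl at an S3 prefix and sets
# Mode='MultiModel'; the TargetModel passed to invoke_endpoint is resolved
# relative to that prefix.
sm_client.create_model(
    ModelName=model_name,
    PrimaryContainer={
        'Image': '<account>.dkr.ecr.<region>.amazonaws.com/custom-sklearn-inference:latest',
        'ModelDataUrl': 's3://<bucket>/',
        'Mode': 'MultiModel',
    },
    ExecutionRoleArn='arn:aws:iam::<account>:role/<sagemaker-execution-role>',
)

sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[{
        'VariantName': 'AllTraffic',
        'ModelName': model_name,
        'InstanceType': 'ml.t2.medium',
        'InitialInstanceCount': 1,
    }],
)

sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)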
Generate a SageMaker runtime client using boto3 and execute the following code:
import time

for i in range(20):
    start = time.time()
    response = rt_client.invoke_endpoint(
        EndpointName=self.endpoint_name,
        ContentType='application/x-npy',
        TargetModel='model_store/custom_model_1.tar.gz',  # Constantly the same model
        Body=payload,  # Byte-encoded numpy array
    )
    end = time.time()
    response_time = end - start
    print(f'Request took {response_time}s.')
    time.sleep(30)
Expected behavior
The first call is slow (about 3s) and the following 19 calls lie in the expected ~0.5s range, which is the time it takes to call the endpoint when the model is already loaded.
Once I set the time.sleep() argument below 30s, e.g. to 20s, the calls are as fast as expected most of the time.
Is there any way to influence the timing of the unloading behavior?
I would expect the model to stay in memory as long as the memory is not needed to load other, more frequently used models. However, this does not seem to be the case, as each call takes the full ~3s.
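For debugging, one way to inspect which models the underlying Multi Model Server currently has loaded is its management API (the /models endpoint). This only works with direct access to the container (e.g. when running the custom image locally with docker run), since the SageMaker endpoint itself only exposes the invocation path. A rough sketch, assuming the default management port 8081:

import json
import urllib.request

# List the models currently registered with the Multi Model Server.
# Assumes direct access to the container's management port (8081),
# e.g. when running the custom image locally rather than behind SageMaker.
with urllib.request.urlopen('http://localhost:8081/models') as resp:
    loaded_models = json.load(resp)

print(json.dumps(loaded_models, indent=2))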
Screenshots or logs
With time.sleep(30):
Call: 0 of 20 with 4 samples took: 2.847299098968506s.
Call: 1 of 20 with 4 samples took: 3.017570734024048s.
Call: 2 of 20 with 4 samples took: 2.866020917892456s.
Call: 3 of 20 with 4 samples took: 2.888610363006592s.
Call: 4 of 20 with 4 samples took: 3.0125389099121094s.
Call: 5 of 20 with 4 samples took: 2.9569602012634277s.
Call: 6 of 20 with 4 samples took: 2.8126561641693115s.
Call: 7 of 20 with 4 samples took: 2.912917375564575s.
Call: 8 of 20 with 4 samples took: 2.866114854812622s.
Call: 9 of 20 with 4 samples took: 2.9781384468078613s.
Call: 10 of 20 with 4 samples took: 3.4418649673461914s.
Call: 11 of 20 with 4 samples took: 2.79472017288208s.
Call: 12 of 20 with 4 samples took: 2.992703437805176s.
Call: 13 of 20 with 4 samples took: 2.954014301300049s.
Call: 14 of 20 with 4 samples took: 2.9481523036956787s.
Call: 15 of 20 with 4 samples took: 2.928661346435547s.
Call: 16 of 20 with 4 samples took: 2.8345978260040283s.
Call: 17 of 20 with 4 samples took: 2.922405481338501s.
Call: 18 of 20 with 4 samples took: 2.982257843017578s.
Call: 19 of 20 with 4 samples took: 2.8227620124816895s.
With time.sleep(20):
Call: 0 of 20 with 4 samples took: 3.329136848449707s.
Call: 1 of 20 with 4 samples took: 0.5629911422729492s.
Call: 2 of 20 with 4 samples took: 0.5595850944519043s.
Call: 3 of 20 with 4 samples took: 0.5578911304473877s.
Call: 4 of 20 with 4 samples took: 0.5557725429534912s.
Call: 5 of 20 with 4 samples took: 0.5681345462799072s.
Call: 6 of 20 with 4 samples took: 0.5488979816436768s.
Call: 7 of 20 with 4 samples took: 0.5555169582366943s.
Call: 8 of 20 with 4 samples took: 0.5792186260223389s.
Call: 9 of 20 with 4 samples took: 0.9297688007354736s.
Call: 10 of 20 with 4 samples took: 0.6043572425842285s.
Call: 11 of 20 with 4 samples took: 0.572312593460083s.
Call: 12 of 20 with 4 samples took: 0.5600907802581787s.
Call: 13 of 20 with 4 samples took: 2.9460437297821045s.
Call: 14 of 20 with 4 samples took: 0.5780775547027588s.
Call: 15 of 20 with 4 samples took: 0.5762953758239746s.
Call: 16 of 20 with 4 samples took: 0.5773897171020508s.
Call: 17 of 20 with 4 samples took: 0.5769815444946289s.
Call: 18 of 20 with 4 samples took: 0.5663411617279053s.
Call: 19 of 20 with 4 samples took: 0.579679012298584s.
System information
Custom Docker Image:
Inference Framework: SkLearn
Sagemaker Inference Toolkit: 1.6.1
Multi Model Server: 1.1.8
Python version: 3.9
Processing unit type: CPU (ml.t2.medium)