model_fn and input_fn called multiple times #1073
Comments
How did you resolve this, please? I am getting the same issue.
Same issue here =/
Same issue here. If model_fn provides the functionality of loading the model, do we need to load it for every batch?
Same issue here! Anyone found a solution to this?
How is this issue solved? Same issue here too.
Has anyone found a solution? I'm facing the same issue: the function runs 4 times, seemingly once per available GPU.
Can you show your code? I would like to reproduce it.
Is there any update on this? It seems there's a problem with sagemaker-inference-toolkit. sagemaker-huggingface-inference-toolkit has the same issue: aws/sagemaker-huggingface-inference-toolkit#133
In general, as far as I'm aware, it's expected that these functions may be called more than once, since the serving stack can spawn multiple model workers and each worker loads the model.

I've worked pretty closely with SageMaker but am not part of their core inference engineering team, so the following is based on an imperfect (and potentially outdated) understanding: I believe both the sagemaker-scikit-learn-container and the sagemaker-huggingface-inference-toolkit (for Hugging Face DLCs) use AWS Labs' multi-model-server as their base inference server. The core sagemaker-inference-toolkit depends on it too, as mentioned in its README, but I know other DLCs like PyTorch and TensorFlow have been using their own ecosystems' serving stacks, TorchServe and TFX.

It does make sense for the stack to support multiple worker threads so you can effectively utilize resources like instances with multiple GPUs or a large number of CPU cores - and in general the stack should be configurable - but (IMO) it's a bit difficult to navigate, with the serving stacks for these containers being split across so many different layers of code repositories... To explicitly control/limit the number of workers created and best utilize the hardware, I'd suggest experimenting with the serving stack's worker-count environment variables.

Hope this helps, but it'd be great to hear from anybody who manages to clarify exactly which env vars are sufficient to control the number of model workers spawned on these containers.
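For example, here is a minimal sketch of pinning the worker count when deploying through the SageMaker Python SDK. It assumes the container honours the SAGEMAKER_MODEL_SERVER_WORKERS environment variable (read by the sagemaker-inference-toolkit / sagemaker-containers serving stacks); the model artifact, role, and entry-point names are placeholders:

```python
# Sketch only: assumes the serving stack reads SAGEMAKER_MODEL_SERVER_WORKERS.
# Verify against the specific container version you deploy.
from sagemaker.sklearn.model import SKLearnModel

model = SKLearnModel(
    model_data="s3://my-bucket/model.tar.gz",               # placeholder artifact
    role="arn:aws:iam::111122223333:role/MySageMakerRole",  # placeholder role
    entry_point="inference.py",                             # placeholder script
    framework_version="0.20.0",
    env={
        # Limit the model server to a single worker so the model is loaded once per container.
        "SAGEMAKER_MODEL_SERVER_WORKERS": "1",
    },
)
```

Limiting workers to 1 trades away parallelism, so it is mainly useful for confirming whether the repeated model_fn calls are per-worker loads.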
Thanks a lot @athewsey. Your points are very useful.
I am using the prebuilt SageMaker SKLearn container (https://github.com/aws/sagemaker-scikit-learn-container), version 0.20.0. As the entry_point, I include a script which carries out the batch transform job.
I noticed in the CloudWatch logs that model_fn() was called multiple times. input_fn() was also called multiple times - precisely, every 10 minutes.
I used ml.m4.xlarge, BatchStrategy = SingleRecord, and SplitType of None (configured as sketched below). I also used the environment variable SAGEMAKER_MODEL_SERVER_TIMEOUT = '9999' to overcome the 60 s timeout. I expected model_fn and input_fn to be called only once, but in this case they were called multiple times, and in the end the container crashed with "Internal Server Error".
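For context, here is a minimal sketch (with hypothetical model name and S3 paths) of how a batch transform job with these settings can be configured through the SageMaker Python SDK:

```python
# Sketch of the transform configuration described above; names and paths are hypothetical.
from sagemaker.transformer import Transformer

transformer = Transformer(
    model_name="my-sklearn-model",                   # an already-registered SageMaker model
    instance_count=1,
    instance_type="ml.m4.xlarge",
    strategy="SingleRecord",                         # BatchStrategy = SingleRecord
    env={"SAGEMAKER_MODEL_SERVER_TIMEOUT": "9999"},  # lift the default 60 s timeout
)

transformer.transform(
    data="s3://my-bucket/batch-input/",              # hypothetical input prefix
    content_type="text/csv",
    split_type=None,                                 # SplitType = None
)
transformer.wait()
```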
I saw a similar issue before, #341, where model_fn was called on each invocation. But in this case there are no /invocations calls: model_fn, input_fn, predict_fn, and output_fn were called multiple times, and in the end the container crashed with "Internal Server Error".
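For anyone trying to reproduce or narrow this down, here is a minimal sketch of the inference hooks the SKLearn container calls from the entry-point script, with log lines that make each invocation visible in CloudWatch. The model filename, CSV handling, and logging are illustrative assumptions, not the original script:

```python
# inference.py - illustrative sketch of the SKLearn serving hooks with logging,
# so repeated calls to model_fn/input_fn show up clearly in CloudWatch.
import logging
import os
from io import StringIO

import joblib
import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def model_fn(model_dir):
    # Called by the serving stack to load the model; log the pid to see which worker loads it.
    logger.info("model_fn called (pid=%s)", os.getpid())
    return joblib.load(os.path.join(model_dir, "model.joblib"))  # assumed artifact name


def input_fn(request_body, content_type):
    # Deserialize each incoming mini-batch; assumes CSV input for illustration.
    logger.info("input_fn called, content_type=%s", content_type)
    if content_type == "text/csv":
        return pd.read_csv(StringIO(request_body), header=None)
    raise ValueError("Unsupported content type: " + str(content_type))


def predict_fn(input_data, model):
    logger.info("predict_fn called on %d rows", len(input_data))
    return model.predict(input_data)


def output_fn(prediction, accept):
    logger.info("output_fn called, accept=%s", accept)
    return ",".join(str(p) for p in prediction)
```

If the pid logged by model_fn differs between calls, the repeated loads are coming from separate worker processes rather than from a single worker being re-invoked.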