model_fn and input_fn called multiple times #1073

Open
aunz opened this issue Mar 4, 2020 · 10 comments
@aunz commented Mar 4, 2020

I am using the prebuilt SageMaker SKLearn container (https://github.com/aws/sagemaker-scikit-learn-container), version 0.20.0.
As the entry_point, I include a script that carries out the batch transform job.

import time

def model_fn(model_dir):
    # load and return the model from model_dir (/opt/ml/model)
    ...

def input_fn(input_data, content_type):
    # deserialize the request payload
    ...

def predict_fn(input_data, model):
    '''
        A long-running process to preprocess the data before calling the model
        https://aws.amazon.com/blogs/machine-learning/preprocess-input-data-before-making-predictions-using-amazon-sagemaker-inference-pipelines-and-scikit-learn/
    '''
    time.sleep(60 * 11)  # sleep for 11 minutes to simulate a long-running process
    ...

def output_fn(prediction, accept):
    # serialize the prediction for the response
    ...

I noticed in the CloudWatch logs that model_fn() was called multiple times:

21:11:43 model_fn called /opt/ml/model 0.3710819465747405
21:11:43 model_fn called /opt/ml/model 0.1368146211634631
21:11:44 model_fn called /opt/ml/model 0.09153953459183728

The input_fn() was also called multiple times

20:41:31 input_data <class 'str'> application/json 0.3936440317990033 {
20:51:30 input_data <class 'str'> application/json 0.4852180186010707 {
21:01:30 input_data <class 'str'> application/json 0.9954036507047136 {
21:11:30 input_data <class 'str'> application/json 0.0806271844985188 {

More precisely, it was called every 10 minutes.

I used an ml.m4.xlarge instance with BatchStrategy = SingleRecord and SplitType = None. I also set the environment variable SAGEMAKER_MODEL_SERVER_TIMEOUT = '9999' to overcome the 60 s timeout. I expected model_fn and input_fn to be called only once, but in this case they were called multiple times, and in the end the container crashed with "Internal Server Error".
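
For context, a batch transform job with these settings would be set up roughly as follows with the SageMaker Python SDK (the S3 paths, role ARN, and script name below are placeholders, not the actual values):

from sagemaker.sklearn.model import SKLearnModel

model = SKLearnModel(
    model_data='s3://my-bucket/model.tar.gz',      # placeholder
    role='arn:aws:iam::123456789012:role/MyRole',  # placeholder
    entry_point='inference.py',                    # script containing model_fn/input_fn/predict_fn/output_fn
    framework_version='0.20.0',
    env={'SAGEMAKER_MODEL_SERVER_TIMEOUT': '9999'},
)

transformer = model.transformer(
    instance_count=1,
    instance_type='ml.m4.xlarge',
    strategy='SingleRecord',
)

transformer.transform(
    data='s3://my-bucket/input/',                  # placeholder
    content_type='application/json',
    split_type=None,
)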

I saw a similar issue before, #341, where model_fn was called on each invocation. But in this case there is no /invocations; model_fn, input_fn, predict_fn, and output_fn were still called multiple times before the container crashed.

@ikennanwosu

How did you resolve this, please? I am getting the same issue.

@EKami commented Nov 9, 2020

Same issue here =/

@raydazn commented Dec 14, 2020

Same issue here. If model_fn provides the model-loading functionality, do we need to load the model for every batch?

@uday1212

Same issue here! Has anyone found a solution to this?

@naresh129

How was this issue solved? Same issue here too.

@llealgt commented Oct 11, 2024

Has anyone found a solution? I'm facing the same issue: the function runs 4 times, which seems to be once per available GPU.

@HubGab-Git

Can you show your code? I would like to reproduce it.

@kurtgdl commented Dec 7, 2024

Is there any update on this? It seems there's a problem with sagemaker-inference-toolkit; sagemaker-huggingface-inference-toolkit has the same issue: aws/sagemaker-huggingface-inference-toolkit#133

@athewsey (Contributor) commented Dec 16, 2024

In general, as far as I'm aware, it's expected that model_fn will be called multiple times: the default behaviour is for the server to load multiple copies of your model and use them to serve concurrent requests on multiple worker threads.

I've worked pretty closely with SageMaker but am not part of their core inference engineering team, so the following is based on an imperfect (and potentially outdated) understanding:

I believe both sagemaker-scikit-learn-container and sagemaker-huggingface-inference-toolkit (for the Hugging Face DLCs) use AWS Labs' multi-model-server as their base inference server. The core sagemaker-inference-toolkit depends on it too, as mentioned in its readme, but I know other DLCs like PyTorch and TensorFlow have been using their own ecosystems' serving stacks, TorchServe and TFX.

It does make sense for the stack to support multiple worker threads so you can effectively utilize resources like instances with multiple GPUs or a large number of CPU cores - and in general the stack should be configurable - but (IMO) it's a bit difficult to navigate, with the serving stacks for these containers being split across so many different layers of code repositories...

To explicitly control/limit the number of worker threads created and best utilize the hardware, I'd suggest trying the following environment variables (see the sketch after this list):

  • SAGEMAKER_MODEL_SERVER_WORKERS (as per SM Inference Toolkit parameters.py)
  • MMS_DEFAULT_WORKERS_PER_MODEL, MMS_NETTY_CLIENT_THREADS, and possibly also MMS_NUMBER_OF_NETTY_THREADS (as per MMS configuration doc and underlying ConfigManager)
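
For example, a minimal sketch of pinning a single model worker (assuming the serving stack reads these variables at container startup; the values are illustrative, not verified against every container version):

# Pass via the `env` argument of the SageMaker Model, e.g. SKLearnModel(..., env=env)
env = {
    'SAGEMAKER_MODEL_SERVER_WORKERS': '1',   # SM Inference Toolkit: number of model workers to spawn
    'MMS_DEFAULT_WORKERS_PER_MODEL': '1',    # MMS: workers started per loaded model
}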

input_fn being called multiple times for a single request is more concerning, as that looks like a retry. You may have to set MMS-specific timeout & payload size configurations if the SAGEMAKER_ one isn't getting picked up. For example, for large-payload/long-running inference on the Hugging Face v4.28 container in the past, I used MMS_DEFAULT_RESPONSE_TIMEOUT, MMS_MAX_REQUEST_SIZE, and MMS_MAX_RESPONSE_SIZE.
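
As an illustration only (the values below are arbitrary examples, not recommendations), these MMS-side overrides can be passed the same way through the container environment:

# Pass alongside (or instead of) the SAGEMAKER_ settings, e.g. HuggingFaceModel(..., env=env)
env = {
    'MMS_DEFAULT_RESPONSE_TIMEOUT': '3600',            # seconds before MMS gives up waiting on a response
    'MMS_MAX_REQUEST_SIZE': str(100 * 1024 * 1024),    # request payload limit, in bytes
    'MMS_MAX_RESPONSE_SIZE': str(100 * 1024 * 1024),   # response payload limit, in bytes
}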

Hope this helps, but it'd be great to hear from anybody who manages to clarify exactly which env vars are sufficient to control the number of model workers spawned on these containers.

@kurtgdl commented Dec 17, 2024

Thanks a lot @athewsey. Your points are very useful.
