-
Notifications
You must be signed in to change notification settings - Fork 159
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
--model_dir is inconsistent, confusing, and unnecessary #340
Comments
--model_dir
is inconsistent, confusing, and unnecessary
sorry for the delayed response here. The main use case for an S3 I see your point with
but as you point out later, the argument name can be defined with whatever you supply, so if you're looking to write a framework-agnostic script that still runs on each of the images (not sure what the use case would be), you could always define it as As for why the TF parameter is named "model_dir", I believe it's to match TF's I can see how this has led to confusion, so I've added an item to our internal backlog to revisit this. It would be a breaking change, so it's something that would take some more consideration. |
Thanks for explaining the background and considering it on the backlog! 😄 My $0.02 is still that the risk of beginners misunderstanding the basics (thinking they need to upload their model) outweighs the value of consistency with Also just FYI for anybody who arrives here and chooses to use |
I am trying to use this arg to save trained TF models to the places I like but I'd say this arg doesn't behave as what I expect. Actually and honestly, I haven't yet figure out how it works and all the documents talking about it sounds confusing to me. Echo to OP. |
Definitely found this to be a source of headaches and bugs and errors as well. |
Raising here (where I believe the implementation is?) as opposed to on SageMaker SDK - which as I understand just documents the functionality.
Every other SageMaker framework container that I've seen (see docs for MXNet, PyTorch, Scikit-Learn, even Chainer) advises determining the local path to store your model via a pattern like:
...which is nice and flexible: I can consume all my parameters through argparse, and choose when debugging locally whether to specify via env var or CLI. I simply save my model to
args.model_dir
.(FWIW the few other frameworks that I've checked closely don't seem to actually pass it in through the CLI: Just the env var default is getting used)
I don't understand why the TF container insists on passing in the slightly different
--model_dir
, which:--model-dir
, because argparse shifts hyphens to underscores on the output object by defaultI'd argue this doesn't work well for power-users (because it's inconsistent with consensus in other frameworks), but also doesn't work well for beginners: who have to grapple with this "dir" parameter that isn't a directory at all but an S3 location, and will be confused into thinking they have to upload their model there when in fact SageMaker will do it for them.
For extra evidence of this, see also this repo's #115 (looks like a case of confusion caused) and #130 (user has specifically commented in code that the param is externally useless).
I don't really understand what the use case for the S3 URI would ever be in user code? Regardless of whether it's single- or multi-instance training, SageMaker can upload the final contents of
SM_MODEL_DIR
(you can just only save the model on your rank 0 instance for multi, right?).Nothing in the amazon-sagemaker-script-mode samples seems to use the S3 URIs, and in many cases (e.g. A, B, C, D) they specifically implement a separate
--model_output_dir
to work around the issue.aws/sagemaker-python-sdk/issues/1355 was resolved as a documentation change... To be clear I'm not claiming that the documentation is inaccurate or that there aren't other ways to achieve the same result. I'm asking for a (potentially API-breaking) implementation change, on the basis that the current API is bad for developers.
Would be great to hear what people are using
--model_dir
S3 URIs for if I'm wrong!The text was updated successfully, but these errors were encountered: