--model_dir is inconsistent, confusing, and unnecessary #340

athewsey · 2020-04-07T08:48:45Z

Raising here (where I believe the implementation is?) as opposed to on SageMaker SDK - which as I understand just documents the functionality.

Every other SageMaker framework container that I've seen (see docs for MXNet, PyTorch, Scikit-Learn, even Chainer) advises determining the local path to store your model via a pattern like:

parser = argparse.ArgumentParser()
parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])

...which is nice and flexible: I can consume all my parameters through argparse, and choose when debugging locally whether to specify via env var or CLI. I simply save my model to args.model_dir.

(FWIW the few other frameworks that I've checked closely don't seem to actually pass it in through the CLI: Just the env var default is getting used)

I don't understand why the TF container insists on passing in the slightly different --model_dir, which:

Can't easily be used together with --model-dir, because argparse shifts hyphens to underscores on the output object by default
Per SageMaker SDK issue #1355, reports the destination S3 URI for single-instance training but the local model folder for multi-instance training!

I'd argue this doesn't work well for power-users (because it's inconsistent with consensus in other frameworks), but also doesn't work well for beginners: who have to grapple with this "dir" parameter that isn't a directory at all but an S3 location, and will be confused into thinking they have to upload their model there when in fact SageMaker will do it for them.

For extra evidence of this, see also this repo's #115 (looks like a case of confusion caused) and #130 (user has specifically commented in code that the param is externally useless).

I don't really understand what the use case for the S3 URI would ever be in user code? Regardless of whether it's single- or multi-instance training, SageMaker can upload the final contents of SM_MODEL_DIR (you can just only save the model on your rank 0 instance for multi, right?).

Nothing in the amazon-sagemaker-script-mode samples seems to use the S3 URIs, and in many cases (e.g. A, B, C, D) they specifically implement a separate --model_output_dir to work around the issue.

aws/sagemaker-python-sdk/issues/1355 was resolved as a documentation change... To be clear I'm not claiming that the documentation is inaccurate or that there aren't other ways to achieve the same result. I'm asking for a (potentially API-breaking) implementation change, on the basis that the current API is bad for developers.

Would be great to hear what people are using --model_dir S3 URIs for if I'm wrong!

The text was updated successfully, but these errors were encountered:

laurenyu · 2020-04-21T21:39:46Z

sorry for the delayed response here.

The main use case for an S3 model_dir is when using TensorFlow's support for S3 checkpointing during distributed training. TF (the framework itself) does its own thing with uploading to S3, which is why it has different behavior from the other frameworks.

I see your point with

parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])

but as you point out later, the argument name can be defined with whatever you supply, so if you're looking to write a framework-agnostic script that still runs on each of the images (not sure what the use case would be), you could always define it as model-output-dir or something else. As you also point out, none of the frameworks actually pass this in as a hyperparameter by default, so you could always save the value of os.environ['SM_MODEL_DIR'] to a variable of your choice in your script.

As for why the TF parameter is named "model_dir", I believe it's to match TF's RunConfig parameter name.

I can see how this has led to confusion, so I've added an item to our internal backlog to revisit this. It would be a breaking change, so it's something that would take some more consideration.

athewsey · 2020-04-23T09:06:28Z

Thanks for explaining the background and considering it on the backlog! 😄

My $0.02 is still that the risk of beginners misunderstanding the basics (thinking they need to upload their model) outweighs the value of consistency with RunConfig - but as you say the change would be breaking so not to be taken lightly!

Also just FYI for anybody who arrives here and chooses to use argparse.parse_known_args() to omit writing unused parser.add_argument('--model_dir', ...) in their scripts: Do note the warning about prefix matching in the argparse doc.

dslove · 2020-10-16T00:23:31Z

I am trying to use this arg to save trained TF models to the places I like but I'd say this arg doesn't behave as what I expect. Actually and honestly, I haven't yet figure out how it works and all the documents talking about it sounds confusing to me. Echo to OP.

nonoesp · 2021-12-02T11:43:45Z

Definitely found this to be a source of headaches and bugs and errors as well.

athewsey changed the title ~~--model_dir is inconsistent, confusing, and unnecessary~~ --model_dir is inconsistent, confusing, and unnecessary Apr 7, 2020

laurenyu added the type: enhancement label Apr 21, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

--model_dir is inconsistent, confusing, and unnecessary #340

--model_dir is inconsistent, confusing, and unnecessary #340

athewsey commented Apr 7, 2020

laurenyu commented Apr 21, 2020

athewsey commented Apr 23, 2020

dslove commented Oct 16, 2020

nonoesp commented Dec 2, 2021

--model_dir is inconsistent, confusing, and unnecessary #340

--model_dir is inconsistent, confusing, and unnecessary #340

Comments

athewsey commented Apr 7, 2020

laurenyu commented Apr 21, 2020

athewsey commented Apr 23, 2020

dslove commented Oct 16, 2020

nonoesp commented Dec 2, 2021