Possible class or Enum for SageMaker Job #4935

Open
sjcahill-fcc opened this issue Nov 20, 2024 · 0 comments

Describe the feature you'd like

When working with SageMaker we are often defining sources and destinations for data and artifacts within our jobs.

For instance, a ProcessingInput for a processing job is defined like this:

from sagemaker.processing import ProcessingInput, ProcessingOutput

ProcessingInput(
    source='s3://path/to/my/input-data.csv',
    destination='/opt/ml/processing/input',
)

and an output would be defined like:

ProcessingOutput(source='/opt/ml/processing/output/train', destination='s3://...')

The /opt/ml/... file paths determine where resources live inside the container, so our processing and training code has to handle them correctly.
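As a minimal sketch of what "handling these paths" looks like on the script side, the function below copies CSV files from the input location to the train output location. The function name `copy_inputs_to_train` and the copy-only logic are hypothetical and purely illustrative; only the two /opt/ml/... paths come from the examples above. The directories are parameters so the function can also be exercised outside a container.

```python
import shutil
from pathlib import Path

# Container-side locations taken from the ProcessingInput/ProcessingOutput examples.
INPUT_DIR = Path("/opt/ml/processing/input")
TRAIN_OUTPUT_DIR = Path("/opt/ml/processing/output/train")


def copy_inputs_to_train(input_dir=INPUT_DIR, output_dir=TRAIN_OUTPUT_DIR):
    """Copy every CSV in input_dir to output_dir (hypothetical placeholder step)."""
    input_dir = Path(input_dir)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(input_dir.glob("*.csv")):
        dst = output_dir / src.name
        shutil.copy(src, dst)
        copied.append(dst)
    return copied
```

A typo in either path here fails silently (no files found, or output written where SageMaker never uploads it), which is the kind of mistake shared constants would help avoid.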

Training and tuning jobs have similar locations, and environment variables can control the default locations where resources are expected inside the container.
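For training, a sketch of resolving these locations might look like the following. It assumes the container uses the SageMaker training toolkit, which sets variables such as SM_MODEL_DIR, SM_CHANNEL_TRAIN and SM_OUTPUT_DATA_DIR; the fallback paths are the conventional defaults, and the helper name `training_paths` is hypothetical.

```python
import os


def training_paths(env=None):
    """Resolve common training locations, falling back to conventional defaults.

    Assumes the SM_* variables set by the SageMaker training toolkit;
    `env` can be any mapping, which makes the function easy to test locally.
    """
    env = os.environ if env is None else env
    return {
        "model_dir": env.get("SM_MODEL_DIR", "/opt/ml/model"),
        "train_channel": env.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"),
        "output_data_dir": env.get("SM_OUTPUT_DATA_DIR", "/opt/ml/output/data"),
    }
```

Reading the variable first and only then falling back keeps the script working when a job overrides the default location.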

To stay consistent across our SageMaker projects, we usually end up defining a basic class or an Enum in a config file. This helps avoid typos and lets users keep the same conventions between projects.

A class or Enum that defines the most commonly used locations could help users who are new to SageMaker, and it would save everyone from having to look up the conventional locations in the documentation (which can be a little scattered).

For example:

class SageMakerProcessingChannels:
    PROCESSING_INPUT_CHANNEL = "/opt/ml/processing/input"
    PROCESSING_OUTPUT_CHANNEL = "/opt/ml/processing/output"
    PROCESSING_TRAIN_OUTPUT_CHANNEL = "/opt/ml/processing/output/train"
    PROCESSING_VALIDATION_OUTPUT_CHANNEL = "/opt/ml/processing/output/validation"
    PROCESSING_TEST_OUTPUT_CHANNEL = "/opt/ml/processing/output/test"
    PROCESSING_TEMP = "/opt/ml/processing/temp"
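The Enum variant could look roughly like this. It is only a sketch: the name `ProcessingPath` and the member names are hypothetical, and the paths are the same ones as in the class above. Mixing in `str` lets members compare equal to plain strings, while `.value` gives the raw path to pass on to the SDK.

```python
from enum import Enum


class ProcessingPath(str, Enum):
    """Hypothetical Enum of conventional processing locations."""

    INPUT = "/opt/ml/processing/input"
    OUTPUT = "/opt/ml/processing/output"
    TRAIN_OUTPUT = "/opt/ml/processing/output/train"
    VALIDATION_OUTPUT = "/opt/ml/processing/output/validation"
    TEST_OUTPUT = "/opt/ml/processing/output/test"
    TEMP = "/opt/ml/processing/temp"


# Use .value wherever a plain string path is expected.
destination = ProcessingPath.INPUT.value
```

Compared with the plain class, an Enum is iterable and its members cannot be reassigned by accident, at the cost of needing `.value` in places that require a plain string.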

How would this feature be used? Please describe.
This feature would help standardize some of these common locations and provide IDE code-completion support for common parameters when working in SageMaker.

Now our processing inputs and outputs would be:

inputs = [
    ProcessingInput(
        source='s3://path/to/my/input-data.csv',
        destination=SageMakerProcessingChannels.PROCESSING_INPUT_CHANNEL,
    )
]
outputs = [
    ProcessingOutput(
        source=SageMakerProcessingChannels.PROCESSING_TRAIN_OUTPUT_CHANNEL,
        destination='s3://...',
    )
]

Describe alternatives you've considered
We currently use a config file that does this, together with a cookiecutter template that initializes our SageMaker data science projects, to promote uniformity across teams.
