Specifying a code location in ProcessingStep should be optional #4904

Open · system123 opened this issue Oct 21, 2024 · 2 comments
Labels: component: pipelines (Relates to the SageMaker Pipeline Platform), type: question

@system123

Describe the bug
When defining a ProcessingStep using the Python SDK, the pipeline compiler complains if the code= argument is not specified. However, the SDK documentation and code have code=None as the default (which is treated as invalid), and the AWS documentation for processing steps states that the code parameter may be None if the code already exists in the container. In this case the ScriptProcessor already contains the code and defines how to execute it through the command= parameter.

To reproduce
Defining a processing step without a code argument will cause an error.

evaluation_step = ProcessingStep(
    name="EvaluateModel",
    processor=script_processor,
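    # no code= argument is supplied here; compiling the pipeline raises the error shown below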
    inputs=[
        sagemaker.processing.ProcessingInput(
            source=train_step.properties.ModelArtifacts.S3ModelArtifacts,
            destination="/opt/ml/processing/model",
        ),
        sagemaker.processing.ProcessingInput(
            source=input_data_uri,
            destination="/opt/ml/processing/data",
        ),
    ],
    outputs=[
        sagemaker.processing.ProcessingOutput(
            output_name="evaluation",
            source="/opt/ml/processing/evaluation",
            destination="s3://my-bucket/models/"
        ),
    ],
    property_files=[evaluation_report],
)

Expected behavior
If a ScriptProcessor based on a custom image is used, the command should simply be run directly; no specific code needs to be uploaded or pulled into the container. The expected behavior can currently be obtained with the SDK by pointing code to any dummy file on S3 or on the local machine. That file is then pushed to the container, but the command specified by the ScriptProcessor is still executed.
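
For illustration, a minimal sketch of that workaround (the dummy_entrypoint.py path is a placeholder; script_processor, the inputs/outputs, and evaluation_report are as in the snippet above):

evaluation_step = ProcessingStep(
    name="EvaluateModel",
    processor=script_processor,
    # Placeholder file passed only to satisfy the pipeline compiler; per the
    # behavior described above, the container still runs the command
    # configured on the ScriptProcessor.
    code="dummy_entrypoint.py",
    inputs=[...],
    outputs=[...],
    property_files=[evaluation_report],
)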

Screenshots or logs

ValueError: code None url scheme b'' is not recognized. Please pass a file path or S3 url

@fjpa121197

@system123 can you try creating the step this way:

preprocessing_job = ScriptProcessor(
    image_uri=processing_image_uri,
    command=["python3"],
    role=role_pipeline,
    instance_type=instance_type,
    instance_count=instance_count,
    sagemaker_session=pipeline_session,
)

step_args_preprocessing = preprocessing_job.run(
    code=os.path.join(BASE_DIR, "preprocess.py"),
    inputs=[
        ProcessingInput(...)
    ],
    outputs=[
        ProcessingOutput(...)
    ],
)

step_preprocessing = ProcessingStep(
    name="PreprocessingStep",
    step_args=step_args_preprocessing,
)

hope that solves the issue

@qidewenwhen (Member)

Hi @system123, thanks for reaching out!

I have received an internal customer ticket on the same topic and responded to that. Not sure if that was from you, so replying here as well.

The ScriptProcessor, as its name suggests, is for the use case of supplying a custom script or code; that's why its code argument is required. In other words, ScriptProcessor is not intended for the Bring Your Own Processor Container use case.

However, there is a more general class, Processor (sagemaker.processing.Processor), for which you don't need to supply code. Instead, you supply an image URI, which can be your custom image. This class works with ProcessingStep as well. See the example below:

processor = Processor(
    image_uri=IMAGE_URI,
    role=ROLE,
    instance_count=1,
    instance_type=INSTANCE_TYPE,
    sagemaker_session=pipeline_session,
)

step_args = processor.run(
    inputs=processing_input,
)

step = ProcessingStep(
    name="MyProcessingStep",
    step_args=step_args,
)

Hope this can help.
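
As a further illustration (not part of the original reply): if the container's default ENTRYPOINT is not the command you want, Processor also accepts an entrypoint argument (a list of strings) that overrides it. A minimal sketch, assuming IMAGE_URI, ROLE, INSTANCE_TYPE, pipeline_session, and processing_input are defined as above and that the script path is hypothetical and already baked into the image:

from sagemaker.processing import Processor
from sagemaker.workflow.steps import ProcessingStep

processor = Processor(
    image_uri=IMAGE_URI,
    role=ROLE,
    instance_count=1,
    instance_type=INSTANCE_TYPE,
    # Runs code that already exists inside the custom image; no code= upload needed.
    entrypoint=["python3", "/opt/program/evaluate.py"],
    sagemaker_session=pipeline_session,
)

step_args = processor.run(inputs=processing_input)

step = ProcessingStep(
    name="MyProcessingStep",
    step_args=step_args,
)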
