Hi,

I am trying to modify the simple preprocess->train->evaluate->register->transform pipeline example from https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-pipelines/tabular/abalone_build_train_deploy/sagemaker-pipelines-preprocess-train-evaluate-batch-transform.ipynb so that the created model is an inference pipeline that includes the preprocessing step. The intention is to run a batch transformation on raw data that performs the preprocessing and then the XGBoost inference, so I need to save the fitted scikit-learn preprocessing pipeline (PP), but I also want to keep the train/test/val split functionality.

From what I understand, I need to register a PipelineModel, which only supports Estimators. If I change the preprocessing step from a ProcessingStep to a TrainingStep, I could easily save the PP, but I wouldn't be able to save the train/test/val datasets. I thought of two different solutions:

1. Separate the preprocessing into a ProcessingStep and a TrainingStep, where the former performs the train/test/val split and the latter fits and saves the PP. This is straightforward, but I would still need to transform the datasets, so I would have to add a transform step before the XGBoost training step. That seems overcomplicated.
2. Keep the preprocessing step as a ProcessingStep, save the PP to /opt/ml/processing, add the PP to the processing step's output list, and create a model using the PP output variable as the model data. Much simpler; my only hesitation is that I want all my code on S3, and the ProcessingStep expects an S3 URI to a Python file, whereas the model estimator expects a tar file with a specified entry point. For this to work I would need two copies of the processing script on S3 (.py and .tar.gz). In addition, I don't want to be restricted to having all my preprocessing code in one file.

Does anyone know of a solution or a workaround for these limitations?

- Is there a way to create a TrainingStep that can save more results than just the model artifacts?
- Is there a way to create a ProcessingStep that takes a zipped file and entry_point as the code input?

Thanks
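One small mitigation for the "two copies on S3" annoyance in solution 2: you can keep a single local copy of the script and generate the sourcedir tarball the model expects as part of your build/upload step. A minimal sketch with the standard library only; the filenames `preprocessing.py` and `sourcedir.tar.gz` are just illustrative placeholders:

```python
import tarfile
import tempfile
from pathlib import Path

# Hypothetical layout: one local script that serves both as the
# ProcessingStep's code file (uploaded as a plain .py) and as the
# model's entry point (uploaded inside a tar.gz).
workdir = Path(tempfile.mkdtemp())
script = workdir / "preprocessing.py"
script.write_text("def model_fn(model_dir):\n    pass  # placeholder entry point\n")

# Package the same script into the sourcedir tarball a Model expects;
# the plain .py and this tar.gz can then both be uploaded to S3 from
# the single source of truth on disk.
archive = workdir / "sourcedir.tar.gz"
with tarfile.open(archive, "w:gz") as tar:
    tar.add(script, arcname="preprocessing.py")

with tarfile.open(archive, "r:gz") as tar:
    print(tar.getnames())  # → ['preprocessing.py']
```

This doesn't lift the one-file restriction on the ProcessingStep side, but it does mean the .py and the .tar.gz can never drift apart, since both are derived from the same file at deploy time.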
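For what it's worth, solution 2 can be wired up roughly like this with the SageMaker Python SDK. This is an untested configuration sketch, not working code: `step_process`, `xgb_model`, `role`, the entry-point filename, and the framework version are all assumptions, and the processing script would still need to write the fitted PP as a model.tar.gz under the output path for `model_data` to resolve correctly.

```python
# Sketch only: expose the fitted PP as a ProcessingStep output, then feed
# that output into an SKLearnModel and chain it with XGBoost in a
# PipelineModel (the inference pipeline the post is after).
from sagemaker.processing import ProcessingOutput
from sagemaker.sklearn.model import SKLearnModel
from sagemaker.pipeline import PipelineModel

# In the ProcessingStep's outputs, ship the fitted PP alongside the datasets.
model_output = ProcessingOutput(
    output_name="model", source="/opt/ml/processing/model"
)

# After the step runs, reference its S3 location as the model data.
# Caveat from the post: this must point at a model tarball, so the
# processing script has to tar the fitted PP itself.
pp_model = SKLearnModel(
    model_data=step_process.properties.ProcessingOutputConfig
        .Outputs["model"].S3Output.S3Uri,  # assumed tarball location
    entry_point="preprocessing.py",         # inference handlers (input_fn, etc.)
    framework_version="1.2-1",              # assumed version
    role=role,
)

# Inference pipeline: preprocessing container first, then trained XGBoost.
pipeline_model = PipelineModel(models=[pp_model, xgb_model], role=role)
```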