Hi,

I am trying to modify the simple preprocess->train->evaluate->register->transform pipeline example from https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-pipelines/tabular/abalone_build_train_deploy/sagemaker-pipelines-preprocess-train-evaluate-batch-transform.ipynb so that the created model is an inference pipeline that includes the preprocessing step. The intention is to run a batch transformation on raw data that performs the preprocessing and then the XGBoost inference, so I need to save the fitted scikit-learn preprocessing pipeline (PP), but I also want to keep the train/test/val split functionality.

From what I understand, I need to register a PipelineModel, which only supports Estimators. If I change the preprocessing step from a ProcessingStep to a TrainingStep, I could easily save the PP, but I wouldn't be able to save the train/test/val datasets. I thought of two different solutions:

1. Separate the preprocessing into a ProcessingStep and a TrainingStep, where the former performs the train/test/val split and the latter fits and saves the PP. This is straightforward, but I would still need to transform the datasets, so I would have to add a transform step before the XGBoost training step. That seems overcomplicated.
2. Keep the preprocessing step as a ProcessingStep, save the PP to /opt/ml/processing, add the PP to the processing step's output list, and create a model using the PP output variable as the model data. Much simpler; my only hesitation is that I want all my code on S3, and the ProcessingStep expects an S3 URI to a Python file, whereas the model estimator expects a tar file with a specified entry point. For this to work I would need two copies of the processing script on S3 (.py and .tar.gz). In addition, I don't want to be restricted to having all my preprocessing code in one file.

Does anyone know of a solution or a workaround for these limitations?

- Is there a way to create a TrainingStep that can save more results than just the model artifacts?
- Is there a way to create a ProcessingStep that takes a zipped file and entry_point as the code input?

Thanks
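One small mitigation for the "two copies on S3" annoyance in solution 2: you can keep a single local copy of the script and generate the sourcedir tarball the model expects as part of your build/upload step. A minimal sketch with the standard library only; the filenames `preprocessing.py` and `sourcedir.tar.gz` are just illustrative placeholders:

```python
import tarfile
import tempfile
from pathlib import Path

# Hypothetical layout: one local script that serves both as the
# ProcessingStep's code file (uploaded as a plain .py) and as the
# model's entry point (uploaded inside a tar.gz).
workdir = Path(tempfile.mkdtemp())
script = workdir / "preprocessing.py"
script.write_text("def model_fn(model_dir):\n    pass  # placeholder entry point\n")

# Package the same script into the sourcedir tarball a Model expects;
# the plain .py and this tar.gz can then both be uploaded to S3 from
# the single source of truth on disk.
archive = workdir / "sourcedir.tar.gz"
with tarfile.open(archive, "w:gz") as tar:
    tar.add(script, arcname="preprocessing.py")

with tarfile.open(archive, "r:gz") as tar:
    print(tar.getnames())  # → ['preprocessing.py']
```

This doesn't lift the one-file restriction on the ProcessingStep side, but it does mean the .py and the .tar.gz can never drift apart, since both are derived from the same file at deploy time.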
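For what it's worth, solution 2 can be wired up roughly like this with the SageMaker Python SDK. This is an untested configuration sketch, not working code: `step_process`, `xgb_model`, `role`, the entry-point filename, and the framework version are all assumptions, and the processing script would still need to write the fitted PP as a model.tar.gz under the output path for `model_data` to resolve correctly.

```python
# Sketch only: expose the fitted PP as a ProcessingStep output, then feed
# that output into an SKLearnModel and chain it with XGBoost in a
# PipelineModel (the inference pipeline the post is after).
from sagemaker.processing import ProcessingOutput
from sagemaker.sklearn.model import SKLearnModel
from sagemaker.pipeline import PipelineModel

# In the ProcessingStep's outputs, ship the fitted PP alongside the datasets.
model_output = ProcessingOutput(
    output_name="model", source="/opt/ml/processing/model"
)

# After the step runs, reference its S3 location as the model data.
# Caveat from the post: this must point at a model tarball, so the
# processing script has to tar the fitted PP itself.
pp_model = SKLearnModel(
    model_data=step_process.properties.ProcessingOutputConfig
        .Outputs["model"].S3Output.S3Uri,  # assumed tarball location
    entry_point="preprocessing.py",         # inference handlers (input_fn, etc.)
    framework_version="1.2-1",              # assumed version
    role=role,
)

# Inference pipeline: preprocessing container first, then trained XGBoost.
pipeline_model = PipelineModel(models=[pp_model, xgb_model], role=role)
```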