Update README and parameters
thvasilo committed Dec 19, 2024
1 parent 2edf3a6 commit 55d817f
Showing 4 changed files with 105 additions and 63 deletions.
58 changes: 38 additions & 20 deletions sagemaker/pipeline/README.md
@@ -40,9 +40,12 @@ The project consists of three main Python scripts:

1. `create_sm_pipeline.py`: Defines the structure of the SageMaker pipeline
2. `pipeline_parameters.py`: Manages the configuration and parameters for the pipeline
3. `execute_pipeline.py`: Executes created pipelines
3. `execute_sm_pipeline.py`: Executes created pipelines

## Access code and install dependencies
## Installation

To construct and execute GraphStorm SageMaker pipelines, you need the GraphStorm code
available locally and a Python environment with the SageMaker SDK and `boto3` installed.
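
For example, assuming a working Python 3 environment, the two Python dependencies can be installed with `pip`:

```bash
pip install sagemaker boto3
```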

1. Clone the GraphStorm repository:
```
@@ -63,15 +66,15 @@ To create a new SageMaker pipeline for GraphStorm:

```bash
python create_sm_pipeline.py \
--role arn:aws:iam::123456789012:role/SageMakerRole \
--execution-role arn:aws:iam::123456789012:role/SageMakerRole \
--region us-west-2 \
--graphstorm-pytorch-image-url 123456789012.dkr.ecr.us-west-2.amazonaws.com/graphstorm:sm-cpu \
--graphstorm-pytorch-cpu-image-url 123456789012.dkr.ecr.us-west-2.amazonaws.com/graphstorm:sagemaker-cpu \
--instance-count 2 \
--jobs-to-run gconstruct train inference \
--graph-name my-graph \
--graph-construction-config-filename my_gconstruct_config.json \
--input-data-s3 s3://input-bucket/data \
--output-prefix-s3 s3://output-bucket/results \
--output-prefix s3://output-bucket/results \
--train-inference-task node_classification \
--train-yaml-s3 s3://config-bucket/train.yaml
```
@@ -83,7 +86,7 @@ to construct the graph and the train config file at `s3://config-bucket/train.ya
to run training and inference.

The `--instance-count` parameter determines the number of workers and partitions we will create and use
during partitioning/training. It is also aliased to `--num-parts`.
during partitioning/training.

You can customize various aspects of the pipeline using additional command-line arguments. Refer to the script's help message for a full list of options:

@@ -96,15 +99,15 @@ python create_sm_pipeline.py --help
To execute a created pipeline:

```bash
python execute_pipeline.py \
python execute_sm_pipeline.py \
--pipeline-name my-graphstorm-pipeline \
--region us-west-2
```

You can override various pipeline parameters during execution:

```bash
python execute_pipeline.py \
python execute_sm_pipeline.py \
--pipeline-name my-graphstorm-pipeline \
--region us-west-2 \
--instance-count 4 \
@@ -114,20 +117,22 @@ python execute_pipeline.py \
For a full list of execution options:

```bash
python execute_pipeline.py --help
python execute_sm_pipeline.py --help
```

For more fine-grained execution options see the
For more fine-grained execution options, like selective execution, see the
[SageMaker AI documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-selective-ex.html).
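
Selective execution re-runs only chosen steps of a previous execution. As a rough sketch using the AWS CLI (the step name and execution ARN below are placeholders; check your pipeline's actual step names first):

```bash
aws sagemaker start-pipeline-execution \
    --pipeline-name my-graphstorm-pipeline \
    --region us-west-2 \
    --selective-execution-config '{
        "SourcePipelineExecutionArn": "arn:aws:sagemaker:us-west-2:123456789012:pipeline/my-graphstorm-pipeline/execution/<execution-id>",
        "SelectedSteps": [{"StepName": "<step-to-rerun>"}]
    }'
```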

## Pipeline Components

The GraphStorm SageMaker pipeline typically includes the following steps:
The GraphStorm SageMaker pipeline can include the following steps:

1. **Graph Construction**: Builds the graph from input data
2. **Graph Partitioning**: Partitions the graph for distributed processing
3. **Training**: Trains the graph neural network model
4. **Inference**: Runs inference on the trained model
1. **Graph Construction (GConstruct)**: Builds the partitioned graph from input data on a single instance.
2. **Graph Processing (GSProcessing)**: Processes the graph data using PySpark, preparing it for distributed partitioning.
3. **Graph Partitioning (DistPart)**: Partitions the graph using multiple instances.
4. **GraphBolt Conversion**: Converts the partitioned data (usually generated from DistPart) to GraphBolt format.
5. **Training**: Trains the graph neural network model.
6. **Inference**: Runs inference on the trained model.

Each step is configurable and can be customized based on your specific requirements.

@@ -165,27 +170,40 @@ python create_sm_pipeline.py \
--use-graphbolt true
```

will create a pipeline that uses GSProcessing to process and prepare the data for partitioning,
use GSPartition to partition the data, convert the partitioned data to the GraphBolt format,
This will create a pipeline that will use GSProcessing to process and prepare the data for partitioning,
use DistPart to partition the data, convert the partitioned data to the GraphBolt format,
then run a train and an inference job in sequence.

You can use this job sequence when your graph is too large to partition on one instance using
GConstruct (1+ TB is the suggested threshold to move to distributed partitioning).
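
For example, when executing such a pipeline you can override the instance type and count used for the construction and partition steps (the instance type below is only illustrative):

```bash
python execute_sm_pipeline.py \
    --pipeline-name my-graphstorm-pipeline \
    --region us-west-2 \
    --graphconstruct-instance-type ml.r5.24xlarge \
    --instance-count 8
```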

### Asynchronous Execution

To start a pipeline execution without waiting for it to complete:

```bash
python execute_pipeline.py \
python execute_sm_pipeline.py \
--pipeline-name my-graphstorm-pipeline \
--region us-west-2 \
--async-execution
```
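
After starting an execution asynchronously, one way to check on it later is the AWS CLI (assuming it is configured for the same account and region):

```bash
# List recent executions of the pipeline and their status
aws sagemaker list-pipeline-executions \
    --pipeline-name my-graphstorm-pipeline \
    --region us-west-2
```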

### Local Execution

For testing purposes, you can execute the pipeline locally:

```bash
python execute_sm_pipeline.py \
--pipeline-name my-graphstorm-pipeline \
--local-execution
```

Note that local execution requires a GPU if the pipeline is configured to use GPU instances.
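
The execution script performs this check by calling `nvidia-smi`, so you can run the same command to verify a GPU is visible before attempting a local GPU execution:

```bash
nvidia-smi
```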

## Troubleshooting

- Ensure all required AWS permissions are correctly set up
- Check SageMaker execution logs for detailed error messages
- Verify that all S3 paths are correct and accessible. Note trailing `/` that could cause issues.
- Verify that all S3 paths are correct and accessible
- Ensure that the specified EC2 instance types are available in your region

See also [Troubleshooting Amazon SageMaker Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-troubleshooting.html)
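
For example, to find which step of an execution failed and why, you can list the execution's steps with the AWS CLI (the execution ARN is a placeholder) and inspect each step's status and failure reason:

```bash
aws sagemaker list-pipeline-execution-steps \
    --pipeline-execution-arn <execution-arn> \
    --region us-west-2
```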
18 changes: 9 additions & 9 deletions sagemaker/pipeline/create_sm_pipeline.py
@@ -290,7 +290,7 @@ def _create_gconstruct_step(self, args: PipelineArgs) -> ProcessingStep:

gconstruct_processor = ScriptProcessor(
image_uri=args.aws_config.graphstorm_pytorch_cpu_image_url,
role=args.aws_config.role,
role=args.aws_config.execution_role,
instance_count=1,
instance_type=self.graphconstruct_instance_type_param,
command=["python3"],
@@ -360,7 +360,7 @@ def _create_gconstruct_step(self, args: PipelineArgs) -> ProcessingStep:
def _create_gsprocessing_step(self, args: PipelineArgs) -> ProcessingStep:
# Implementation for GSProcessing step
pyspark_processor = PySparkProcessor(
role=args.aws_config.role,
role=args.aws_config.execution_role,
instance_type=args.instance_config.graph_construction_instance_type,
instance_count=args.instance_config.gsprocessing_instance_count,
image_uri=args.aws_config.gsprocessing_pyspark_image_url,
@@ -438,9 +438,9 @@ def _create_dist_part_step(self, args: PipelineArgs) -> ProcessingStep:
# Implementation for DistPartition step
dist_part_processor = ScriptProcessor(
image_uri=args.aws_config.graphstorm_pytorch_cpu_image_url,
role=args.aws_config.role,
role=args.aws_config.execution_role,
instance_count=self.instance_count_param,
instance_type=self.cpu_instance_type_param,
instance_type=self.graphconstruct_instance_type_param,
command=["python3"],
sagemaker_session=self.pipeline_session,
volume_size_in_gb=self.volume_size_gb_param,
@@ -496,7 +496,7 @@ def _create_gb_convert_step(self, args: PipelineArgs) -> ProcessingStep:
# Implementation for GraphBolt partition step
gb_part_processor = ScriptProcessor(
image_uri=args.aws_config.graphstorm_pytorch_cpu_image_url,
role=args.aws_config.role,
role=args.aws_config.execution_role,
instance_count=1,
instance_type=self.graphconstruct_instance_type_param,
command=["python3"],
@@ -543,7 +543,7 @@ def _create_train_step(self, args: PipelineArgs) -> TrainingStep:
entry_point=os.path.basename(args.script_paths.train_script),
source_dir=os.path.dirname(args.script_paths.train_script),
image_uri=self.train_infer_image,
role=args.aws_config.role,
role=args.aws_config.execution_role,
instance_count=self.instance_count_param,
instance_type=self.train_infer_instance,
py_version="py3",
@@ -627,7 +627,7 @@ def _create_inference_step(self, args: PipelineArgs) -> TrainingStep:
entry_point=os.path.basename(args.script_paths.inference_script),
source_dir=os.path.dirname(args.script_paths.inference_script),
image_uri=self.train_infer_image,
role=args.aws_config.role,
role=args.aws_config.execution_role,
instance_count=self.instance_count_param,
instance_type=self.train_infer_instance,
py_version="py3",
@@ -682,10 +682,10 @@ def main():

if pipeline_args.update:
# TODO: If updating ensure pipeline exists first to get more informative error
pipeline.update(role_arn=pipeline_args.aws_config.role)
pipeline.update(role_arn=pipeline_args.aws_config.execution_role)
print(f"Pipeline '{pipeline.name}' updated successfully.")
else:
pipeline.create(role_arn=pipeline_args.aws_config.role)
pipeline.create(role_arn=pipeline_args.aws_config.execution_role)
print(f"Pipeline '{pipeline.name}' created successfully.")


19 changes: 13 additions & 6 deletions sagemaker/pipeline/execute_sm_pipeline.py
@@ -75,10 +75,17 @@ def parse_args():
# Optional override parameters
overrides.add_argument("--instance-count", type=int, help="Override instance count")
overrides.add_argument(
"--cpu-instance-type", type=str, help="Override CPU instance type"
"--cpu-instance-type",
type=str,
help="Override CPU instance type. "
"Always used in DistPart step and if '--train-on-cpu' is provided, "
"in Train and Inference steps.",
)
overrides.add_argument(
"--gpu-instance-type", type=str, help="Override GPU instance type"
"--gpu-instance-type",
type=str,
help="Override GPU instance type. "
"Used by default in in Train and Inference steps, unless '--train-on-cpu' is provided.",
)
overrides.add_argument(
"--graphconstruct-instance-type",
@@ -144,12 +151,12 @@ def main():
# Ensure GPU is available if trying to execute with GPU locally
if not pipeline_deploy_args.instance_config.train_on_cpu:
try:
subprocess.check_output('nvidia-smi')
subprocess.check_output("nvidia-smi")
except Exception:
raise RuntimeError(
'Need host with NVidia GPU to run training on GPU! '
"Need host with NVidia GPU to run training on GPU! "
"Try re-deploying the pipeline with --train-on-cpu set."
)
)
# Use local pipeline and session
local_session = LocalPipelineSession()
pipeline_generator = GraphStormPipelineGenerator(
@@ -162,7 +169,7 @@
}
pipeline = pipeline_generator.create_pipeline()
pipeline.sagemaker_session = local_session
pipeline.create(role_arn=pipeline_deploy_args.aws_config.role)
pipeline.create(role_arn=pipeline_deploy_args.aws_config.execution_role)
else:
assert args.region, "Need to provide --region for remote SageMaker execution"
boto_session = boto3.Session(region_name=args.region)
