Update README and parameters
thvasilo committed Dec 19, 2024
1 parent 2edf3a6 commit 55d817f
Showing 4 changed files with 105 additions and 63 deletions.
58 changes: 38 additions & 20 deletions sagemaker/pipeline/README.md
@@ -40,9 +40,12 @@ The project consists of three main Python scripts:

1. `create_sm_pipeline.py`: Defines the structure of the SageMaker pipeline
2. `pipeline_parameters.py`: Manages the configuration and parameters for the pipeline
3. `execute_pipeline.py`: Executes created pipelines
3. `execute_sm_pipeline.py`: Executes created pipelines

## Access code and install dependencies
## Installation

To construct and execute GraphStorm SageMaker pipelines, you need the GraphStorm code
available locally and a Python environment with the SageMaker SDK and `boto3` installed.
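
For example, assuming a working Python 3 environment, the two Python dependencies can be installed with `pip`:

```bash
pip install sagemaker boto3
```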

1. Clone the GraphStorm repository:
```
@@ -63,15 +66,15 @@ To create a new SageMaker pipeline for GraphStorm:

```bash
python create_sm_pipeline.py \
--role arn:aws:iam::123456789012:role/SageMakerRole \
--execution-role arn:aws:iam::123456789012:role/SageMakerRole \
--region us-west-2 \
--graphstorm-pytorch-image-url 123456789012.dkr.ecr.us-west-2.amazonaws.com/graphstorm:sm-cpu \
--graphstorm-pytorch-cpu-image-url 123456789012.dkr.ecr.us-west-2.amazonaws.com/graphstorm:sagemaker-cpu \
--instance-count 2 \
--jobs-to-run gconstruct train inference \
--graph-name my-graph \
--graph-construction-config-filename my_gconstruct_config.json \
--input-data-s3 s3://input-bucket/data \
--output-prefix-s3 s3://output-bucket/results \
--output-prefix s3://output-bucket/results \
--train-inference-task node_classification \
--train-yaml-s3 s3://config-bucket/train.yaml
```
@@ -83,7 +86,7 @@ to construct the graph and the train config file at `s3://config-bucket/train.ya
to run training and inference.

The `--instance-count` parameter determines the number of workers and partitions we will create and use
during partitioning/training. It is also aliased to `--num-parts`.
during partitioning/training.

You can customize various aspects of the pipeline using additional command-line arguments. Refer to the script's help message for a full list of options:

@@ -96,15 +99,15 @@ python create_sm_pipeline.py --help
To execute a created pipeline:

```bash
python execute_pipeline.py \
python execute_sm_pipeline.py \
--pipeline-name my-graphstorm-pipeline \
--region us-west-2
```

You can override various pipeline parameters during execution:

```bash
python execute_pipeline.py \
python execute_sm_pipeline.py \
--pipeline-name my-graphstorm-pipeline \
--region us-west-2 \
--instance-count 4 \
@@ -114,20 +117,22 @@ python execute_pipeline.py \
For a full list of execution options:

```bash
python execute_pipeline.py --help
python execute_sm_pipeline.py --help
```

For more fine-grained execution options see the
For more fine-grained execution options, like selective execution, see the
[SageMaker AI documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-selective-ex.html).
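
Selective execution re-runs only chosen steps of a previous execution. As a rough sketch using the AWS CLI (the step name and execution ARN below are placeholders; check your pipeline's actual step names first):

```bash
aws sagemaker start-pipeline-execution \
    --pipeline-name my-graphstorm-pipeline \
    --region us-west-2 \
    --selective-execution-config '{
        "SourcePipelineExecutionArn": "arn:aws:sagemaker:us-west-2:123456789012:pipeline/my-graphstorm-pipeline/execution/<execution-id>",
        "SelectedSteps": [{"StepName": "<step-to-rerun>"}]
    }'
```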

## Pipeline Components

The GraphStorm SageMaker pipeline typically includes the following steps:
The GraphStorm SageMaker pipeline can include the following steps:

1. **Graph Construction**: Builds the graph from input data
2. **Graph Partitioning**: Partitions the graph for distributed processing
3. **Training**: Trains the graph neural network model
4. **Inference**: Runs inference on the trained model
1. **Graph Construction (GConstruct)**: Builds the partitioned graph from input data on a single instance.
2. **Graph Processing (GSProcessing)**: Processes the graph data using PySpark, preparing it for distributed partitioning.
3. **Graph Partitioning (DistPart)**: Partitions the graph using multiple instances.
4. **GraphBolt Conversion**: Converts the partitioned data (usually generated from DistPart) to GraphBolt format.
5. **Training**: Trains the graph neural network model.
6. **Inference**: Runs inference on the trained model.

Each step is configurable and can be customized based on your specific requirements.

@@ -165,27 +170,40 @@ python create_sm_pipeline.py \
--use-graphbolt true
```

will create a pipeline that uses GSProcessing to process and prepare the data for partitioning,
use GSPartition to partition the data, convert the partitioned data to the GraphBolt format,
This will create a pipeline that will use GSProcessing to process and prepare the data for partitioning,
use DistPart to partition the data, convert the partitioned data to the GraphBolt format,
then run a train and an inference job in sequence.

You can use this job sequence when your graph is too large to partition on one instance using
GConstruct (1+ TB is the suggested threshold to move to distributed partitioning).
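
For example, when executing such a pipeline you can override the instance type and count used for the construction and partition steps (the instance type below is only illustrative):

```bash
python execute_sm_pipeline.py \
    --pipeline-name my-graphstorm-pipeline \
    --region us-west-2 \
    --graphconstruct-instance-type ml.r5.24xlarge \
    --instance-count 8
```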

### Asynchronous Execution

To start a pipeline execution without waiting for it to complete:

```bash
python execute_pipeline.py \
python execute_sm_pipeline.py \
--pipeline-name my-graphstorm-pipeline \
--region us-west-2 \
--async-execution
```
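
After starting an execution asynchronously, one way to check on it later is the AWS CLI (assuming it is configured for the same account and region):

```bash
# List recent executions of the pipeline and their status
aws sagemaker list-pipeline-executions \
    --pipeline-name my-graphstorm-pipeline \
    --region us-west-2
```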

### Local Execution

For testing purposes, you can execute the pipeline locally:

```bash
python execute_sm_pipeline.py \
--pipeline-name my-graphstorm-pipeline \
--local-execution
```

Note that local execution requires a GPU if the pipeline is configured to use GPU instances.
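
The execution script performs this check by calling `nvidia-smi`, so you can run the same command to verify a GPU is visible before attempting a local GPU execution:

```bash
nvidia-smi
```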

## Troubleshooting

- Ensure all required AWS permissions are correctly set up
- Check SageMaker execution logs for detailed error messages
- Verify that all S3 paths are correct and accessible. Note trailing `/` that could cause issues.
- Verify that all S3 paths are correct and accessible
- Ensure that the specified EC2 instance types are available in your region

See also [Troubleshooting Amazon SageMaker Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-troubleshooting.html)
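
For example, to find which step of an execution failed and why, you can list the execution's steps with the AWS CLI (the execution ARN is a placeholder) and inspect each step's status and failure reason:

```bash
aws sagemaker list-pipeline-execution-steps \
    --pipeline-execution-arn <execution-arn> \
    --region us-west-2
```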
18 changes: 9 additions & 9 deletions sagemaker/pipeline/create_sm_pipeline.py
@@ -290,7 +290,7 @@ def _create_gconstruct_step(self, args: PipelineArgs) -> ProcessingStep:

gconstruct_processor = ScriptProcessor(
image_uri=args.aws_config.graphstorm_pytorch_cpu_image_url,
role=args.aws_config.role,
role=args.aws_config.execution_role,
instance_count=1,
instance_type=self.graphconstruct_instance_type_param,
command=["python3"],
@@ -360,7 +360,7 @@ def _create_gconstruct_step(self, args: PipelineArgs) -> ProcessingStep:
def _create_gsprocessing_step(self, args: PipelineArgs) -> ProcessingStep:
# Implementation for GSProcessing step
pyspark_processor = PySparkProcessor(
role=args.aws_config.role,
role=args.aws_config.execution_role,
instance_type=args.instance_config.graph_construction_instance_type,
instance_count=args.instance_config.gsprocessing_instance_count,
image_uri=args.aws_config.gsprocessing_pyspark_image_url,
@@ -438,9 +438,9 @@ def _create_dist_part_step(self, args: PipelineArgs) -> ProcessingStep:
# Implementation for DistPartition step
dist_part_processor = ScriptProcessor(
image_uri=args.aws_config.graphstorm_pytorch_cpu_image_url,
role=args.aws_config.role,
role=args.aws_config.execution_role,
instance_count=self.instance_count_param,
instance_type=self.cpu_instance_type_param,
instance_type=self.graphconstruct_instance_type_param,
command=["python3"],
sagemaker_session=self.pipeline_session,
volume_size_in_gb=self.volume_size_gb_param,
@@ -496,7 +496,7 @@ def _create_gb_convert_step(self, args: PipelineArgs) -> ProcessingStep:
# Implementation for GraphBolt partition step
gb_part_processor = ScriptProcessor(
image_uri=args.aws_config.graphstorm_pytorch_cpu_image_url,
role=args.aws_config.role,
role=args.aws_config.execution_role,
instance_count=1,
instance_type=self.graphconstruct_instance_type_param,
command=["python3"],
@@ -543,7 +543,7 @@ def _create_train_step(self, args: PipelineArgs) -> TrainingStep:
entry_point=os.path.basename(args.script_paths.train_script),
source_dir=os.path.dirname(args.script_paths.train_script),
image_uri=self.train_infer_image,
role=args.aws_config.role,
role=args.aws_config.execution_role,
instance_count=self.instance_count_param,
instance_type=self.train_infer_instance,
py_version="py3",
@@ -627,7 +627,7 @@ def _create_inference_step(self, args: PipelineArgs) -> TrainingStep:
entry_point=os.path.basename(args.script_paths.inference_script),
source_dir=os.path.dirname(args.script_paths.inference_script),
image_uri=self.train_infer_image,
role=args.aws_config.role,
role=args.aws_config.execution_role,
instance_count=self.instance_count_param,
instance_type=self.train_infer_instance,
py_version="py3",
@@ -682,10 +682,10 @@ def main():

if pipeline_args.update:
# TODO: If updating ensure pipeline exists first to get more informative error
pipeline.update(role_arn=pipeline_args.aws_config.role)
pipeline.update(role_arn=pipeline_args.aws_config.execution_role)
print(f"Pipeline '{pipeline.name}' updated successfully.")
else:
pipeline.create(role_arn=pipeline_args.aws_config.role)
pipeline.create(role_arn=pipeline_args.aws_config.execution_role)
print(f"Pipeline '{pipeline.name}' created successfully.")


19 changes: 13 additions & 6 deletions sagemaker/pipeline/execute_sm_pipeline.py
@@ -75,10 +75,17 @@ def parse_args():
# Optional override parameters
overrides.add_argument("--instance-count", type=int, help="Override instance count")
overrides.add_argument(
"--cpu-instance-type", type=str, help="Override CPU instance type"
"--cpu-instance-type",
type=str,
help="Override CPU instance type. "
"Always used in DistPart step and if '--train-on-cpu' is provided, "
"in Train and Inference steps.",
)
overrides.add_argument(
"--gpu-instance-type", type=str, help="Override GPU instance type"
"--gpu-instance-type",
type=str,
help="Override GPU instance type. "
"Used by default in in Train and Inference steps, unless '--train-on-cpu' is provided.",
)
overrides.add_argument(
"--graphconstruct-instance-type",
@@ -144,12 +151,12 @@ def main():
# Ensure GPU is available if trying to execute with GPU locally
if not pipeline_deploy_args.instance_config.train_on_cpu:
try:
subprocess.check_output('nvidia-smi')
subprocess.check_output("nvidia-smi")
except Exception:
raise RuntimeError(
'Need host with NVidia GPU to run training on GPU! '
"Need host with NVidia GPU to run training on GPU! "
"Try re-deploying the pipeline with --train-on-cpu set."
)
)
# Use local pipeline and session
local_session = LocalPipelineSession()
pipeline_generator = GraphStormPipelineGenerator(
@@ -162,7 +169,7 @@
}
pipeline = pipeline_generator.create_pipeline()
pipeline.sagemaker_session = local_session
pipeline.create(role_arn=pipeline_deploy_args.aws_config.role)
pipeline.create(role_arn=pipeline_deploy_args.aws_config.execution_role)
else:
assert args.region, "Need to provide --region for remote SageMaker execution"
boto_session = boto3.Session(region_name=args.region)
