From f2622032c7fddf78364c6dcb78030986ff771ec4 Mon Sep 17 00:00:00 2001
From: Gabriel Hare
Date: Tue, 11 Apr 2023 16:28:28 -0700
Subject: [PATCH 1/5] ReWoTe steps 1-4

---
 GabrielHare/Containerization-HPC.md | 44 +++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)
 create mode 100644 GabrielHare/Containerization-HPC.md

diff --git a/GabrielHare/Containerization-HPC.md b/GabrielHare/Containerization-HPC.md
new file mode 100644
index 00000000..bf3cb8db
--- /dev/null
+++ b/GabrielHare/Containerization-HPC.md
@@ -0,0 +1,44 @@
+# Containerization / Benchmarks (HPC)
+
+> Ideal candidate: skilled HPC engineer versed in containers
+
+# Overview
+
+The aim of this task is to build an HPC-compatible container (e.g. [Singularity](https://sylabs.io/guides/3.5/user-guide/introduction.html)) and test its performance in comparison with a native installation (no containerization) for a set of distributed memory calculations.
+
+# Requirements
+
+1. A working deployment pipeline - using any preferred tool such as SaltStack, Terraform, CloudFormation - for building out the computational infrastructure
+2. A pipeline for building the HPC-compatible container
+3. A set of benchmarks for one or more HPC applications on one or more cloud instance types
+
+# Expectations
+
+- The application may be relatively simple - e.g. Linpack; the focus is on infrastructure
+- Repeatable approach (no manual setup "in console")
+- Clean workflow logic
+
+# Timeline
+
+We leave exact timing to the candidate. Should fit within 5 days total.
+
+# User story
+
+As a user of this pipeline I can:
+
+- build an HPC-compatible container for an HPC executable/code
+- run test calculations to assert the working state of this container
+- (optional) compare the behavior of this container with an OS-native installation
+
+# Notes
+
+- Commit early and often
+
+# Suggestions
+
+We suggest:
+
+- using AWS as the cloud provider
+- using Exabench as the source of benchmarks: https://github.com/Exabyte-io/exabyte-benchmarks-suite
+- using CentOS or similar as the operating system
+- using SaltStack, or Terraform, for infrastructure management

From 80222f27c8063da5d0f4b5974991533deaa2df70 Mon Sep 17 00:00:00 2001
From: Gabriel Hare
Date: Thu, 20 Apr 2023 23:52:15 -0700
Subject: [PATCH 2/5] Configuration for AWS Parallel Cluster t2.micro Ubuntu 20.04

---
 .../mat3ra-benchmark-cluster-config.yaml | 22 +++++++++++++++++++
 1 file changed, 22 insertions(+)
 create mode 100644 GabrielHare/mat3ra-benchmark-cluster-config.yaml

diff --git a/GabrielHare/mat3ra-benchmark-cluster-config.yaml b/GabrielHare/mat3ra-benchmark-cluster-config.yaml
new file mode 100644
index 00000000..71ecfe9f
--- /dev/null
+++ b/GabrielHare/mat3ra-benchmark-cluster-config.yaml
@@ -0,0 +1,22 @@
+Region: us-west-1
+Image:
+  Os: ubuntu2004
+HeadNode:
+  InstanceType: t2.micro
+  Networking:
+    SubnetId: subnet-0b07a91d85deb0cb3
+  Ssh:
+    KeyName: Reification10_Ubuntu
+Scheduling:
+  Scheduler: slurm
+  SlurmQueues:
+    - Name: queue1
+      ComputeResources:
+        - Name: t2micro
+          Instances:
+            - InstanceType: t2.micro
+          MinCount: 0
+          MaxCount: 16
+      Networking:
+        SubnetIds:
+          - subnet-0e15af99ba3c799e0

From 10415f082cd58ecb13cf044e2262bb18aafb4e5d Mon Sep 17 00:00:00 2001
From: Gabriel Hare
Date: Fri, 21 Apr 2023 00:28:57 -0700
Subject: [PATCH 3/5] Create AWS_HPL_Setup

Add recipe to run HPL benchmarks on AWS Parallel Cluster
---
 GabrielHare/AWS_HPL_Setup | 352 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 352 insertions(+)
 create mode 100644 GabrielHare/AWS_HPL_Setup
diff --git a/GabrielHare/AWS_HPL_Setup b/GabrielHare/AWS_HPL_Setup
new file mode 100644
index 00000000..a33f7e8b
--- /dev/null
+++ b/GabrielHare/AWS_HPL_Setup
@@ -0,0 +1,352 @@
+# AWS Parallel Cluster Setup
+[Following general Setup instructions from AWS docs](https://docs.aws.amazon.com/parallelcluster/latest/ug/install-v3-virtual-environment.html)
+Local Platform Only
+Ubuntu 20.04 (Focal Fossa)
+
+NOTE: Performance improvement option:
+[https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-instance-types](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-instance-types)
+
+NOTE: Logging option:
+[https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html](https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html)
+
+NOTE: Relevant example use-cases:
+[https://sunhwan.github.io/blog/2021/04/17/Run-Molecular-Dynamics-Simulation-on-AWS-Cluster.html](https://sunhwan.github.io/blog/2021/04/17/Run-Molecular-Dynamics-Simulation-on-AWS-Cluster.html)
+[https://aws.amazon.com/blogs/hpc/running-20k-simulations-in-3-days-with-aws-batch/](https://aws.amazon.com/blogs/hpc/running-20k-simulations-in-3-days-with-aws-batch/)
+
+## Install
+(1) Install python3, pip, & virtualenv
+
+    $ apt install python3
+    $ python3 -m pip install --upgrade pip
+    $ python3 -m pip install --user --upgrade virtualenv
+
+(2) Install AWS Parallel Cluster into virtual environment apc
+
+    $ mkdir [your cluster project] && cd [your cluster project]
+    $ virtualenv apc
+    $ source apc/bin/activate
+    $ pip install --upgrade "aws-parallelcluster"
+
+Verify:
+
+    $ pcluster version
+
+(3) Install Node Version Manager
+
+    $ curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.38.0/install.sh | bash
+    $ chmod ug+x ~/.nvm/nvm.sh
+    $ source ~/.nvm/nvm.sh
+    $ nvm install --lts
+
+Verify:
+
+    $ node --version
+
+## Deploy cluster
+(4) Configure & Create cluster
+
+    $ pcluster configure --config [your cluster config].yaml
+
+This will start a command line interface to specify the cluster configuration.
+EXAMPLE:
+ - One queue, named "queue1"
+ - Nodes are t2.micro instances
+ - Platform is Ubuntu 20.04 (focal fossa)
+ - Head node has public IP address, cluster nodes are private
+ - Maximum of 16 instances
+
+    $ pcluster create-cluster --cluster-configuration [your cluster config].yaml --cluster-name [your cluster name]
+
+Expected response:
+
+    {
+      "cluster": {
+        "clusterName": "[your cluster name]",
+        "cloudformationStackStatus": "CREATE_IN_PROGRESS",
+        "cloudformationStackArn": "arn:aws:cloudformation:us-west-1:918364550460:stack/mat3ra-benchmark/e06cb140-dbfa-11ed-8526-0658b83a89f5",
+        "region": "us-west-1",
+        "version": "3.5.1",
+        "clusterStatus": "CREATE_IN_PROGRESS",
+        "scheduler": {
+          "type": "slurm"
+        }
+      }
+    }
+
+Wait for cluster creation to complete
+
+    $ watch -n 10 pcluster describe-cluster --cluster-name [your cluster name]
+
+Cluster creation is complete when:
+
+    "clusterStatus": "CREATE_COMPLETE"
+
+NOTE: The EC2 console will show only the cluster head node. Additional nodes will be created as needed, and will be terminated after a period of inactivity.
+
+NOTE: The configuration file will persist, and can be used to recreate the cluster, provided that the VPC identified in the configuration still exists.
+
+## Log in
+(5) Log in to cluster head node
+
+    $ pcluster ssh --cluster-name [your cluster name]
+
+WARNING: DO NOT UPGRADE - this will stall out at 79%, and the cluster will need to be recreated!
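+
+OPTIONAL: As a quick smoke test (a minimal sketch, assuming the example queue above), a trivial SLURM job confirms that compute nodes launch and join the cluster:
+
+    $ srun -N 2 hostname
+
+Expect two distinct private hostnames once the compute instances have spun up (node provisioning can take several minutes).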
+
+# HPL Benchmark setup
+[Following general setup instructions, with modification for AWS instances](https://alan-turing-institute.github.io/data-science-benchmarking/examples/HPL_benchmarks_workflow_example.html)
+Remote Platform Only
+Ubuntu 20.04 (focal fossa)
+
+## Install linear algebra library
+(1) Confirm [ATLAS](https://math-atlas.sourceforge.net) development library is installed
+
+    $ dpkg --listfiles libatlas-base-dev | grep 'atlas_buildinfo\.h'
+
+Expected response:
+
+    /usr/include/x86_64-linux-gnu/atlas/atlas_buildinfo.h
+
+EXPECT: Install include path is:
+
+    /usr/include/x86_64-linux-gnu/atlas/
+
+If ATLAS is not found, install it for development:
+
+    $ apt install -y libatlas-base-dev
+
+## Install MPI library
+(2) Confirm [OpenMPI](https://docs.open-mpi.org) is installed and active
+
+    $ which mpirun
+
+Expected response:
+
+    /opt/amazon/openmpi/bin/mpirun
+
+If OpenMPI is not found, [install it for development](https://packages.ubuntu.com/focal/libopenmpi-dev):
+
+    $ apt install -y openmpi-bin libopenmpi-dev
+
+(2a) OPTIONAL: Enable [Intel MPI](https://www.intel.com/content/www/us/en/docs/mpi-library/get-started-guide-linux/2021-6/overview.html)
+Confirm that IntelMPI is [available on cluster nodes](https://docs.aws.amazon.com/parallelcluster/latest/ug/intelmpi.html) as a [module](https://uisapp2.iu.edu/confluence-prd/pages/viewpage.action?pageId=115540061)
+
+    $ module avail
+
+Expect response to include "intelmpi"
+Switch to using IntelMPI
+
+    $ module load intelmpi
+
+Confirm switch
+
+    $ which mpirun
+
+Expect response:
+
+    /opt/intel/mpi/2021.6.0/bin/mpirun
+
+IMPORTANT: IntelMPI will need to be enabled on each node as a part of the launch script, otherwise HPL will not be able to run.
+
+NOTE: Another MPI option is [MPICH](https://www.mpich.org/about/overview/)
+[https://stackoverflow.com/a/25493270/5917300](https://stackoverflow.com/a/25493270/5917300)
+
+## Install HPL
+
+[Following these directions for HPL install](https://gist.github.com/Levi-Hope/27b9c32cc5c9ded78fff3f155fc7b5ea)
+NOTE: "wget", "nano", and "build-essential" are pre-installed on AWS nodes
+
+(3) Confirm latest HPL version in repository (currently 2.3):
+[https://netlib.org/benchmark/hpl/](https://netlib.org/benchmark/hpl/)
+
+(4) Download HPL source to user directory, which will be shared by all nodes.
+
+    $ cd ~
+    $ wget https://www.netlib.org/benchmark/hpl/hpl-2.3.tar.gz
+    $ tar -xf hpl-2.3.tar.gz
+
+(5) Configure HPL build
+
+    $ mv hpl-2.3 hpl
+    $ cd hpl/setup
+    $ sh make_generic
+    $ mv Make.UNKNOWN ../Make.linux
+    $ cd ../
+
+Edit the Make.linux file
+NOTE: In Make.linux "TOPdir = ${HOME}/hpl/", which is the reason for the rename above
+
+    $ nano Make.linux
+
+> ARCH = linux
+> LAinc = /usr/include/x86_64-linux-gnu/atlas/
+
+If using Intel MPI
+
+> MPinc = /opt/intel/mpi/2021.6.0/include/
+> MPlib = /opt/intel/mpi/2021.6.0/lib/release/libmpi.so
+
+If using OpenMPI
+
+> MPinc = /opt/amazon/openmpi/include/
+> MPlib = /opt/amazon/openmpi/lib/libmpi.so
+
+Build xhpl
+
+    $ make arch=linux
+
+NOTE: Expect "gcc: warning: ..." messages, but no errors
+IMPORTANT: The path to "xhpl" is "${HOME}/hpl/bin/linux", so it must be added to the PATH on each instance
+
+Confirm build using a test for one node
+
+    $ cd bin/linux
+    $ mv HPL.dat default_HPL.dat
+    $ nano HPL.dat
+
+Paste the following into HPL.dat (every line must be included)
+
+> HPLinpack benchmark input file
+> Innovative Computing Laboratory, University of Tennessee
+> HPL.out output file name (if any)
+> 6 device out (6=stdout,7=stderr,file)
+> 1 # of problems sizes (N)
+> 5040 Ns
+> 1 # of NBs
+> 128 NBs
+> 0 PMAP process mapping (0=Row-,1=Column-major)
+> 1 # of process grids (P x Q)
+> 1 Ps
+> 1 Qs
+> 16.0 threshold
+> 1 # of panel fact
+> 2 PFACTs (0=left, 1=Crout, 2=Right)
+> 1 # of recursive stopping criterium
+> 4 NBMINs (>= 1)
+> 1 # of panels in recursion
+> 2 NDIVs
+> 1 # of recursive panel fact.
+> 1 RFACTs (0=left, 1=Crout, 2=Right)
+> 1 # of broadcast
+> 1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
+> 1 # of lookahead depth
+> 1 DEPTHs (>=0)
+> 2 SWAP (0=bin-exch,1=long,2=mix)
+> 64 swapping threshold
+> 0 L1 in (0=transposed,1=no-transposed) form
+> 0 U in (0=transposed,1=no-transposed) form
+> 1 Equilibration (0=no,1=yes)
+> 8 memory alignment in double (> 0)
+
+Execute the test
+
+    $ ./xhpl
+
+NOTE: [Explanation of HPL.dat parameters](https://ulhpc-tutorials.readthedocs.io/en/latest/parallel/mpi/HPL/#hpl-main-parameters)
+This includes instructions for choosing the maximum viable matrix size based on available memory in node hardware.
+
+NOTE: [Configuring HPL for general node hardware](https://github.com/matthew-li/lbnl_hpl_doc#3-gathering-parameters)
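+
+For a multi-node run outside the benchmark suite, a launch script along the following lines can be submitted with sbatch. This is a minimal sketch, not a tuned example: the job name, node/task counts, and output file are placeholders, and the "module load intelmpi" line (per the IMPORTANT notes above) should be dropped if using OpenMPI.
+
+    #!/bin/bash
+    #SBATCH --job-name=hpl-test
+    #SBATCH --nodes=2
+    #SBATCH --ntasks-per-node=1
+    #SBATCH --output=hpl-%j.out
+
+    # Enable IntelMPI on the node running this script (see IMPORTANT note above)
+    module load intelmpi
+
+    # xhpl reads HPL.dat from the working directory
+    cd ${HOME}/hpl/bin/linux
+    mpirun ./xhpl
+
+NOTE: The P x Q process grid in HPL.dat must match the total task count requested from SLURM (here 2 x 1 = 2).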
+
+# Install Exabyte Benchmark Suite
+
+NOTE: SLURM is already installed on AWS instances, and [no other systems resource management services are supported](https://aws.amazon.com/blogs/hpc/choosing-between-batch-or-parallelcluster-for-hpc/).
+NOTE: AWS offers [a migration guide for older clusters](https://aws.amazon.com/blogs/hpc/easing-your-migration-from-sge-to-slurm-in-aws-parallelcluster-3/).
+NOTE: SLURM provides a "qsub" command that accepts canonical PBS arguments.
+If sbatch is not found, confirm that the cluster was created with the slurm scheduler (see the cluster configuration above).
+
+(1) Install git-lfs
+[Following GitHub directions](https://github.com/git-lfs/git-lfs/blob/main/INSTALLING.md)
+NOTE: Instances have git installed, but git-lfs is required for this project.
+
+    $ curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
+    $ sudo apt install -y git-lfs
+
+(2) Install python3-virtualenv
+NOTE: Instances will have python3 installed, but use of virtualenv is recommended for requirement installs
+
+    $ sudo apt install -y python3-virtualenv
+
+(3) Install project dependencies
+
+    $ apt install -y libpng-dev libfreetype-dev
+
+NOTE: These libraries are already installed on the instance
+
+(4) Clone the benchmark repository
+NOTE: This environment is not expected to be used for development, so a minimal clone will suffice.
+TEMP: Use project from GabrielHare until update PR is merged
+
+    $ git clone --depth 1 --recurse-submodules --shallow-submodules https://github.com/GabrielHare/exabyte-benchmarks-suite.git --single-branch --branch GabrielHare
+    $ cd exabyte-benchmarks-suite
+    $ virtualenv env
+    $ source env/bin/activate
+    $ pip install -r requirements.txt
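+
+OPTIONAL: As a quick sanity check of the environment (a sketch; the matplotlib import is an assumption based on the libpng/freetype dependencies above, and will fail if the requirements use a different plotting stack):
+
+    $ pip check
+    $ python3 -c "import matplotlib; print(matplotlib.__version__)"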
+
+(5) Run the benchmark tests
+NOTE: Only the HPL benchmarks will be run
+ - espresso module is empty
+ - vasp module expects additional files which are missing
+ - gromacs has not been installed
+
+Prepare the benchmarks (generates HPL.dat files)
+
+    $ ./exabench --prepare --type hpl
+
+Execute the tests (will launch EC2 node instances)
+
+    $ ./exabench --execute --type hpl
+
+Wait for queue to empty
+
+    $ watch -n 10 squeue
+
+Collect results when the queue is empty so all benchmark test configurations have executed.
+
+    $ ./exabench --results --type hpl
+
+Results are appended to the results/results.json file, which is
+
+    $ git config diff.lfs.textconv cat
+    $ git diff results/results.json
+
+In the default configuration EC2 node instances will terminate after 10 minutes without activity.
+
+NOTE: Node instances will appear in EC2 to handle each job, up to the cluster maximum.
+NOTE: Standard and error outputs will appear in a log file adjacent to the HPL.dat for each benchmark
+NOTE: [SLURM squeue job status](https://slurm.schedmd.com/squeue.html#lbAG)
+NOTE: [Additional diagnostics using sacctmgr](https://stackoverflow.com/questions/29928925/how-can-i-get-detailed-job-run-info-from-slurm-e-g-like-that-produced-for-sta)
+NOTE: [Additional node boot failure information](https://stackoverflow.com/questions/59074208/obtain-the-boot-and-failure-history-of-nodes-in-a-slurm-cluster)
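+
+For example (standard SLURM accounting commands, not specific to this cluster; they require SLURM accounting to be enabled, which may not be the default here), per-job details can be pulled on the head node with:
+
+    $ sacct --format=JobID,JobName,State,Elapsed,NNodes,ExitCode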
+
+NOTE: The following is a *rare* job failure, and can be resolved simply by retrying the failed job:
+```
+--------------------------------------------------------------------------
+There are not enough slots available in the system to satisfy the 16
+slots that were requested by the application:
+
+xhpl
+
+Either request fewer slots for your application, or make more slots
+available for use.
+
+A "slot" is the Open MPI term for an allocatable unit where we can
+launch a process. The number of slots available are defined by the
+environment in which Open MPI processes are run:
+
+1. Hostfile, via "slots=N" clauses (N defaults to number of
+processor cores if not provided)
+2. The --host command line parameter, via a ":N" suffix on the
+hostname (N defaults to 1 if not provided)
+3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
+4. If none of a hostfile, the --host command line parameter, or an
+RM is present, Open MPI defaults to the number of processor cores
+
+In all the above cases, if you want Open MPI to default to the number
+of hardware threads instead of the number of processor cores, use the
+--use-hwthread-cpus option.
+
+Alternatively, you can use the --oversubscribe option to ignore the
+number of available slots when deciding the number of processes to
+launch.
+--------------------------------------------------------------------------
+```

From c13d7b249860bb5b1bb0d12fae1fd3a82ecbb699 Mon Sep 17 00:00:00 2001
From: Gabriel Hare
Date: Fri, 21 Apr 2023 00:29:44 -0700
Subject: [PATCH 4/5] Rename AWS_HPL_Setup to AWS_HPL_Setup.md

---
 GabrielHare/{AWS_HPL_Setup => AWS_HPL_Setup.md} | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename GabrielHare/{AWS_HPL_Setup => AWS_HPL_Setup.md} (100%)

diff --git a/GabrielHare/AWS_HPL_Setup b/GabrielHare/AWS_HPL_Setup.md
similarity index 100%
rename from GabrielHare/AWS_HPL_Setup
rename to GabrielHare/AWS_HPL_Setup.md

From 74d9d7596e6c43ff1deab042b01c003619add4fd Mon Sep 17 00:00:00 2001
From: Gabriel Hare
Date: Fri, 21 Apr 2023 00:34:34 -0700
Subject: [PATCH 5/5] Formatting changes

---
 GabrielHare/AWS_HPL_Setup.md | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/GabrielHare/AWS_HPL_Setup.md b/GabrielHare/AWS_HPL_Setup.md
index a33f7e8b..cb748540 100644
--- a/GabrielHare/AWS_HPL_Setup.md
+++ b/GabrielHare/AWS_HPL_Setup.md
@@ -1,6 +1,8 @@
 # AWS Parallel Cluster Setup
 [Following general Setup instructions from AWS docs](https://docs.aws.amazon.com/parallelcluster/latest/ug/install-v3-virtual-environment.html)
+
 Local Platform Only
+
 Ubuntu 20.04 (Focal Fossa)
 
 NOTE: Performance improvement option:
@@ -10,7 +12,9 @@ NOTE: Logging option:
 [https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html](https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html)
 
 NOTE: Relevant example use-cases:
+
 [https://sunhwan.github.io/blog/2021/04/17/Run-Molecular-Dynamics-Simulation-on-AWS-Cluster.html](https://sunhwan.github.io/blog/2021/04/17/Run-Molecular-Dynamics-Simulation-on-AWS-Cluster.html)
+
 [https://aws.amazon.com/blogs/hpc/running-20k-simulations-in-3-days-with-aws-batch/](https://aws.amazon.com/blogs/hpc/running-20k-simulations-in-3-days-with-aws-batch/)
 
 ## Install
@@ -51,6 +55,7 @@ Verify:
 
     $ pcluster configure --config [your cluster config].yaml
 
 This will start a command line interface to specify the cluster configuration.
+
 EXAMPLE:
  - One queue, named "queue1"
@@ -298,26 +303,27 @@ Execute the tests (will launch EC2 node instances)
 
     $ ./exabench --execute --type hpl
 
-Wait for queue to empty
+Wait for queue to empty and [watch job status progression](https://slurm.schedmd.com/squeue.html#lbAG)
 
     $ watch -n 10 squeue
 
+NOTE: Node instances will appear in EC2 to handle each job, up to the cluster maximum.
+
 Collect results when the queue is empty so all benchmark test configurations have executed.
 
     $ ./exabench --results --type hpl
 
-Results are appended to the results/results.json file, which is
+Results are appended to the results/results.json file, which is managed by LFS.
 
     $ git config diff.lfs.textconv cat
    $ git diff results/results.json
 
 In the default configuration EC2 node instances will terminate after 10 minutes without activity.
 
-NOTE: Node instances will appear in EC2 to handle each job, up to the cluster maximum.
 NOTE: Standard and error outputs will appear in a log file adjacent to the HPL.dat for each benchmark
-NOTE: [SLURM squeue job status](https://slurm.schedmd.com/squeue.html#lbAG)
 NOTE: [Additional diagnostics using sacctmgr](https://stackoverflow.com/questions/29928925/how-can-i-get-detailed-job-run-info-from-slurm-e-g-like-that-produced-for-sta)
+
 NOTE: [Additional node boot failure information](https://stackoverflow.com/questions/59074208/obtain-the-boot-and-failure-history-of-nodes-in-a-slurm-cluster)
+
 NOTE: The following is a *rare* job failure, and can be resolved simply by retrying the failed job:
 ```
 --------------------------------------------------------------------------