add support for torque in IPMU #36
base: main
Conversation
return "torque" | ||
|
||
|
||
class TorqueProviderI(TorqueProvider): |
I'm interested in what this subclass is for - it looks like you're trying to add a tasks-per-node parameter which would usually end up launching multiple copies of the Parsl worker pool on one node (rather than having one process worker pool manage the whole node). Is this what you're intending / is this actually what happens?
I am pasting the submission script generated by Parsl:
#!/bin/bash
#PBS -N shear.test
#PBS -q small
#PBS -S /bin/bash
#PBS -N parsl.parsl.torque.block-0.1726231023.145446
#PBS -m n
#PBS -l walltime=10:00:00
#PBS -l nodes=2:ppn=12
#PBS -o /work/xiangchong.li/superonionGW/code/image/xlens/tests/xlens/multiband/runinfo/000/submit_scripts/parsl.parsl.torque.block-0.1726231023.145446.submit.stdout
#PBS -e /work/xiangchong.li/superonionGW/code/image/xlens/tests/xlens/multiband/runinfo/000/submit_scripts/parsl.parsl.torque.block-0.1726231023.145446.submit.stderr
source /work/xiangchong.li/setupIm.sh
export JOBNAME="parsl.parsl.torque.block-0.1726231023.145446"
set -e
export CORES=$(getconf _NPROCESSORS_ONLN)
[[ "1" == "1" ]] && echo "Found cores : $CORES"
WORKERCOUNT=24
cat << MPIRUN_EOF > cmd_$JOBNAME.sh
process_worker_pool.py -a gw2.local -p 0 -c 1.0 -m None --poll 10 --task_port=54319 --result_port=54758 --logdir=/work/xiangchong.li/superonionGW/code/image/xlens/tests/xlens/multiband/runinfo/000/torque --block_id=0 --hb_period=30 --hb_threshold=120 --cpu-affinity none --available-accelerators --start-method spawn
MPIRUN_EOF
chmod u+x cmd_$JOBNAME.sh
mpirun -np $WORKERCOUNT /bin/bash cmd_$JOBNAME.sh
[[ "1" == "1" ]] && echo "All workers done"
I added this subclass so that I can change the ppn parameter in the PBS system by setting tasks_per_node in the configuration file. The goal is to use 12 CPUs on each node, with one task per CPU. I am not sure I am doing it in the best way, but the code can be run on the server.
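For reference, a minimal sketch of how such a subclass could work, assuming it forces the ppn value from a configuration-level tasks_per_node; the constructor argument and the submit override below are illustrative assumptions, not necessarily the exact code in this PR.

from parsl.providers import TorqueProvider


class TorqueProviderI(TorqueProvider):
    # Assumed sketch: a TorqueProvider that pins ppn to a configured value.

    def __init__(self, *args, tasks_per_node=12, **kwargs):
        super().__init__(*args, **kwargs)
        self.tasks_per_node = tasks_per_node

    def submit(self, command, tasks_per_node, job_name="parsl.torque"):
        # Ignore the tasks_per_node passed in by the executor and use the
        # configured value instead, so the generated "#PBS -l nodes=N:ppn=..."
        # line requests the desired number of CPUs per node.
        return super().submit(command, self.tasks_per_node, job_name)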
In the usual Parsl model, you'd run one copy of process_worker_pool.py on each node, and that worker pool would be in charge of running multiple tasks at once. The command line you specify has an option -c 1.0, which means 1 core per worker.
So the worker pool code should run as many workers (and so, as many simultaneous tasks) as you have cores on your worker node: that is the code that is in charge of running multiple workers, not mpirun.
Have a look in your run directory (deep inside runinfo/....) for a file called manager.log. You should see one per node (or with your configuration above, 24 per node) and inside those files you should see a log line like this:
2024-09-13 12:54:44.837 parsl:254 72 MainThread [INFO] Manager will spawn 8 workers
How many workers do you see there?
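If it helps, here is a small hedged helper (not part of the PR) to answer that question: it scans the run directory for manager.log files and prints the "Manager will spawn" line from each one. The "runinfo" path is an assumption based on the submit script shown above.

from pathlib import Path

for log in sorted(Path("runinfo").rglob("manager.log")):
    for line in log.read_text().splitlines():
        if "Manager will spawn" in line:
            # One manager.log per worker pool; this shows how many workers
            # each pool decided to start.
            print(f"{log}: {line}")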
You should see one per node (or with your configuration above, 24 per node)

I think in the submission code generated by Parsl, I run 12 tasks per node, each with one CPU, and there are 24 tasks over 2 nodes.
#PBS -l nodes=2:ppn=12
says each node uses 12 cores (therefore 12 tasks running at the same time), while WORKERCOUNT=24 says there are 24 workers across all the nodes.
Note that Slurm does this in a more consistent way, I guess:
https://github.com/Parsl/parsl/blob/dd9150d7ac26b04eb8ff15247b1c18ce9893f79c/parsl/providers/slurm/slurm.py#L266
It has the option to set cores_per_task in addition to tasks_per_node. PBS does not have this option.
In your setup, I think you should make each worker pool try to use only 1 core, so that when you run 12 worker pools per node, you get 1 x 12 = 12 workers on each node. Have a look at the max_workers Parsl configuration parameter - for example, see how it is configured at IN2P3:
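A minimal sketch of that suggestion, assuming a HighThroughputExecutor is used with the existing MpiRunLauncher setup: cap each worker pool at one worker, so 12 pools per node give 12 workers per node. All parameter values here are illustrative, not taken from the PR.

from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.launchers import MpiRunLauncher
from parsl.providers import TorqueProvider

config = Config(
    executors=[
        HighThroughputExecutor(
            label="torque_htex",
            max_workers=1,  # one worker (one core) per worker pool
            provider=TorqueProvider(
                queue="small",
                nodes_per_block=2,
                walltime="10:00:00",
                launcher=MpiRunLauncher(),  # still launches one pool per mpirun slot
            ),
        )
    ]
)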
Yeah.. Got it now. Thanks
There are lots of different ways to change things to get what you want, so it is quite confusing.
You could try this:
i) set the number of nodes in your job to 1 (so if you want to run on multiple nodes, you launch multiple blocks/multiple batch jobs)
ii) use the change you have made in this PR to set tasks_per_node to 12, so that 12 cores are requested in #PBS -l nodes=...
iii) use the SimpleLauncher instead of the MpiRunLauncher here, so that only a single copy of the process worker pool is launched in each batch job, rather than using mpirun to launch many copies of it
iv) tell the process worker pool to use 12 workers per pool, using max_workers = 12.
That should result in batch jobs where each batch job:
- gets 12 cores from PBS
- runs one copy of the Parsl process worker pool
- the process worker pool runs 12 workers
(a configuration sketch along these lines is shown below)
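A hedged configuration sketch of steps i)-iv), assuming a HighThroughputExecutor and the tasks_per_node option added by this PR; the import path for TorqueProviderI and all parameter values are illustrative assumptions.

from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.launchers import SimpleLauncher

from xlens.providers import TorqueProviderI  # hypothetical import path for the PR's subclass

config = Config(
    executors=[
        HighThroughputExecutor(
            label="torque_htex",
            max_workers=12,  # iv) 12 workers per worker pool
            provider=TorqueProviderI(
                queue="small",
                nodes_per_block=1,      # i) one node per block / batch job
                tasks_per_node=12,      # ii) request ppn=12 in "#PBS -l nodes=..."
                walltime="10:00:00",
                launcher=SimpleLauncher(),  # iii) one worker pool per batch job
                max_blocks=2,           # two batch jobs if two nodes are wanted
            ),
        )
    ]
)

With this layout, using a second node means Parsl submits a second block (a second batch job) rather than a single two-node job.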
Thanks. Does the SimpleLauncher support running on two nodes? I thought if I use two nodes, I have to have 2 copies, one on each node, and I thought the copies should be launched with the MpiRunLauncher? Please correct me if this understanding is wrong.
SimpleLauncher does not support running on two nodes.
The model I wrote above has 1 node per block/per batch job - and if you want to use two nodes, set the max_blocks parameter to 2, so that you get two separate batch jobs that look like this.
(I opened a Parsl issue Parsl/parsl#3616 to request that the Parsl team try to make this interface nicer, some time in the future)
Is there anything else I need to do to finish the PR?
@mr-superonion that's probably not for me to say - I was mostly interested in understanding what is missing in Parsl to make this so complicated, and I think I have got that information now in Parsl/parsl#3616 and Parsl/parsl#3617
Hi @mr-superonion . I've seen this, am grateful for your contribution, and will work on getting it incorporated. There are some hoops that I've got to jump through.
…which made me want to clarify this
Sorry, the code before actually did not work: the system put all the pools on one node and the other nodes basically did nothing. Now I have changed the code. @benclifford, please let me know if this setup does not make sense.
This PR adds support for Torque and makes sure the code can be run on the servers at IPMU (for the HSC project).
Checklist
doc/changes