
20221102 meeting

Aurelien Bouteiller edited this page Nov 11, 2022 · 2 revisions

WG meeting Nov 2, 2022

Agenda items

Coalesce the discussion about cancelling allocation requests that has been going on in multiple venues (e.g., MPI Session WG, etc.).

Discussion

  • We reviewed the current allocation interface to reacquaint ourselves with the distinction between PMIX_ALLOC_REQ_ID (user-supplied) and PMIX_ALLOC_ID (PMIx-produced).

  • Expected behavior when a request is cancelled: the party that cancels gets PMIX_SUCCESS (or the equivalent if async), and the party that issued the allocation request gets PMIX_ERR_CANCELLED.
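The expected behavior above can be illustrated with a small toy model. This is only a sketch of the semantics discussed, not the PMIx API: `ToyAllocator`, its methods, and the request ID `"my-req-1"` are hypothetical names, and the status values merely reuse the constant names from the notes.

```python
from enum import Enum


class Status(Enum):
    # Names mirror the constants mentioned in the notes; this is not PMIx itself.
    PMIX_SUCCESS = "PMIX_SUCCESS"
    PMIX_ERR_CANCELLED = "PMIX_ERR_CANCELLED"


class ToyAllocator:
    """Hypothetical stand-in for whatever services allocation requests."""

    def __init__(self):
        # Maps the user-supplied request ID to the requester's completion callback.
        self.pending = {}

    def request(self, req_id, on_done):
        # Record the request; it stays pending until completed or cancelled.
        self.pending[req_id] = on_done

    def cancel(self, req_id):
        # Assumes the request is still pending (the simple case in this bullet).
        on_done = self.pending.pop(req_id)
        on_done(Status.PMIX_ERR_CANCELLED)  # what the requester observes
        return Status.PMIX_SUCCESS          # what the canceller observes


requester_saw = []
alloc = ToyAllocator()
alloc.request("my-req-1", requester_saw.append)
canceller_saw = alloc.cancel("my-req-1")
```

Here `canceller_saw` holds the status returned to the cancelling party, while `requester_saw` collects what the requesting party is told about its own request.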

Issues related to distributed cancellation (e.g., cancellation from another node)

What to do if the cancellation arrives before the allocation request is started at another node

Multiple possible options were considered:

  1. The PMIX_ALLOC_ID is used to cancel. This is not workable, because the PMIX_ALLOC_ID is only returned once the allocation is complete; cancellation must therefore use the PMIX_ALLOC_REQ_ID.
  2. The cancellation is 'cached' and will cancel any future request that matches the PMIX_ALLOC_REQ_ID. We don't like this, because it creates global shared state that would itself need to be cancelled somehow. Ugly and problematic.
  3. The cancellation is ignored, because it doesn't match any current allocation request. This is simple, and users can add their own synchronization if the use case matters to them. We also believe that, in general, the cancellation will be issued from the same location as the allocation, which makes this scenario less relevant. Overall, this looks like the adequate solution.

The expected outcome is that the allocation request (if one is posted later) will succeed, and the cancellation will return PMIX_ERR_NOT_FOUND (or the appropriate error code).

What if the cancellation comes after the allocation has completed

We had a similar discussion for the case where the cancellation arrives after the allocation has completed (that is, it has returned PMIX_SUCCESS, or will do so when probed). We think the same reasoning applies, and again the cancel should return PMIX_ERR_NOT_FOUND.
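Both out-of-order cases (cancel too early, cancel too late) can be sketched with a toy model in which a cancellation only ever matches a currently pending request. Everything here is an illustrative assumption rather than PMIx code: `ToyScheduler`, its methods, the `"job-a"` request ID, and the `alloc-N` identifier format are hypothetical, and the status strings simply reuse the constant names from the notes.

```python
PMIX_SUCCESS = "PMIX_SUCCESS"
PMIX_ERR_CANCELLED = "PMIX_ERR_CANCELLED"
PMIX_ERR_NOT_FOUND = "PMIX_ERR_NOT_FOUND"


class ToyScheduler:
    """Hypothetical model: cancellations are matched against pending requests only."""

    def __init__(self):
        self.pending = {}        # user-supplied request ID -> requester's result list
        self._next_alloc_id = 0  # source of made-up allocation IDs

    def request(self, req_id):
        result = []              # filled in when the request completes or is cancelled
        self.pending[req_id] = result
        return result

    def complete(self, req_id):
        # The allocation ID only exists once the allocation is complete.
        result = self.pending.pop(req_id)
        self._next_alloc_id += 1
        result.append((PMIX_SUCCESS, f"alloc-{self._next_alloc_id}"))

    def cancel(self, req_id):
        result = self.pending.pop(req_id, None)
        if result is None:
            # Nothing pending under this ID: too early or too late. No state is cached.
            return PMIX_ERR_NOT_FOUND
        result.append((PMIX_ERR_CANCELLED, None))
        return PMIX_SUCCESS


sched = ToyScheduler()
early = sched.cancel("job-a")     # arrives before any matching request exists
outcome = sched.request("job-a")  # the later request is unaffected by the early cancel
sched.complete("job-a")
late = sched.cancel("job-a")      # arrives after the allocation has completed
```

Because `cancel` keeps no record of unmatched IDs, the early cancellation cannot affect the later request, which is the property that makes option 3 free of shared state.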

What should the API look like

We briefly discussed whether the API should take the form of a new function or of new attribute keys for existing job-control operations. This has not been fully resolved yet.

Timetable

This is too late for v5; it may make it into v5.1.

Next meeting

I believe we said Dec 7th. Please correct if wrong.