Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: new wait for idle #1219

Closed
wants to merge 8 commits into from
Closed

Conversation

dimaqq
Copy link
Contributor

@dimaqq dimaqq commented Nov 26, 2024

Pull the new Model.wait_for_idle implementation from #1104

  • pull in the unit tests
  • pull in the implementation
  • new unit tests pass
  • all unit tests pass
  • feature flag
  • run integration tests with and without the feature flag
  • implement idle/ready distinction properly
  • merge chore: remove juju.loop, deprecated before 3.0 #1242 and rebase

@dimaqq
Copy link
Contributor Author

dimaqq commented Dec 4, 2024

@james-garner-canonical I'd appreciate an early review.

I'm expecting integration tests to pass 🎉

There's still some refactoring to be done... and I'm happy to hear your thoughts about that and areas to improve in general.

Copy link
Contributor

@james-garner-canonical james-garner-canonical left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, so much nicer than the previous implementation. Really cool to have the tests passing on both. Was wait_for_idle tested much previously?

juju/model.py Outdated Show resolved Hide resolved
juju/model.py Outdated Show resolved Hide resolved
juju/model.py Outdated Show resolved Hide resolved
juju/model.py Outdated Show resolved Hide resolved
tests/unit/test_wait_for_idle.py Outdated Show resolved Hide resolved
tests/unit/test_wait_for_idle.py Outdated Show resolved Hide resolved
juju/unit.py Show resolved Hide resolved
juju/client/facade.py Show resolved Hide resolved
@dimaqq dimaqq force-pushed the feat-new-wait-for-idle branch 2 times, most recently from 810eb09 to 9a36b7a Compare December 11, 2024 08:21
@dimaqq
Copy link
Contributor Author

dimaqq commented Dec 11, 2024

@james-garner-canonical the big refactor is done, I hope you'll like the result :)

still a draft, as some unit tests are missing and I have one internal concern, but should be ready for review :)

Copy link
Contributor

@james-garner-canonical james-garner-canonical left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked at the tests (though I noticed the addition of freezegun), but I've left some comments on the new implementation. My main concern is that the generator/coroutine business would be a lot easier (for me) to follow if we used a regular class/object instead (see comment on def _loop) -- it took me a while to get my head around. I do like the separation of the time-based checks. Lmk if you want me to look at anything else in particular

juju/model/idle.py Outdated Show resolved Hide resolved
juju/model/__init__.py Outdated Show resolved Hide resolved
juju/model/idle.py Outdated Show resolved Hide resolved
juju/model/idle.py Outdated Show resolved Hide resolved
juju/model/idle.py Outdated Show resolved Hide resolved
juju/model/idle.py Outdated Show resolved Hide resolved
juju/model/__init__.py Show resolved Hide resolved
juju/model/__init__.py Outdated Show resolved Hide resolved
juju/model/__init__.py Outdated Show resolved Hide resolved
@dimaqq dimaqq force-pushed the feat-new-wait-for-idle branch from 9a36b7a to 3bcf321 Compare December 17, 2024 05:59
@dimaqq
Copy link
Contributor Author

dimaqq commented Dec 17, 2024

@james-garner-canonical I've refactored the loop to be an [async] filter.

It turns out that I've misunderstood the definition of "idle".

In short:

  • ready units: depends on workload (and possibly agent, machine, etc.)
  • idle units: depends on agent only

This means that I have to update the idle.check code.

Edit: done now.

@dimaqq dimaqq force-pushed the feat-new-wait-for-idle branch from 6710b0c to b6fbb0c Compare December 17, 2024 08:48
@dimaqq dimaqq force-pushed the feat-new-wait-for-idle branch from 88cd329 to 065d3de Compare December 18, 2024 03:15
@dimaqq
Copy link
Contributor Author

dimaqq commented Dec 18, 2024

@james-garner-canonical idle implemented, most comments addressed, tests pass again. please review :)

@dimaqq dimaqq marked this pull request as ready for review December 18, 2024 06:01
juju/model/_idle.py Outdated Show resolved Hide resolved
Copy link
Contributor

@james-garner-canonical james-garner-canonical left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exciting updates, glad tests are passing now.

I'm finding it hard to follow the logical changes between versions on this PR, I think primarily because the generator adds a fair bit of mental overhead for me -- it's kind of nicer now (though the new outer async loop construct took me some time to digest), but I still have to put in more mental work to follow the various yield ... breaks and yield ... continues.

For now I've made some requests for changes to make it easier for me to follow what's going on. Sorry if that seems like wasted effort -- it did take me a while, but I kind of had to do it anyway to follow the code.

Let me know if you want me to look closer at specific logical changes (looks like there are changes to the idle_since logic?). I still haven't looked in detail at your tests either, would you like further review there?

Also, what's the latest with running charmers' integration tests on the latest version? Are there still failures with the new wait_for_idle that pass with the old one?

Comment on lines +3304 to +3320
async def status_on_demand():
while True:
yield _idle.check(
await self.get_status(),
apps=apps,
raise_on_error=raise_on_error,
raise_on_blocked=raise_on_blocked,
status=status,
)

async for done in _idle.loop(
status_on_demand(),
apps=apps,
wait_for_exact_units=wait_for_exact_units,
wait_for_units=wait_for_units,
idle_period=idle_period,
):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: simplify this for noobs like me ... btw at first I thought this could be simplified without changing anything else, but after making it to _idle.loop I realised that my original simplification was totally incorrect, which suggests that the current version is a bit tricky to follow ... the below suggestion assumes a later suggestion to refactor _idle.loop to use a class, which I think makes things a lot easier to follow -- it certainly does for me

Suggested change
async def status_on_demand():
while True:
yield _idle.check(
await self.get_status(),
apps=apps,
raise_on_error=raise_on_error,
raise_on_blocked=raise_on_blocked,
status=status,
)
async for done in _idle.loop(
status_on_demand(),
apps=apps,
wait_for_exact_units=wait_for_exact_units,
wait_for_units=wait_for_units,
idle_period=idle_period,
):
timing_checker = _idle.TimingChecker()
while True:
full_status = await self.get_status()
check_status = _idle.check(
raw_status,
apps=apps,
raise_on_error=raise_on_error,
raise_on_blocked=raise_on_blocked,
status=status,
)
done = timing_checker.check(
check_status,
apps=apps,
wait_for_exact_units=wait_for_exact_units,
wait_for_units=wait_for_units,
idle_period=idle_period,
)

Comment on lines +130 to +137
if unit.machine:
machine = full_status.machines[unit.machine]
assert isinstance(machine, MachineStatus)
assert machine.instance_status
if machine.instance_status.status == "error" and raise_on_error:
raise JujuMachineError(
f"{unit_name!r} machine {unit.machine!r} has errored: {machine.instance_status.info!r}"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: make it clear up front that we're only interested in unit.machine -- there's no elif/else coming

Suggested change
if unit.machine:
machine = full_status.machines[unit.machine]
assert isinstance(machine, MachineStatus)
assert machine.instance_status
if machine.instance_status.status == "error" and raise_on_error:
raise JujuMachineError(
f"{unit_name!r} machine {unit.machine!r} has errored: {machine.instance_status.info!r}"
)
if not unit.machine:
continue
machine = full_status.machines[unit.machine]
assert isinstance(machine, MachineStatus)
assert machine.instance_status
if machine.instance_status.status == "error" and raise_on_error:
raise JujuMachineError(
f"{unit_name!r} machine {unit.machine!r} has errored: {machine.instance_status.info!r}"
)


for app_name in apps:
units.update(_app_units(full_status, app_name))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: I really like how the loops below all do one thing, much nicer to read and follow than the original wait_for_idle. I notice that the first few only do anything if raise_on_error, while the next couple only do something if raise_on_blocked. How about something like:

    if raise_on_error:
        _check_for_errors(full_status, apps, units)
    if raise_on_blocked:
        _check_for_blocked(full_status, apps, units)

if app.status.status == "blocked" and raise_on_blocked:
raise JujuAppError(f"{app_name!r} is blocked: {app.status.info!r}")

rv = CheckStatus(set(), set(), set())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor suggestion: in the CheckStatus definition, use = field(default_factory=set). Reasoning: I think rv = CheckStatus() will look more natural to readers (like me =))

Comment on lines +35 to +98
async def loop(
foo: AsyncIterable[CheckStatus | None],
*,
apps: AbstractSet[str],
wait_for_exact_units: int | None = None,
wait_for_units: int,
idle_period: float,
) -> AsyncIterable[bool]:
"""The outer, time-dependents logic of a wait_for_idle loop."""
idle_since: dict[str, float] = {}

async for status in foo:
logger.info("wait_for_idle iteration %s", status)
now = time.monotonic()

if not status:
yield False
continue

expected_idle_since = now - idle_period

# FIXME there's some confusion about what a "busy" unit is
# are we ready when over the last idle_period, every time sampled:
# a. >=N units were ready (possibly different each time), or
# b. >=N units were ready each time
for name in status.units:
if name in status.idle_units:
idle_since[name] = min(now, idle_since.get(name, float("inf")))
else:
idle_since[name] = float("inf")

if busy := {n for n, t in idle_since.items() if t > expected_idle_since}:
logger.info("Waiting for units to be idle enough: %s", busy)
yield False
continue

for app_name in apps:
ready_units = [
n for n in status.ready_units if n.startswith(f"{app_name}/")
]
if len(ready_units) < wait_for_units:
logger.info(
"Waiting for app %r units %s >= %s",
app_name,
len(status.ready_units),
wait_for_units,
)
yield False
break

if (
wait_for_exact_units is not None
and len(ready_units) != wait_for_exact_units
):
logger.info(
"Waiting for app %r units %s == %s",
app_name,
len(ready_units),
wait_for_exact_units,
)
yield False
break
else:
yield True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: I really think that a class to encapsulate the state, and a single synchronous method call call with returns, is so much easier to follow than the async generator, and it seems like it's pretty straightforwardly logically equivalent -- just a lot easier to follow for noobs like me

Suggested change
async def loop(
foo: AsyncIterable[CheckStatus | None],
*,
apps: AbstractSet[str],
wait_for_exact_units: int | None = None,
wait_for_units: int,
idle_period: float,
) -> AsyncIterable[bool]:
"""The outer, time-dependents logic of a wait_for_idle loop."""
idle_since: dict[str, float] = {}
async for status in foo:
logger.info("wait_for_idle iteration %s", status)
now = time.monotonic()
if not status:
yield False
continue
expected_idle_since = now - idle_period
# FIXME there's some confusion about what a "busy" unit is
# are we ready when over the last idle_period, every time sampled:
# a. >=N units were ready (possibly different each time), or
# b. >=N units were ready each time
for name in status.units:
if name in status.idle_units:
idle_since[name] = min(now, idle_since.get(name, float("inf")))
else:
idle_since[name] = float("inf")
if busy := {n for n, t in idle_since.items() if t > expected_idle_since}:
logger.info("Waiting for units to be idle enough: %s", busy)
yield False
continue
for app_name in apps:
ready_units = [
n for n in status.ready_units if n.startswith(f"{app_name}/")
]
if len(ready_units) < wait_for_units:
logger.info(
"Waiting for app %r units %s >= %s",
app_name,
len(status.ready_units),
wait_for_units,
)
yield False
break
if (
wait_for_exact_units is not None
and len(ready_units) != wait_for_exact_units
):
logger.info(
"Waiting for app %r units %s == %s",
app_name,
len(ready_units),
wait_for_exact_units,
)
yield False
break
else:
yield True
class TimingChecker:
"""The outer, time-dependent logic of a wait_for_idle loop."""
def __init__(self):
self.idle_since: dict[str, float] = {}
def check(
status: CheckStatus,
*
apps: AbstractSet[str],
wait_for_exact_units: int | None = None,
wait_for_units: int,
idle_period: float,
) -> bool:
logger.info("wait_for_idle iteration %s", status)
now = time.monotonic()
if not status:
return False
expected_idle_since = now - idle_period
# FIXME there's some confusion about what a "busy" unit is
# are we ready when over the last idle_period, every time sampled:
# a. >=N units were ready (possibly different each time), or
# b. >=N units were ready each time
for name in status.units:
if name in status.idle_units:
self.idle_since[name] = min(now, idle_since.get(name, float("inf")))
else:
self.idle_since[name] = float("inf")
if busy := {n for n, t in self.idle_since.items() if t > expected_idle_since}:
logger.info("Waiting for units to be idle enough: %s", busy)
return False
for app_name in apps:
ready_units = [
n for n in status.ready_units if n.startswith(f"{app_name}/")
]
if len(ready_units) < wait_for_units:
logger.info(
"Waiting for app %r units %s >= %s",
app_name,
len(status.ready_units),
wait_for_units,
)
return False
if (
wait_for_exact_units is not None
and len(ready_units) != wait_for_exact_units
):
logger.info(
"Waiting for app %r units %s == %s",
app_name,
len(ready_units),
wait_for_exact_units,
)
return False
return True

@dimaqq dimaqq mentioned this pull request Dec 19, 2024
@dimaqq
Copy link
Contributor Author

dimaqq commented Dec 19, 2024

I've pushed a separate branch with this refactor only: #1245 PTAL

@dimaqq
Copy link
Contributor Author

dimaqq commented Dec 19, 2024

closing in favour of #1245

@dimaqq dimaqq closed this Dec 19, 2024
jujubot added a commit that referenced this pull request Dec 20, 2024
#1245

Same as #1219 but using dumb classes instead of async generator.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants