Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Entry Update and Event Creation Can Possibly Drift. Consider periodic full-db reconciliation of cache #4973

Open
stevend-uber opened this issue Mar 13, 2024 · 8 comments
Labels
priority/backlog Issue is approved and in the backlog unscoped The issue needs more design or understanding in order for the work to progress

Comments

@stevend-uber
Copy link
Contributor

stevend-uber commented Mar 13, 2024

In the event that there occurs drift between in-mem event-based cache and entries/agents tables, we should consider a configurable periodic method that does a full reconciliation of the in-mem cache with the db.

@stevend-uber stevend-uber changed the title Entry Update and Event Creation should be a single transaction Entry Update and Event Creation Can Possibly Drift. Consider periodic full-db reconciliation of cache Mar 13, 2024
@evan2645 evan2645 added the triage/in-progress Issue triage is in progress label Mar 13, 2024
@faisal-memon
Copy link
Contributor

I believe kubernetes does something similar to this with reconciling controllers

@evan2645 evan2645 added priority/backlog Issue is approved and in the backlog unscoped The issue needs more design or understanding in order for the work to progress and removed triage/in-progress Issue triage is in progress labels Mar 19, 2024
@evan2645
Copy link
Member

Thanks for opening this @stevend-uber, definitely sounds like there's something we need to fix here. Are you able to share more information about what you bumped into, if anything?

@stevend-uber
Copy link
Contributor Author

In the 1.8.7 release of the events-based cache, we saw that the pruned entries weren't being updated in the cache. And although that is a "one time temporary bug", cache drift can occur for any number of reasons. Defensively, a configurable reconciliation loop should prevent such from drifting too much.

@faisal-memon
Copy link
Contributor

Please assign this to me.

@edwbuck
Copy link
Contributor

edwbuck commented Mar 25, 2024

@stevend-uber in 1.8.7 release, there were bugs in the events-based cache, because the implementation never considered the mix of enabled/non-enabled systems that Uber ran. Those items were quickly fixed, prior to the 1.9.0 release.

Do you have evidence of a specific failure since those fixes were applied? It seems to me that doing this "just in case" an event is missed falls into the same category as doing this "just in case" you missed reading a vanilla registration entry, as they are both read from the same database.

If there is a know failure, we need to capture it and fix it. If there is no known failure, then it is not clear why this code would be desired.

To clarify further. If no failure scenario is observed, this could potentially induce one, for the systems that cannot easily reconcile their entire database into hydration within one time slice of cache updates. If we offer this feature, please consider having it disabled by default.

@rturner3
Copy link
Collaborator

Just to share some more context, Uber has seen a bug with the event cache feature for a very small percentage of entries. That issue motivated this suggestion. Reported this bug as a separate issue: #5021

@edwbuck
Copy link
Contributor

edwbuck commented Apr 5, 2024

mentioning #4985 as this issue may be blocker. If fixes to #5021 likely would make this effort unneeded, please indicate it here.

@edwbuck
Copy link
Contributor

edwbuck commented Apr 12, 2024

@rturner3 I believe that the fix under issue #5021 closes the means by which the cache and the database drifted at Uber. @faisal-memon has a branch that he's about to make into a merge request, it only has one small item he was working on when we talked last.

Please consider what you would like to do with this Issue if #5021 covers the cache synchronization errors at Uber.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority/backlog Issue is approved and in the backlog unscoped The issue needs more design or understanding in order for the work to progress
Projects
None yet
Development

No branches or pull requests

5 participants