-
Notifications
You must be signed in to change notification settings - Fork 485
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Entry Update and Event Creation Can Possibly Drift. Consider periodic full-db reconciliation of cache #4973
Comments
I believe kubernetes does something similar to this with reconciling controllers |
Thanks for opening this @stevend-uber, definitely sounds like there's something we need to fix here. Are you able to share more information about what you bumped into, if anything? |
In the 1.8.7 release of the events-based cache, we saw that the pruned entries weren't being updated in the cache. And although that is a "one time temporary bug", cache drift can occur for any number of reasons. Defensively, a configurable reconciliation loop should prevent such from drifting too much. |
Please assign this to me. |
@stevend-uber in 1.8.7 release, there were bugs in the events-based cache, because the implementation never considered the mix of enabled/non-enabled systems that Uber ran. Those items were quickly fixed, prior to the 1.9.0 release. Do you have evidence of a specific failure since those fixes were applied? It seems to me that doing this "just in case" an event is missed falls into the same category as doing this "just in case" you missed reading a vanilla registration entry, as they are both read from the same database. If there is a know failure, we need to capture it and fix it. If there is no known failure, then it is not clear why this code would be desired. To clarify further. If no failure scenario is observed, this could potentially induce one, for the systems that cannot easily reconcile their entire database into hydration within one time slice of cache updates. If we offer this feature, please consider having it disabled by default. |
Just to share some more context, Uber has seen a bug with the event cache feature for a very small percentage of entries. That issue motivated this suggestion. Reported this bug as a separate issue: #5021 |
@rturner3 I believe that the fix under issue #5021 closes the means by which the cache and the database drifted at Uber. @faisal-memon has a branch that he's about to make into a merge request, it only has one small item he was working on when we talked last. Please consider what you would like to do with this Issue if #5021 covers the cache synchronization errors at Uber. |
In the event that there occurs drift between in-mem event-based cache and entries/agents tables, we should consider a configurable periodic method that does a full reconciliation of the in-mem cache with the db.
The text was updated successfully, but these errors were encountered: