Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved monitoring of replica state during large binlog load #151

Merged
merged 12 commits into from
Dec 24, 2024

Conversation

WithSoull
Copy link
Contributor

Improved monitoring of replica state during large binlog load

Description:

This pull request addresses an issue where the ZooKeeper service mistakenly believed that a replica was dead when it was actually downloading a large binlog file.

To resolve this problem, I added a new field to the replica — NodeState.IsBinlogLoading. This field will allow us to more accurately track the state of the replica and ensure that the service correctly identifies when a replica is in the process of downloading a binlog.

Additionally, I have written unit tests and integration tests that cover the new functionality I have added.

Small fixes

I have also made minor updates to the existing code in the unit tests to improve the readability of the tests.

Finally, I have updated the Dockerfile to reflect the changes made in this pull request. This update removes two warnings and ensures that the latest version of the code is being used in the container.

internal/app/data.go Outdated Show resolved Hide resolved
internal/app/data.go Outdated Show resolved Hide resolved
@WithSoull WithSoull requested a review from teem0n December 23, 2024 12:33
@noname0443
Copy link
Collaborator

I believe it is a good idea to aggregate tests into two scripts, because it reduces consuming resources on creating a container on each test, but I suppose we may have a problem in the future with flapped tests. If one is dead due to context deadline, we have to restart a whole test set for either 5.7 or 8.0.

I think we should separate CI/CD changes in a separate PR and maybe add some restart logic for flapped tests.

@WithSoull
Copy link
Contributor Author

Yes, I think you are right, we should split the PR.

I want to add my thoughts on this.
We need to set some kind of restart limit so that it doesn't go on indefinitely.

@noname0443
Copy link
Collaborator

BTW: In the current view, we launch tests in a sequence. So it can be a good idea to launch them in a parallel way.

@noname0443 noname0443 merged commit 3eeb942 into yandex:master Dec 24, 2024
52 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants