Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARTEMIS-4325 - Ability to failback after failover #4522

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

AntonRoskvist
Copy link
Contributor

Opening for feedback as this might not be the right solution. It does however demonstrate the functionality I'm after.

If the feature makes sense on a high level, please let me know if the implementation needs any changes or can be made better in some way.

Use case detailed in: https://issues.apache.org/jira/browse/ARTEMIS-4325

@brusdev
Copy link
Member

brusdev commented Jun 23, 2023

This feature makes sense to me if your are not using HA (live/backup pairs), have you already considered waiting for a topology update to know when a node is up?
https://github.com/apache/activemq-artemis/blob/main/artemis-core-client/src/main/java/org/apache/activemq/artemis/core/client/impl/ClientSessionFactoryImpl.java#L1567

@jbertram
Copy link
Contributor

I had the same thought about using the topology update to know when the "original" node came back. I think that would be simpler than manually checking and it would save a thread as well.

@jbertram jbertram marked this pull request as draft June 26, 2023 17:50
@AntonRoskvist
Copy link
Contributor Author

@brusdev
@jbertram

Thanks for your feedback, much appreciated!

In the case of failback between HA live/backup-pairs, that should be covered already. Since failback would only follow a failover which would only happen if the retry-logic failed first i.e both live and backup servers are unavailable.

I think failing over (and back) between different live/backup pairs in a scenario like this make some sense, as long as it doesn't happen internally within the groups. If not, would it make sense to add some additional check to inhibit failback between live-backup-pairs?

I might have missed something with the topology listener though, it would be nice to have this happen without having to "ping" the old node... but would this work if the nodeID of the broker changes (say the broker cluster is part of a blue/green deploy or similar, such that FQDN/IP remains the same but the underlying broker is replaced by one started on a new journal)?

I can give that a try regardless, but it might take some time due to vacation.

Br,
Anton

@jbertram
Copy link
Contributor

Isn't the use-case in view here live-to-live failback after failover (implemented recently via ARTEMIS-4251)? I'd leave live/backup pairs out of this completely since there are already semantics in place for that use-case.

would this work if the nodeID of the broker changes (say the broker cluster is part of a blue/green deploy or similar, such that FQDN/IP remains the same but the underlying broker is replaced by one started on a new journal)?

The org.apache.activemq.artemis.api.core.client.TopologyMember has the node ID as well as the TransportConfiguration which you should be able to use to identify host details like name and/or IP address. There are also the isMember methods which may be useful in this use-case.

@AntonRoskvist
Copy link
Contributor Author

@jbertram @brusdev

Hello, back again from vacation.

I made some changes as per our discussion and rebased against main. I'm still running the failback check in a separate thread since I didn't want to potentially stall/lock the TopologyHandler but now it's not starting until the node should be available (or available soon). Let me know what you think and see if there are any other changes that should be made here.

@AntonRoskvist AntonRoskvist marked this pull request as ready for review August 30, 2023 18:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants