Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update the disaster recovery page to address feedback #2026

Merged
merged 2 commits into from
Dec 20, 2024
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
32 changes: 19 additions & 13 deletions modules/ROOT/pages/clustering/disaster-recovery.adoc
Original file line number Diff line number Diff line change
Expand Up @@ -30,14 +30,14 @@ In this guide the following terms are used:

* An _offline_ server is a server that is not running but may be restartable.
* A _lost_ server, however, is a server that is currently not running and cannot be restarted.
* A _write-available_ database is able to serve writes, while a _write unavailable_ database is not.
* A _write-available_ database is able to serve writes, while a _write-unavailable_ database is not.
====

There are four steps to recovering a cluster from a disaster:

. Start the Neo4j process on all servers which are not _lost_.
See xref:start-the-neo4j-process[Start the Neo4j process] for more information.
. Make the `system` database able to accept write operations, so that the cluster can be modified.
. Make the `system` database able to serve write operations, so that the cluster can be modified.
See xref:make-the-system-database-write-available[Make the `system` database write-available] for more information.
. Detach any potential lost servers from the cluster and replace them by new ones.
See xref:make-servers-available[Make servers available] for more information.
Expand Down Expand Up @@ -85,14 +85,14 @@ The server may have to be considered indefinitely lost.

==== Objective
====
The `system` database is able to accept write operations.
The `system` database is able to serve write operations.
====

The `system` database contains the view of the cluster.
This includes which servers and databases are present, where they live and how they are configured.
During a disaster, the view of the cluster might need to change to reflect a new reality, such as removing lost servers.
Databases might also need to be recreated to regain write availability.
Because both of these steps are executed by modifying the `system` database, making the `system` database write-enabled is a vital first step during disaster recovery.
Because both of these steps are executed by modifying the `system` database, making the `system` database write-available is a vital first step during disaster recovery.

==== Verifying the state

Expand Down Expand Up @@ -143,7 +143,8 @@ Be aware that not replacing servers can cause cluster overload when databases ar
=====
+
. On each server, run `bin/neo4j-admin database load system --from-path=[path-to-dump] --overwrite-destination=true` to load the current `system` database dump.
. On each server, ensure that the discovery settings are correct, see xref:clustering/setup/discovery.adoc[Cluster server discovery] for more information.
. On each server, ensure that the discovery settings are correct.
See xref:clustering/setup/discovery.adoc[Cluster server discovery] for more information.
. Start the Neo4j process on all servers.
====

Expand All @@ -162,7 +163,8 @@ Therefore, informing the cluster of servers which are lost is not enough.
The databases hosted on lost servers also need to be moved onto available servers in the cluster, before the lost servers can be removed.

==== Verifying the state
The cluster's view of servers can be seen by listing the servers, see xref:clustering/servers.adoc#_listing_servers[Listing servers] for more information.
The cluster's view of servers can be seen by listing the servers.
See xref:clustering/servers.adoc#_listing_servers[Listing servers] for more information.
The state has been verified if *all* servers show `health` = `Available` and `status` = `Enabled`.

[source, cypher]
Expand All @@ -173,7 +175,7 @@ SHOW SERVERS;
==== Path to correct state
Use the following steps to remove lost servers and add new ones to the cluster.
To remove lost servers, any allocations they were hosting must be moved to available servers in the cluster.
This can be done in two different ways:
This is done in two different steps:

* Any allocations that cannot move by themselves require the database to be recreated so that they are forced to move.
* Any allocations that can move will be instructed to do so by deallocating the server.
Expand Down Expand Up @@ -208,7 +210,8 @@ A database can be set to `READ-ONLY` before it is started to avoid updates on th
`ALTER DATABASE database-name SET ACCESS READ ONLY`.
=====

. On each server, run `CALL dbms.cluster.statusCheck([])` to check the write availability for all databases running in primary mode on this server, see xref:clustering/monitoring/status-check.adoc#monitoring-replication[Monitoring replication] for more information.
. On each server, run `CALL dbms.cluster.statusCheck([])` to check the write availability for all databases running in primary mode on this server.
See xref:clustering/monitoring/status-check.adoc#monitoring-replication[Monitoring replication] for more information.
+
[NOTE]
=====
Expand All @@ -218,7 +221,8 @@ Instead, check that the primary is allocated on an available server and that it

. For each database that is not write-available, recreate it to move it from lost servers and regain write availability.
Go to xref:clustering/databases.adoc#recreate-databases[Recreate databases] for more information about recreate options.
Remember to make sure there are recent backups for the databases before recreating them, see xref:backup-restore/online-backup.adoc[Online backup] for more information.
Remember to make sure there are recent backups for the databases before recreating them.
See xref:backup-restore/online-backup.adoc[Online backup] for more information.
If any database has `currentStatus` = `quarantined` on an available server, recreate them from backup using xref:clustering/databases.adoc#uri-seed[Backup as seed].
+
[CAUTION]
Expand Down Expand Up @@ -278,16 +282,18 @@ For the stricter check, run `SHOW DATABASES` and verify that `requestedStatus` =

==== Path to correct state
Use the following steps to make all databases in the cluster write-available again.
They include recreating any databases that are not write-capable and identifying any recreations that will not complete.
They include recreating any databases that are not write-available and identifying any recreations that will not complete.
Recreations might fail for different reasons, but one example is that the checksums do not match for the same transaction on different servers.

.Guide
[%collapsible]
====
. Identify all write unavailable databases by running `CALL dbms.cluster.statusCheck([])` as described in the xref:clustering/disaster-recovery.adoc#example-verification[Example verification] part of this disaster recovery step.
. Identify all write-unavailable databases by running `CALL dbms.cluster.statusCheck([])` as described in the xref:clustering/disaster-recovery.adoc#example-verification[Example verification] part of this disaster recovery step.
Filter out all databases desired to be stopped, so that they are not recreated unnecessarily.
. Recreate every database that is not write-available and has not been recreated previously, see xref:clustering/databases.adoc#recreate-databases[Recreate databases] for more information.
Remember to make sure there are recent backups for the databases before recreating them, see xref:backup-restore/online-backup.adoc[Online backup] for more information.
. Recreate every database that is not write-available and has not been recreated previously.
See xref:clustering/databases.adoc#recreate-databases[Recreate databases] for more information.
Remember to make sure there are recent backups for the databases before recreating them.
See xref:backup-restore/online-backup.adoc[Online backup] for more information.
If any database has `currentStatus` = `quarantined` on an available server, recreate them from backup using xref:clustering/databases.adoc#uri-seed[Backup as seed].
+
[CAUTION]
Expand Down
Loading