Internals of clustering

Elections and leadership

The Core instances used as Primary servers in a cluster use the Raft protocol to ensure consistency and safety. See Advanced Causal Clustering for more information on the Raft protocol. An implementation detail of Raft is that it uses a Leader role to impose an ordering on an underlying log with other instances acting as Followers which replicate the leader’s state. Specifically in Neo4j, this means that writes to the database are ordered by the Core instance currently playing the Leader role for the respective database. If a Neo4j DBMS cluster contains multiple databases, each one of those databases operates within a logically separate Raft group, and therefore each has an individual leader. This means that a Core instance may act both as Leader for some databases, and as Follower for other databases.
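To see which instance currently holds the Leader role for each database, you can, for example, run SHOW DATABASES against any cluster member; in a cluster, the result contains one row per database per instance, including a role column. The addresses shown in the sketch below are illustrative:

    SHOW DATABASES;
    // Illustrative, abbreviated output:
    //   name     address          role       currentStatus
    //   neo4j    server01:7687    leader     online
    //   neo4j    server02:7687    follower   online
    //   system   server02:7687    leader     online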

If a follower has not heard from the leader for a while, then it can initiate an election and attempt to become the new leader. The follower makes itself a Candidate and asks the other Cores to vote for it. If it gets a majority of the votes, then it assumes the leader role. A Core does not vote for a candidate that is less up-to-date than itself. There can only be one leader at any time per database, and that leader is guaranteed to have the most up-to-date log.

Elections are expected to occur during the normal running of a cluster and they do not pose an issue in and of themselves. If you are experiencing frequent re-elections and they are disturbing the operation of the cluster, then you should try to figure out what is causing them. Common causes are environmental issues (e.g. an unreliable network) and work overload conditions (e.g. more concurrent queries and transactions than the hardware can handle).
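If frequent elections are caused by transient environmental issues, one mitigation is to lengthen the election timeout so that followers wait longer before calling an election. A minimal neo4j.conf sketch, assuming the causal_clustering.leader_election_timeout setting; the value below is illustrative:

    # Followers wait this long without hearing from the leader before starting an election.
    causal_clustering.leader_election_timeout=7s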

Leadership balancing

Write transactions will always be routed to the leader for the respective database. As a result, unevenly distributed leaderships may cause write queries to be disproportionately directed to a subset of instances. By default, Neo4j avoids this by automatically transferring database leaderships so that they are evenly distributed throughout the cluster. Additionally, Neo4j will automatically transfer database leaderships away from instances where those databases are configured to be read-only using dbms.databases.read_only or similar.
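For example, to make a database read-only on an instance, so that any leadership it holds for that database is transferred away, something like the following neo4j.conf entry can be used. The database name is illustrative, and the setting assumes a Neo4j version that supports per-database read-only mode:

    # Disallow writes to the listed database(s) on this instance.
    dbms.databases.read_only=sales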

Multi-database and the reconciler

Databases operate as independent entities in a Neo4j DBMS, both in standalone deployments and in clusters. Since a cluster can consist of multiple independent server instances, the effects of administrative operations such as creating a new database happen asynchronously and independently for each server. However, the immediate effect of an administrative operation is to safely commit the desired state in the system database.

The desired state committed in the system database gets replicated and is picked up by an internal component called the reconciler. It runs on every instance and takes the appropriate actions required locally on that instance for reaching the desired state: creating, starting, stopping, and dropping databases.
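As an illustration, creating a database only records the desired state in the system database; the reconciler on each instance then brings that instance towards the desired state. The database name below is illustrative:

    // Run against the system database.
    CREATE DATABASE sales;
    // requestedStatus shows the committed desired state; currentStatus shows how far
    // the reconciler on each instance has progressed towards it.
    SHOW DATABASE sales;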

Every database runs in an independent Raft group. Since a fresh cluster contains two databases, system and neo4j, it also contains two Raft groups. Every Raft group has an independent leader, and thus a particular Core instance could be the leader for one database and a follower for another.

This does not apply to clusters where a Single instance is the Primary server. In such clusters, the Single instance is the leader of all databases and there is no Raft at all.

Server-side routing

Server-side routing is a complement to client-side routing.

In a Causal Cluster deployment of Neo4j, Cypher queries may be directed to a cluster member that is unable to run the given query. With server-side routing enabled, such queries will be rerouted internally to a cluster member that is expected to be able to run it. This situation can occur for write-transaction queries when they address a database for which the receiving cluster member is not the leader.

The cluster role for core cluster members is per database. Thus, if a write-transaction query is sent to a cluster member that is not the leader for the specified database (specified either via the Bolt Protocol or with the Cypher USE clause), server-side routing will be performed if properly configured.

Server-side routing is enabled at the DBMS level by setting dbms.routing.enabled=true on each cluster member. The listen address (dbms.routing.listen_address) and advertised address (dbms.routing.advertised_address) also need to be configured for server-side routing communication.

Client connections need to state that server-side routing should be used, and this is available for Neo4j Drivers and the HTTP API.

Neo4j Drivers can only use server-side routing when the neo4j:// URI scheme is used. The Drivers will not perform any routing when the bolt:// URI scheme is used, instead connecting directly to the specified host.
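As a sketch of the difference, using the Neo4j Java driver; the host name and credentials are illustrative:

    import org.neo4j.driver.AuthTokens;
    import org.neo4j.driver.Driver;
    import org.neo4j.driver.GraphDatabase;

    public class RoutingSchemeExample {
        public static void main(String[] args) {
            // neo4j:// scheme: the driver performs client-side routing and can
            // request server-side routing from the member it connects to.
            Driver routed = GraphDatabase.driver("neo4j://core01.example.com:7687",
                    AuthTokens.basic("neo4j", "secret"));

            // bolt:// scheme: the driver connects directly to the named host and
            // performs no routing at all.
            Driver direct = GraphDatabase.driver("bolt://core01.example.com:7687",
                    AuthTokens.basic("neo4j", "secret"));

            routed.close();
            direct.close();
        }
    }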

On the cluster side, you must fulfil the following prerequisites to make server-side routing available:

  • Set dbms.routing.enabled=true on each member of the cluster.

  • Configure dbms.routing.listen_address, and provide the advertised address using dbms.routing.advertised_address on each member.

  • Optionally, you can set dbms.routing.default_router=SERVER on each member of the cluster.

The final prerequisite enforces server-side routing on the clients by sending out a routing table with exactly one entry to the client. Therefore, dbms.routing.default_router=SERVER configures a cluster member to make its routing table behave like that of a standalone instance. The implication is that if a Neo4j Driver connects to this cluster member, then the driver sends all requests to that cluster member. Note that the default value is dbms.routing.default_router=CLIENT. See dbms.routing.default_router for more information.
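A minimal neo4j.conf sketch that satisfies these prerequisites; the addresses are illustrative and the last setting is optional:

    dbms.routing.enabled=true
    dbms.routing.listen_address=0.0.0.0:7688
    dbms.routing.advertised_address=core01.example.com:7688
    # Optional: hand out a single-entry routing table so that clients always use
    # server-side routing through this member (the default is CLIENT).
    dbms.routing.default_router=SERVER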

The HTTP API of each member benefits from these settings automatically.

The table below shows the criteria by which server-side routing is performed:

Table 1. Server-side routing criteria
(The first four columns describe the CLIENT, i.e. the Neo4j Driver and the Bolt Protocol; the last three columns describe the SERVER, i.e. the receiving Neo4j cluster member. The yes/no values follow the routing behaviour described above.)

URI scheme | Client-side routing | Request server-side routing | Transaction type | Server - Instance > Role (per database) | Server-side routing enabled | Routes the query
neo4j://   | yes                 | yes                         | write            | Primary - Single                        | yes                         | no
neo4j://   | yes                 | yes                         | read             | Primary - Single                        | yes                         | no
neo4j://   | yes                 | yes                         | write            | Primary - Core > leader                 | yes                         | no
neo4j://   | yes                 | yes                         | read             | Primary - Core > leader                 | yes                         | no
neo4j://   | yes                 | yes                         | write            | Primary - Core > follower               | yes                         | yes
neo4j://   | yes                 | yes                         | read             | Primary - Core > follower               | yes                         | no
neo4j://   | yes                 | yes                         | write            | Secondary - Read Replica                | yes                         | yes
neo4j://   | yes                 | yes                         | read             | Secondary - Read Replica                | yes                         | no
bolt://    | no                  | no                          | write            | Primary - Single                        | yes                         | no
bolt://    | no                  | no                          | read             | Primary - Single                        | yes                         | no
bolt://    | no                  | no                          | write            | Primary - Core > leader                 | yes                         | no
bolt://    | no                  | no                          | read             | Primary - Core > leader                 | yes                         | no
bolt://    | no                  | no                          | write            | Primary - Core > follower               | yes                         | no
bolt://    | no                  | no                          | read             | Primary - Core > follower               | yes                         | no
bolt://    | no                  | no                          | write            | Secondary - Read Replica                | yes                         | no
bolt://    | no                  | no                          | read             | Secondary - Read Replica                | yes                         | no

Server-side routing connector configuration

Rerouted queries are communicated over the Bolt Protocol using a designated communication channel. The receiving end of the communication is configured using the dbms.routing.listen_address and dbms.routing.advertised_address settings described above.

Server-side routing driver configuration

Server-side routing uses the Neo4j Java driver to connect to other cluster members. This internal driver is configured with its own group of DBMS settings, separate from the drivers used by client applications.
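As an illustrative sketch only: the setting names below assume that the internal driver is configured through a dbms.routing.driver.* group of settings, which is not confirmed by this section; check the configuration settings reference of your Neo4j version for the exact names and values:

    # Hypothetical examples of driver-specific settings (names and values are assumptions).
    dbms.routing.driver.connection.connect_timeout=5s
    dbms.routing.driver.logging.level=INFO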

Server-side routing encryption

Encryption of server-side routing communication is configured by the cluster SSL policy. For more information, see Cluster Encryption.
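A minimal sketch of such an SSL policy in neo4j.conf, assuming the dbms.ssl.policy.cluster.* settings; the directory and file names are illustrative:

    dbms.ssl.policy.cluster.enabled=true
    dbms.ssl.policy.cluster.base_directory=certificates/cluster
    dbms.ssl.policy.cluster.private_key=private.key
    dbms.ssl.policy.cluster.public_certificate=public.crt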

Store copy

Store copies are initiated when an instance does not have an up-to-date copy of the database. For example, this is the case when a new instance is joining a cluster (without a seed). It can also happen as a consequence of falling behind the rest of the cluster, for reasons such as connectivity issues or having been shut down. Upon re-establishing connection with the cluster, an instance recognizes that it is too far behind and fetches a new copy from the rest of the cluster.

A store copy is a major operation, which may disrupt the availability of instances in the cluster. Store copies should not be a frequent occurrence in a well-functioning cluster, but rather be an exceptional operation that happens due to specific causes, e.g. network outages or planned maintenance outages. If store copies happen during regular operation, then the configuration of the cluster, or the workload directed at it, might have to be reviewed so that all instances can keep up, and that there is enough of a buffer of Raft logs and transaction logs to handle smaller transient issues.

The protocol used for store copies is robust and configurable. The network requests are directed at an upstream member according to configuration, and they are retried despite transient failures. The maximum amount of time to retry every request can be configured with causal_clustering.store_copy_max_retry_time_per_request. If a request still fails once the maximum retry time has elapsed, then retrying stops and the store copy fails.

Use causal_clustering.catch_up_client_inactivity_timeout to configure the inactivity timeout for any particular request.

The causal_clustering.catch_up_client_inactivity_timeout configuration is for all requests from the catchup client, including the pulling of transactions.
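A neo4j.conf sketch combining the two settings; the values are illustrative:

    # Give up on a store copy if a single request cannot be completed within this time.
    causal_clustering.store_copy_max_retry_time_per_request=20m
    # Abort any catchup request (including transaction pulling) after this period of inactivity.
    causal_clustering.catch_up_client_inactivity_timeout=10m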

The default upstream strategy is not applicable to Single instances, and it differs between Core and Read Replica instances. Core instances always send the initial request to the leader in order to get the most up-to-date information about the store. For the subsequent file and index requests, Core instances alternate between a random Read Replica instance and a random Core instance.

Read Replica instances use the same strategy for store copies as they use for pulling transactions. The default is to pull from a random Core instance.

If you are running a multi-datacenter cluster, then upstream strategies for both Core and Read Replica instances can be configured. Remember that for Read Replica instances, this also affects from where transactions are pulled. See more in Configure for multi-data center operations.
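As an illustrative sketch only, assuming the multi-data center settings causal_clustering.multi_dc_license, causal_clustering.server_groups, and causal_clustering.upstream_selection_strategy; the server group name is an example, and the strategy value shown is only one of the supported values described in Configure for multi-data center operations:

    causal_clustering.multi_dc_license=true
    causal_clustering.server_groups=us_east
    # Prefer upstream members from the local server group when catching up.
    causal_clustering.upstream_selection_strategy=connect-randomly-within-server-group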

Using a Read Replica instance in case of failure

In case of failure (e.g. a partial failure of a cluster due to the loss of an instance, but not of the majority), you may transform a Read Replica instance into a Core instance as a way to restore the cluster’s Core availability. However, keep in mind that this must be done carefully, as mistakes can cause data loss and complications in the Raft group.

In particular, the read_replica instance must not be initialized as a single instance, nor be introduced into a different or new cluster. Either action would overwrite its Raft state and prevent the replica from successfully joining the targeted cluster.

To perform this change, follow these instructions to unbind the Read Replica instance and update the discovery configuration amongst the cluster members:

  1. Ensure that the read_replica to be converted currently belongs to the same cluster that it will be re-introduced to as a Core. This can be done by running CALL dbms.cluster.overview() and verifying the instance’s address and role.

  2. Stop and unbind the read_replica instance (an illustrative command sequence for steps 2-5 follows this list).

  3. Update the cluster mode configuration in neo4j.conf, from dbms.mode=READ_REPLICA to dbms.mode=CORE.

  4. Stop Neo4j on the removed core instances that are not intended to serve as core members.

  5. Unbind those instances from the cluster by performing neo4j-admin unbind while they are stopped. This action will prevent such instances from subsequently attempting to rejoin the running cluster.
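An illustrative command sequence for steps 2-5, run from the Neo4j home directory of each instance concerned:

    # Steps 2-3: on the read_replica that is being converted.
    bin/neo4j stop
    bin/neo4j-admin unbind
    # Then edit neo4j.conf: change dbms.mode=READ_REPLICA to dbms.mode=CORE.

    # Steps 4-5: on each removed Core that must not rejoin the cluster.
    bin/neo4j stop
    bin/neo4j-admin unbind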

At this point, the previous read_replica (now core) instance may be introduced into the running cluster. To persist this change in the cluster’s architecture, the following configuration updates are advised:

  • On the previous read_replica (now core) instance, set causal_clustering.discovery_advertised_address and causal_clustering.discovery_listen_address as appropriate (an illustrative configuration sketch follows this list).

  • Update the causal_clustering.initial_discovery_members configuration with the currently valid list of discovery addresses for each member of the cluster. This should replace the addresses of any removed core(s) with the discovery addresses of the previous read_replica (now core) instance.

    In cases where causal_clustering.discovery_type is other than LIST, make sure to update the corresponding address resolution records (for example, DNS A records for the discovery types DNS and SRV, or the relevant Kubernetes service) to reflect the inclusion of the read_replica discovery address.
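An illustrative neo4j.conf sketch for these updates; the host names, ports, and member list are examples only:

    # On the previous read_replica (now core) instance:
    causal_clustering.discovery_listen_address=0.0.0.0:5000
    causal_clustering.discovery_advertised_address=replica01.example.com:5000

    # On every cluster member, replacing the removed Core(s) with the converted instance:
    causal_clustering.initial_discovery_members=core01.example.com:5000,core02.example.com:5000,replica01.example.com:5000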

On-disk state

The on-disk state of cluster instances is different from that of standalone instances. The biggest difference is the existence of an additional cluster state. Most of the files there are relatively small, but the Raft logs can become quite large depending on the configuration and workload.

It is important to understand that once a database has been extracted from a cluster and used in a standalone deployment, it must not be put back into an operational cluster.

If you try to reinsert a modified database back into the cluster, then the logs and stores will mismatch. Operators should not try to merge standalone databases into the cluster in the optimistic hope that their data will become replicated. That does not happen; instead, it likely leads to unpredictable cluster behavior.