Monitor cluster endpoints for status information
A Causal Cluster exposes HTTP endpoints that can be used to monitor the health of the cluster. This section describes these endpoints and explains their semantics.
Adjusting security settings for Causal Clustering endpoints
If authentication and authorization are enabled in Neo4j, the Causal Clustering status endpoints will also require authentication credentials.
The setting dbms.security.auth_enabled controls whether the native auth provider is enabled.
For some load balancers and proxy servers, providing authentication credentials with the request is not an option.
For those situations, consider disabling authentication of the Causal Clustering status endpoints by setting dbms.security.causal_clustering_status_auth_enabled=false in neo4j.conf.
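Where a client is able to send credentials, they can travel with the request in an HTTP `Authorization` header. The following is a minimal sketch using only the Python standard library. It assumes the endpoint accepts HTTP Basic authentication, and the host, user name, and password are placeholder values rather than anything defined in this section.

```python
import base64
import urllib.request


def build_status_request(base_url: str, database: str,
                         user: str, password: str) -> urllib.request.Request:
    """Build a GET request for a database's cluster status endpoint.

    Credentials are attached as an HTTP Basic Authorization header
    (assumption: the server accepts Basic authentication).
    """
    url = f"{base_url.rstrip('/')}/db/{database}/cluster/status"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    return request


# Placeholder host and credentials, for illustration only.
req = build_status_request("http://localhost:7474", "neo4j", "monitor", "secret")
print(req.full_url)
```

The request object can then be passed to `urllib.request.urlopen` when a cluster is actually reachable.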
Unified endpoints
A unified set of endpoints exists, both on Core Servers and on Read Replicas, with the following behavior:
- /db/<databasename>/cluster/writable: Used to direct write traffic to specific instances.
- /db/<databasename>/cluster/read-only: Used to direct read traffic to specific instances.
- /db/<databasename>/cluster/available: Available for the general case of directing arbitrary request types to instances that are available for processing read transactions.
- /db/<databasename>/cluster/status: Gives a detailed description of this instance’s view of its status within the cluster, for the given database.
- /dbms/cluster/status: Gives a detailed description of this instance’s view of its status within the cluster, for all databases. Useful for monitoring and coordinating rolling upgrades. See Status endpoints for further details.
Every /db/<databasename>/* endpoint targets a specific database.
The databaseName path parameter represents the name of the database.
By default, a fresh Neo4j installation with two databases, system and neo4j, will have the following cluster endpoints:
http://localhost:7474/dbms/cluster/status
http://localhost:7474/db/system/cluster/writable
http://localhost:7474/db/system/cluster/read-only
http://localhost:7474/db/system/cluster/available
http://localhost:7474/db/system/cluster/status
http://localhost:7474/db/neo4j/cluster/writable
http://localhost:7474/db/neo4j/cluster/read-only
http://localhost:7474/db/neo4j/cluster/available
http://localhost:7474/db/neo4j/cluster/status
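The same list can be generated for any set of databases. The sketch below builds the URLs from the path patterns described above; the function name and the `DATABASE_ENDPOINTS` constant are illustrative, not part of Neo4j.

```python
from typing import List

# Per-database endpoint names, in the order listed in this section.
DATABASE_ENDPOINTS = ("writable", "read-only", "available", "status")


def cluster_endpoints(base_url: str, databases: List[str]) -> List[str]:
    """Return the combined status URL followed by every per-database URL."""
    base = base_url.rstrip("/")
    urls = [f"{base}/dbms/cluster/status"]
    for database in databases:
        for endpoint in DATABASE_ENDPOINTS:
            urls.append(f"{base}/db/{database}/cluster/{endpoint}")
    return urls


for url in cluster_endpoints("http://localhost:7474", ["system", "neo4j"]):
    print(url)
```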
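Endpoint | Instance state | Returned code | Body text |
---|---|---|---|
/db/<databasename>/cluster/writable | Leader | | |
/db/<databasename>/cluster/writable | Follower | | |
/db/<databasename>/cluster/writable | Read Replica | | |
/db/<databasename>/cluster/read-only | Leader | | |
/db/<databasename>/cluster/read-only | Follower | | |
/db/<databasename>/cluster/read-only | Read Replica | | |
/db/<databasename>/cluster/available | Leader | | |
/db/<databasename>/cluster/available | Follower | | |
/db/<databasename>/cluster/available | Read Replica | | |
/db/<databasename>/cluster/status | Leader | | JSON - See Status endpoint for details. |
/db/<databasename>/cluster/status | Follower | | JSON - See Status endpoint for details. |
/db/<databasename>/cluster/status | Read Replica | | JSON - See Status endpoint for details. |
/dbms/cluster/status | Leader | | JSON - See Status endpoint for details. |
/dbms/cluster/status | Follower | | JSON - See Status endpoint for details. |
/dbms/cluster/status | Read Replica | | JSON - See Status endpoint for details. |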
From the command line, a common way to query these endpoints is to use curl.
With no arguments, curl performs an HTTP GET on the URI provided and outputs the body text, if any.
To also see the response code, add the -v flag for verbose output.
For example:
- Requesting the writable endpoint on a Core Server that is currently the elected leader, with verbose output:
#> curl -v localhost:7474/db/neo4j/cluster/writable
* About to connect() to localhost port 7474 (#0)
* Trying ::1...
* connected
* Connected to localhost (::1) port 7474 (#0)
> GET /db/neo4j/cluster/writable HTTP/1.1
> User-Agent: curl/7.24.0 (x86_64-apple-darwin12.0) libcurl/7.24.0 OpenSSL/0.9.8r zlib/1.2.5
> Host: localhost:7474
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: text/plain
< Access-Control-Allow-Origin: *
< Transfer-Encoding: chunked
< Server: Jetty(9.4.17)
<
* Connection #0 to host localhost left intact
true* Closing connection #0
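A monitoring script usually needs both the response code and the body text. The following sketch does the equivalent of the curl call above with the Python standard library; the `probe` function name is illustrative. Because `urllib` raises an exception for 4xx and 5xx responses, the code and body are read from the exception in that case.

```python
import urllib.error
import urllib.request
from typing import Tuple


def probe(url: str, timeout: float = 5.0) -> Tuple[int, str]:
    """Send an HTTP GET and return (response code, body text).

    Non-2xx responses are returned the same way instead of being raised,
    so the caller can inspect the code of any endpoint response.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8")
    except urllib.error.HTTPError as error:
        # The error object still carries the response code and body.
        return error.code, error.read().decode("utf-8")
```

Against the leader shown in the curl output above, `probe("http://localhost:7474/db/neo4j/cluster/writable")` would be expected to return `(200, "true")`. Connection failures (for example, an instance that is down) still raise `urllib.error.URLError` and need separate handling.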
Status endpoints
The status endpoint, available at /db/<databasename>/cluster/status, is intended to assist with rolling upgrades.
For more information, see Upgrade and Migration Guide → Upgrade a Causal Cluster.
Typically, you will want some guarantee that a Core is safe to shut down for each database before removing it from a cluster. Counterintuitively, a Core being safe to shut down means that a majority of the other Cores are healthy, caught up, and have recently heard from that database’s leader. The status endpoints provide the following information to help you make that assessment.
Several of the fields in status endpoint responses refer to details of Raft, the algorithm used in Neo4j Causal Clusters to provide highly available transactions.
When using multiple databases, each database implements Raft independently.
Therefore, the Raft-related details in a response apply only to the database that was queried.
{
"lastAppliedRaftIndex":0,
"votingMembers":["30edc1c4-519c-4030-8348-7cb7af44f591","80a7fb7b-c966-4ee7-88a9-35db8b4d68fe","f9301218-1fd4-4938-b9bb-a03453e1f779"],
"memberId":"80a7fb7b-c966-4ee7-88a9-35db8b4d68fe",
"leader":"30edc1c4-519c-4030-8348-7cb7af44f591",
"millisSinceLastLeaderMessage":84545,
"participatingInRaftGroup":true,
"core":true,
"isHealthy":true,
"raftCommandsPerSecond":124
}
Field | Type | Optional | Example | Description |
---|---|---|---|---|
core | boolean | no | | Used to distinguish between Core Servers and Read Replicas. |
lastAppliedRaftIndex | number | no | | Every transaction in a cluster is associated with a raft index. Gives an indication of what the latest applied raft log index is. |
participatingInRaftGroup | boolean | no | | A participating member is able to vote. A Core is considered participating when it is part of the voter membership and has kept track of the leader. |
votingMembers | string[] | no | | A member is considered a voting member when the leader has been receiving communication with it. List of member’s |
healthy | boolean | no | | Reflects that the local database of this member has not encountered a critical error preventing it from writing locally. |
memberId | string | no | | Every member in a cluster has its own unique member id to identify it. Use |
leader | string | yes | | Follows the same format as |
millisSinceLastLeaderMessage | number | yes | | The number of milliseconds since the last heartbeat-like leader message. Not relevant to Read Replicas, and hence is not included. |
raftCommandsPerSecond | number | yes | | An estimate of the average Raft state machine throughput over a sampling window configurable via |
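To work with these fields in a script, the response can be parsed into a typed structure. The sketch below is one possible shape, with key names taken from the sample response earlier in this section; the `InstanceStatus` class and `parse_status` function are illustrative names. The optional fields default to `None`, and because the samples in this section show the health flag both as `isHealthy` and as `healthy`, the parser accepts either key.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InstanceStatus:
    core: bool
    last_applied_raft_index: int
    participating_in_raft_group: bool
    voting_members: List[str]
    healthy: bool
    member_id: str
    # Optional fields: absent keys are represented as None.
    leader: Optional[str] = None
    millis_since_last_leader_message: Optional[int] = None
    raft_commands_per_second: Optional[float] = None


def parse_status(text: str) -> InstanceStatus:
    """Parse the JSON body of a single-database status response."""
    raw = json.loads(text)
    # Accept both spellings of the health flag seen in the samples.
    healthy = raw["healthy"] if "healthy" in raw else raw["isHealthy"]
    return InstanceStatus(
        core=raw["core"],
        last_applied_raft_index=raw["lastAppliedRaftIndex"],
        participating_in_raft_group=raw["participatingInRaftGroup"],
        voting_members=list(raw["votingMembers"]),
        healthy=healthy,
        member_id=raw["memberId"],
        leader=raw.get("leader"),
        millis_since_last_leader_message=raw.get("millisSinceLastLeaderMessage"),
        raft_commands_per_second=raw.get("raftCommandsPerSecond"),
    )
```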
After an instance has been switched on, you can access the status endpoint in order to make sure all the guarantees listed in the table below are met.
To get the most accurate view of a cluster, it is strongly recommended to access the status endpoint on all Core members and compare the results. The following table explains how the results can be compared.
Name of check | Method of calculation | Description |
---|---|---|
 | Every Core’s status endpoint indicates | We want to make sure the data across the entire cluster is healthy. Whenever any Cores are false that indicates a larger problem. |
 | For any 2 Cores (A and B), status endpoint A’s | When the voting begins, all the Cores are equal to each other, and you know all members agree on membership. |
 | For all Cores (S), excluding Core Z (to be switched off), every member in S contains S in their voting set. Membership is determined by using the | Sometimes network conditions will not be perfect and it may make sense to switch off a different Core to the one we originally wanted to switch off. If you run this check for all Cores, the ones that match this condition can be switched off (providing other conditions are also met). |
 | For any 2 Cores (A and B), | If the leader is different then there may be a partition (alternatively, this could also occur due to bad timing). If the leader is unknown, that means the leader messages have actually timed out. |
 | For Core A with | If there is a large difference in the applied indexes between Cores, then it could be dangerous to switch off a Core. |
For example, you observe the metric (the difference between the maximum and minimum |
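The comparisons described in the table can be sketched as a set of small functions over the parsed status responses of all Cores for one database. This is an illustration under stated assumptions, not an official implementation: the function names are invented here, the field names come from the sample responses in this section, and `lag_threshold` is a hypothetical value that you would choose for your own cluster.

```python
from typing import Any, Dict, List

Status = Dict[str, Any]  # one parsed status response


def is_healthy(status: Status) -> bool:
    # The samples in this section use both "healthy" and "isHealthy".
    return bool(status.get("healthy", status.get("isHealthy", False)))


def all_cores_healthy(statuses: List[Status]) -> bool:
    return all(is_healthy(s) for s in statuses)


def voting_sets_equal(statuses: List[Status]) -> bool:
    # Every Core must report the same set of voting members.
    return len({frozenset(s["votingMembers"]) for s in statuses}) <= 1


def remaining_cores_in_voting_sets(statuses: List[Status], target_member_id: str) -> bool:
    # Excluding the Core to be switched off, every remaining Core must
    # list all remaining Cores in its voting set.
    remaining = [s for s in statuses if s["memberId"] != target_member_id]
    remaining_ids = {s["memberId"] for s in remaining}
    return all(remaining_ids <= set(s["votingMembers"]) for s in remaining)


def one_known_leader(statuses: List[Status]) -> bool:
    # All Cores must report the same leader, and it must not be unknown.
    leaders = {s.get("leader") for s in statuses}
    return len(leaders) == 1 and None not in leaders


def applied_index_lag(statuses: List[Status]) -> int:
    # Difference between the maximum and minimum applied Raft index.
    indexes = [s["lastAppliedRaftIndex"] for s in statuses]
    return max(indexes) - min(indexes) if indexes else 0


def safe_to_switch_off(statuses: List[Status], target_member_id: str,
                       lag_threshold: int) -> bool:
    # lag_threshold is a hypothetical, user-chosen limit on index lag.
    return (
        bool(statuses)
        and all_cores_healthy(statuses)
        and voting_sets_equal(statuses)
        and remaining_cores_in_voting_sets(statuses, target_member_id)
        and one_known_leader(statuses)
        and applied_index_lag(statuses) < lag_threshold
    )
```

The input is expected to be one status per Core, all for the same database, collected at roughly the same time. A result of `False` may also reflect bad timing rather than a real problem, so repeating the check is reasonable.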
Combined status endpoints
When using the status endpoints to support a rolling upgrade, you need to assess whether a Core is safe to shut down for all databases.
To avoid having to issue a separate request to each /db/<databasename>/cluster/status endpoint, you can use the /dbms/cluster/status endpoint instead.
This endpoint returns a JSON array. Each element contains the fields databaseName and databaseUuid, along with a databaseStatus object that holds the same fields as the single-database version.
[
{
"databaseName": "neo4j",
"databaseUuid": "f4dacc01-f88a-4512-b3bf-68f7539c941e",
"databaseStatus": {
"lastAppliedRaftIndex": -1,
"votingMembers": [
"0cff51ad-7cee-44cc-9102-538fc4544b95",
"90ff5df1-f5f8-4b4c-8289-a0e3deb2235c",
"99ca7cd0-6072-4387-bd41-7566a98c6afc"
],
"memberId": "90ff5df1-f5f8-4b4c-8289-a0e3deb2235c",
"leader": "90ff5df1-f5f8-4b4c-8289-a0e3deb2235c",
"millisSinceLastLeaderMessage": 0,
"raftCommandsPerSecond": 0.0,
"core": true,
"participatingInRaftGroup": true,
"healthy": true
}
},
{
"databaseName": "system",
"databaseUuid": "00000000-0000-0000-0000-000000000001",
"databaseStatus": {
"lastAppliedRaftIndex": 7,
"votingMembers": [
"0cff51ad-7cee-44cc-9102-538fc4544b95",
"90ff5df1-f5f8-4b4c-8289-a0e3deb2235c",
"99ca7cd0-6072-4387-bd41-7566a98c6afc"
],
"memberId": "90ff5df1-f5f8-4b4c-8289-a0e3deb2235c",
"leader": "90ff5df1-f5f8-4b4c-8289-a0e3deb2235c",
"millisSinceLastLeaderMessage": 0,
"raftCommandsPerSecond": 0.0,
"core": true,
"participatingInRaftGroup": true,
"healthy": true
}
}
]
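A combined response like the one above can be split back into one status per database. The following sketch indexes the array by `databaseName` and reports databases whose status is not healthy; the function names are illustrative, and the health flag is read from either `healthy` or `isHealthy` because both spellings appear in the samples in this section.

```python
import json
from typing import Any, Dict, List


def statuses_by_database(text: str) -> Dict[str, Dict[str, Any]]:
    """Map each databaseName to its databaseStatus object."""
    return {entry["databaseName"]: entry["databaseStatus"]
            for entry in json.loads(text)}


def unhealthy_databases(text: str) -> List[str]:
    """Return the sorted names of databases whose status is not healthy."""
    return sorted(
        name
        for name, status in statuses_by_database(text).items()
        if not status.get("healthy", status.get("isHealthy", False))
    )
```

For the sample response above, `unhealthy_databases` would return an empty list, since both `neo4j` and `system` report `"healthy": true`.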