Graph databases are an excellent way to model relationships and explore them quickly and easily.
The most common one is Neo4j, which is without a doubt the leading pick today.
Monitoring such a database usually relies on built-in tools; in the Neo4j case that would be ‘Neo4j Ops Manager’.
Neo4j Ops Manager (abbreviated to NOM) is a tool provided to assist administrators of Neo4j DBMS deployments. It contains current and future features that allow the administrator to monitor, administer and operate these deployments within their estate.
Another option is to export metrics into Prometheus and then visualise them using Grafana.
Neo4j exposes a lot of metrics about itself, such as:
- The server load — the strain on the machine hosting Neo4j.
- The Neo4j load — the strain on Neo4j.
- The cluster health — to ensure the cluster is working as expected.
- The workload of a Neo4j instance.
Additional metrics…
Out of the hundreds (!) of metrics available to us, we found that something was still missing.
One of those gaps is the ability to monitor the queries running inside Neo4j and to alert based on them.
We have long observed that we have backend queries that can run for up to 20 hours. Let’s not get into what those queries are or why they take so long, and accept it as a given for now.
Setting out on a mission to write our own exporter had its difficulties, from choosing a language to choosing a framework. Eventually we converged on a Python exporter running Flask.
A Neo4j cluster contains CORE nodes and REPLICA nodes.
We added another deployment to this namespace (apart from the Helm chart for Neo4j) to deploy our code. A link to the code is below.
Implementation
The first thing we need to do is find out the namespace in which the pod with our application is running. By default, the namespace is available in the file /var/run/secrets/kubernetes.io/serviceaccount/namespace in every container running in a Kubernetes cluster. Let’s assign this value to a variable:
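A minimal sketch of that step in Python (the file path is the standard one mounted by Kubernetes; the variable name is ours):

```python
# Read the namespace this pod runs in from the file that Kubernetes mounts
# into every container via the service-account volume.
with open("/var/run/secrets/kubernetes.io/serviceaccount/namespace") as f:
    namespace = f.read().strip()
```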
Next, we can start generating a /metrics page that Prometheus will poll periodically to collect our metrics. Flask lets you do this very simply, something like this:
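A minimal sketch of such a Flask app; the port and the placeholder metric are illustrative, not the exact values from our exporter:

```python
from flask import Flask

app = Flask(__name__)

@app.route("/metrics")
def metrics():
    # Placeholder body: the real metric text is assembled later in the article.
    return "neo4j_exporter_up 1\n"

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```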
We can make requests to Neo4j and output the results in a format suitable for Prometheus using the client_python library (https://github.com/prometheus/client_python).
In practice, however, polling the entire Neo4j cluster of 7 cores and 3 replicas takes quite a long time, longer than Prometheus is willing to wait for the /metrics page to be generated.
As a result, we got gaps in the data series, which looked deceptively reassuring: judging by the monitoring there were no slow queries at a given point in time and everything was fine, when in fact we simply had not waited long enough for the page with the slow-query metrics.
So instead we collect the metrics in the background, on a schedule. In Python, you can implement this in various ways; we chose this construction:
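One possible construction, using a daemon thread; collect_metrics() is a placeholder name for the collection logic described in the rest of the article:

```python
import threading
import time

COLLECT_INTERVAL_SECONDS = 240  # how often the metrics are refreshed

def collect_metrics_loop():
    while True:
        try:
            collect_metrics()  # gathers database statuses and slow queries (see below)
        except Exception as e:
            print(f"metric collection failed: {e}")
        time.sleep(COLLECT_INTERVAL_SECONDS)

# Run the collector as a daemon thread so it never blocks the Flask server.
threading.Thread(target=collect_metrics_loop, daemon=True).start()
```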
This function will be executed in the background once every 240 seconds. Next, we will start collecting metrics.
Database Statuses
The first thing that we would like to have in monitoring, and which is not available out of the box, is database statuses. A database may go into a state other than “online”, and we would like monitoring to notify us about this.
To make requests to Neo4j, let’s use the py2neo library. With the following request, we get the current status of every database on every node of the Neo4j cluster:
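A sketch of that request with py2neo; the connection details come from environment variables whose names are placeholders here, and the administrative SHOW DATABASES command is executed against the system database:

```python
import os
from py2neo import Graph

# Administrative commands such as SHOW DATABASES run against the system database.
graph = Graph(
    os.environ["NEO4J_URI"],
    auth=(os.environ["NEO4J_USER"], os.environ["NEO4J_PASSWORD"]),
    name="system",
)

# SHOW DATABASES returns one row per database per cluster member,
# including its current status ("online", "offline", ...).
db_statuses = graph.run("SHOW DATABASES").data()
```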
At this stage, we have to solve another problem: handling the situation where, for some reason, we could not connect to Neo4j or the request failed with an error. When the cluster is under heavy load, this happens periodically. We can use a try-except construct:
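For example, reusing the graph object from the previous snippet:

```python
try:
    db_statuses = graph.run("SHOW DATABASES").data()
except Exception as e:
    # Under heavy load the connection or the query occasionally fails;
    # log the error and skip this collection cycle instead of crashing.
    print(f"failed to fetch database statuses: {e}")
    db_statuses = []
```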
For the print() command to send its output not only to the screen but also to the Pod logs, set the environment variable PYTHONUNBUFFERED=1 (ENV PYTHONUNBUFFERED=1 in your Dockerfile).
Next, we record the database statuses in a format suitable for Prometheus. Let’s define the data type for the metric:
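A sketch using a client_python Gauge; the metric and label names are our own choice, not something Neo4j prescribes:

```python
from prometheus_client import CollectorRegistry, Gauge

registry = CollectorRegistry()

# 1 if the database is online on the given cluster member, 0 otherwise.
neo4j_database_status = Gauge(
    "neo4j_database_status",
    "Neo4j database status (1 = online, 0 = any other state)",
    ["database", "address", "role"],
    registry=registry,
)
```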
Let’s loop through all the results of the query and set the metric value to 1 if the database is “online” and 0 otherwise:
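Something like this, assuming the column names that SHOW DATABASES returns on Neo4j 4.x (name, address, role, currentStatus):

```python
for row in db_statuses:
    # Each row describes one database on one cluster member.
    value = 1 if row["currentStatus"] == "online" else 0
    neo4j_database_status.labels(
        database=row["name"],
        address=row["address"],
        role=row["role"],
    ).set(value)
```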
Long-running queries
The second important thing is monitoring slow queries; we will monitor queries running for more than 10 seconds. First, let’s define a variable for storing such queries:
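For example, a module-level string that accumulates the slow-query samples in the Prometheus text format (the variable name is ours):

```python
# Holds the slow-query metric lines gathered during the latest background pass;
# the /metrics page serves this string later on.
slow_queries_metrics = ""
```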
Secondly, slow queries are executed on different nodes of the cluster, and to collect them all we need to query each core and each replica separately. Let’s create a loop that iterates through the values obtained from environment variables:
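A sketch of such a loop; the environment variable names used here to pass the core and replica hosts are placeholders for whatever your deployment provides:

```python
import os
from py2neo import Graph

# Comma-separated host lists injected by the deployment (placeholder names).
hosts = (os.environ.get("NEO4J_CORE_HOSTS", "").split(",")
         + os.environ.get("NEO4J_REPLICA_HOSTS", "").split(","))

collected = ""  # slow-query lines gathered during this pass
for host in filter(None, hosts):
    # Connect to each core/replica directly so we see the queries running on it.
    node_graph = Graph(
        f"bolt://{host}:7687",
        auth=(os.environ["NEO4J_USER"], os.environ["NEO4J_PASSWORD"]),
    )
    # ... the slow-query request for this node follows in the next snippets ...
```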
And the Neo4j request that fetches the list of slow queries looks like this:
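One way to express it on Neo4j 4.x is the dbms.listQueries() procedure (on Neo4j 5 this procedure is gone and SHOW TRANSACTIONS is used instead); the 10-second threshold becomes 10000 milliseconds:

```python
SLOW_QUERIES_CYPHER = """
CALL dbms.listQueries()
YIELD query, username, elapsedTimeMillis
WHERE elapsedTimeMillis > 10000
RETURN query, username, elapsedTimeMillis
"""

# Executed against the node_graph connection opened in the loop above.
slow_queries = node_graph.run(SLOW_QUERIES_CYPHER).data()
```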
In the end, similar to the statuses of databases, we iterate through the results and put them in a variable:
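A sketch of that step, still inside the per-node loop; the sample name and labels are again our own choice:

```python
# Turn each slow query on this node into one Prometheus sample line.
node_lines = ""
for row in slow_queries:
    node_lines += (
        f'neo4j_slow_query{{host="{host}", username="{row["username"]}"}} '
        f'{row["elapsedTimeMillis"]}\n'
    )
```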
With each iteration of the loop, we write data to the resulting variable:
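Roughly like this (slow_queries_metrics is the variable defined above, declared global at the top of the collector function):

```python
# Inside the per-node loop: add this node's samples to the running total.
collected += node_lines

# After the loop has visited every core and replica, publish the fresh result
# so that the /metrics page serves it.
slow_queries_metrics = collected
```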
Generating a page with metrics
And finally, we generate the page with metrics for Prometheus at /metrics:
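A sketch of the final route, superseding the placeholder shown earlier: it combines the client_python registry (database statuses) with the manually assembled slow-query lines:

```python
from flask import Response
from prometheus_client import generate_latest

@app.route("/metrics")
def metrics():
    # Database statuses come from the client_python registry; the slow-query
    # lines were assembled by the background collector.
    payload = generate_latest(registry).decode("utf-8") + slow_queries_metrics
    return Response(payload, mimetype="text/plain")
```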
Code and Dashboard
The full program can be found in this repository, including a dedicated Grafana dashboard:
Conclusion
Thus, by adding this small Pod with an exporter to each of our Neo4j clusters, we automatically extended our existing monitoring. The development team can now observe potentially problematic long-running queries in real time, and the SRE team receives notifications about databases in a non-online state.