Why is Spring’s Health Down, Down, Up, Up, Up and Down again?

Why

Our new JavaScript client application regularly calls the /health endpoint of our Grails backend to determine on- of offline state. Things started to become “funny” with it.

This endpoint we get for free, since Grails is based on Spring Boot, which comes with a sub-project called Spring Boot Actuator.

This gives us a a bunch of endpoints which allows us to monitor and interact with our application, including /health which returns health information.

So, our JS client checks whether or not it can reach this /health endpoint, executed every few seconds, to determine if the user is on- or offline. Nothing fancy, and we might switch later on to just using the Google homepage or something, but for now this works.

Failing health check

On localhost everything always seems fine, but as soon as I got our Jenkins pipeline finally to deploy the app to to our test servers after each build, and we started verifying the app there, things became funny.

Usually we had a streak of perfectly good calls.

GET https://tst.example.com/health 200 ()
GET https://tst.example.com/health 200 ()
GET https://tst.example.com/health 200 ()
etc

Other times every few seconds we saw errors accumulating in the Chrome Inspector. Health checks would fail with with a HTTP status code of 503 Service unavailable for a long time.

GET https://tst.example.com/health 503 ()
GET https://tst.example.com/health 503 ()
GET https://tst.example.com/health 503 ()
etc

Then after a while we would get good calls again!

GET https://tst.example.com/health 200 ()
GET https://tst.example.com/health 200 ()
etc

The response of these failed requests just said

{"status":"DOWN"}

This is — by design — not very descriptive.

I certainly did not write any healh indicators myself so why would it be “down”?

Experienced Spring Booters know it will pick up any health indicator on the classpath and comes default with a few. Which ones are actually in use can be a mystery, because by default this endpoint is classified by Spring Boot as “sensitive” — and thus doesn’t expose too much information to the outside world.

I had to make the health check a bit more “chatty” by setting the following setting:

endpoints.health.sensitive: false

Now, calling the endpoint manually revealed the contenders!

{
  "status":"DOWN",
  "diskSpace":{
    "status":"DOWN",
    "total":8579448832,
    "free":20480,
    "threshold":10485760
  },
  "db":{
    "status":"UP",
    "database":"H2",
    "hello":1
  }
}

The general status of “down” is an aggregate result of (in this case: 2) auto-configured health indicators listed explicitly now.

What inmediately came to mind, when I saw this:

  • Why didn’t I remove H2 yet ๐Ÿ™‚
  • Hey, disk space is running out on the test server already?!

The H2 database comes as a default dependency in any Grails application, but our app doesn’t use it — not in production and not for testing — so we will definately remove it from the dependencies. That’s a worry less.

With regard to disk space, it’s the good ol’ DiskSpaceHealthIndicator (indeed part of the auto-configured indicators) telling me things are unhealthy.

It has a default threshold of 10485760 bytes or 10 MB — the minimum disk space that should be available.

And…there’s only 20 kb free space? Of 8 gigs in total.

That’s a pretty low number ๐Ÿ™‚

In the first 0.7 seconds I didn’t believe the healt indicator, can you imagine?

So I SSH’ed into the test server to check the available disk space with the df utility:

[Ted@server-01t ~]$ df -h
Filesystem             Size  Used Avail Use% Mounted on
/dev/mapper/rhel-root  8.0G  8.0G   20K 100% /
...

Right, at least the health check speaks the truth there: there’s actually only a tiny bit of space left.

I relayed this to my IT collegue which provisioned this machine, to investigate. Seemed that there were already some Java heap dumps from earlier experiments taking up the space — which I was told will be removed ASAP.

Better check the other node too.

[Ted@server-02t ~]$ df -h
Filesystem             Size  Used Avail Use% Mounted on
/dev/mapper/rhel-root  8.0G  5.3G  2.8G  66% /

Enough room there.

Wait a minute? “Other node?” Yes, we have 2 test servers, 01t and 02t.

At that point, I realized: the behaviour I was seeing was because of the loadbalancer forwarding a request to tst.example.com to either server-01t or the other server-02t. One of them was low on disk space, which explains that the health indicator of the Grails app on that server says “down” – resulting in a HTTP 503.

When observing these health calls (which requests are continuously made by our JS client) through the Chrome Inspector one small question was left: why do we have a streak of (sometimes 50x) “ups” (200) and then a bunch of “downs” (503) then in a seemingly random order?

The load balancer should keep us “fixed” on that node where a JS client for the first time makes its requests, as we configure our servers like that.

If the loadbalancer would send every request (to tst.example.com) round robin to server 1 or 2, I would expect a more (random) response of e.g. “up”, “down”, “down”, “up”, “down”, “up”, “up”, “down”, “up”.

Well, it seemed that during the window while I was observing this behaviour, the rest of the team was still developing features and…pushing to Git, which Jenkins picks up, which gets deployed to both servers. Because of a redeploy of the app to ech server serially, the loadbalancer “sees” the unavailibility of the application on the one server (with enough disk space: “up”, “up”, “up”, “up”, “up”) for the duration of the deployment and redirects traffic to the other server (with almost no disk space: “down”, “down”, “down”)…

…which gets updated with a new WAR pretty soon after, and requests end up on the other server again (with enough disk space: “up”, “up”, “up”, “up”, “up”).

๐Ÿ™‚

Costs again 3 hours out of my life. Including some time noting down this stuff here (but I think that’s worth it) ๐Ÿ™‚

Lesson learned

Know your process

Knowing that there’s a loadbalancer and multiple nodes (and how they work) helps. And that your CI server continuously deploys new versions to your environment which is under investigation does not help. But altogether knowing this did help to clarify the observed behaviour.

Learn the “sensible” defaults of your framework.

In case of Grails 3 and Spring Boot, know the stuff which gets “auto-configured” from the classpath, inspect it and make sure it’s going to be what you actually want.

We will get rid of H2 and review the health indicators we actually need, may be disabling the auto-configuration altogether. We cleaned up the Java heap dumps which caused the full disk. We’ve re-confirmed that the Unix team will monitor the OS, including disk space, so that we at least don’t need the DiskSpaceHealthIndicator anymore ๐Ÿ™‚

3 thoughts on “Why is Spring’s Health Down, Down, Up, Up, Up and Down again?

  1. Familiar situation. Also, these /health checks can get rather expensive or time-consuming with defaults on the classpath.
    In a microservice project, one mircoservice would check the /health on some other microservices, which would check their stuff like database connections, diskspace, whether the internet was up and maybe even the phase of the moon…

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s