How do we monitor your website for downtime?
Written by: Peter Steenbergen
Co-Founder / Development
A short while ago there was an outage on a website that we monitor for one of our clients. His response was the following: “UptimeRobot & Pingdom did not recognize the downtime of our website, but UptimeMate did notice the outage. How is this even possible?!”.
There are numerous possible explanations for this. One of them is that we monitor from different locations and with multiple kinds of checks. Today we want to give you some insight into our infrastructure.
Techniques used in our Stack
Our stack uses the following software:
- Laravel Horizon - Queue workers
- Beats (let the beats roll :D)
- Google APIs
- Various Unix commands such as curl and dig
- Lots of SSH
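To make those checks concrete, here is a minimal sketch of what a single uptime check could look like, combining a DNS lookup (what dig does) with an HTTP request (what curl does). The function and field names are illustrative, not our actual code:

```python
import socket
import time
import urllib.error
import urllib.parse
import urllib.request

def check_website(url: str, timeout: float = 10.0) -> dict:
    """Run one uptime check: resolve DNS first, then fetch the URL over HTTP."""
    host = urllib.parse.urlsplit(url).hostname
    result = {"url": url, "status": "down", "http_code": None, "latency_ms": None}
    try:
        socket.gethostbyname(host)  # the 'dig' part: does the hostname resolve?
        start = time.monotonic()
        with urllib.request.urlopen(url, timeout=timeout) as resp:  # the 'curl' part
            result["http_code"] = resp.status
            result["latency_ms"] = round((time.monotonic() - start) * 1000, 1)
            if 200 <= resp.status < 400:
                result["status"] = "up"
    except (socket.gaierror, urllib.error.URLError, TimeoutError):
        pass  # any failure leaves the status as "down"
    return result
```

Every monitoring node runs checks like this on its own schedule and ships the raw result onwards, which is where the next sections pick up.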
Elasticsearch at the heart
Elasticsearch is known as the heart of the Elastic Stack: it is where all the data in the cluster is stored and analyzed. The same goes for UptimeMate. We installed and configured multiple Elasticsearch nodes that are highly available on both the hardware and the software level, connected over a 10 Gbit private LAN for the transport connection. Each node can communicate with every other node within a millisecond. To make the data scale at our proportions, we make heavy use of Index Lifecycle Management to roll indices over at certain points in time or size. When data gets older, it is automatically removed from the cluster. So, no need for custom data cleanup tasks.
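To give an idea, an ILM policy along the following lines (sent to Elasticsearch as `PUT _ilm/policy/<policy-name>`) rolls an index over once it gets too old or too big and deletes old indices automatically. The values shown are illustrative, not our production settings:

```json
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d", "max_size": "50gb" }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
```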
At the time of writing we use lightweight data shippers on every monitoring node. Some people call this approach the Hexagonal Architecture, but I just like to call it a single-purpose node. Every monitoring node is connected to one of our Elasticsearch nodes, so the data is available within a second after harvesting the raw data. This is where the real power of our infrastructure lies. A single node may fail, but it won’t impact the monitoring service in any way. If you get a downtime notice, the reason is simple: there are just more ‘down’ statuses than ‘up’.
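As an illustration of such a single-purpose shipper, a monitoring node running Heartbeat (one of the Beats) could be configured roughly like this; the hostnames and interval are made up for the example:

```yaml
# heartbeat.yml (sketch) -- one node, one job: check sites, ship results
heartbeat.monitors:
  - type: http
    id: example-site
    schedule: '@every 30s'
    urls: ["https://example.com"]

# Ship straight to one of the Elasticsearch nodes.
output.elasticsearch:
  hosts: ["es-node-1.internal:9200"]
```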
We have multiple worker nodes in our cluster running Laravel Horizon, connected to a Redis cluster. Each worker node checks all the data in our highly available Elasticsearch cluster for down locations for every website we monitor. When a worker fails, a new one is bootstrapped within minutes, and there are always multiple workers gathering the data.
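The decision the workers make can be sketched as a simple majority vote over the per-location statuses stored in Elasticsearch. This is our own illustrative Python, not the actual Horizon job:

```python
from collections import Counter

def site_is_down(location_statuses: dict[str, str]) -> bool:
    """Decide a site's overall state from its per-location check results.

    A site only counts as down when MORE THAN HALF of its monitoring
    locations report 'down' -- one flaky location never triggers an
    alert on its own.
    """
    counts = Counter(location_statuses.values())
    return counts["down"] > len(location_statuses) / 2

# Three locations, one reporting down: still considered up.
print(site_is_down({"ams": "up", "fra": "up", "nyc": "down"}))    # False
# Two of three locations down: alert.
print(site_is_down({"ams": "down", "fra": "down", "nyc": "up"}))  # True
```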
For every change in our database we trigger a command on the remote servers in our network, so they can process and handle their single-purpose job. When they are done, those nodes connect directly to our data lake.
UptimeMate Website / Application
Our website is not directly connected to our monitoring nodes. It ‘just’ uses our own API. When our website is (going) down, we also see our beautiful notifications being sent to our customers. So, in a certain way, we are using our own product.
But wait, “what was the problem and the solution for the customer?”, you ask.
We monitor all websites from 3 locations by default. In our logs and metrics, we saw that about a quarter of the pings had the status down. Since we only trigger a down status when more than half of the chosen locations are down for a particular website (and so do some of our competitors), we found this very odd, and in this case we reached out to our customer. Honestly, at first I thought it was a glitch on our end, but it soon turned out to be a correct notification.
The response from the customer was: “We hate our hosting at the moment.. during presentations our website hangs, and after a reload the website is fast and responsive again. What is the problem? The hosting provider said it was on our end, but all our logs from the webserver and our Laravel application are empty, and we are left in the dark. Do you have any idea?”.
Since it was a Laravel application, we gave it a shot. In the end, the problem was a session folder that had grown to ~80 GB. After clearing that folder, there was no more downtime. We then migrated the sessions to Redis, connected to all application servers.
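For reference, moving a Laravel application’s sessions from the file driver to Redis is mostly a configuration change along these lines; the Redis hostname here is hypothetical:

```ini
# .env -- switch sessions from local files to a shared Redis store
SESSION_DRIVER=redis
SESSION_CONNECTION=default

REDIS_HOST=redis.internal
REDIS_PORT=6379
```

With a shared session store, a bloated local session directory can no longer pile up on the web server.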
Do you have any questions about our stack in more detail, or about how we can help in your (custom) situation? Then send an e-mail to email@example.com. We won’t bite and are happy to help.