
Effective monitoring doesn't have to be complex

There are a lot of monitoring tools out there: log aggregators, probes, dashboards, graphing and alerting systems, and so on. To make things easier, we can usually divide monitoring into two areas: parsing logs, using something like the ELK stack, and probing systems for specific metrics, which is often done with a tool like Nagios.

Both of those tools are great, and I've used them along with many others in the past, but one thing I've noticed is how complex these setups tend to get. Deploying Nagios usually involves more than just installing the Nagios dashboard: you need NRPE probes, a backend database, Nconf, and agents on individual systems. Similarly, log collection only goes so far. As soon as you have a couple of servers and applications to monitor, you could be looking at gigabytes of logs coming in every day, so you need to not only collect the logs but also analyze, parse, and filter them, create graphs and alerts, and so on.

My experience with these tools tells me that as complexity is added, effectiveness is often reduced. You end up with hundreds, if not thousands, of different data points to keep track of, and soon enough you start getting false positives. Random parts of the architecture break because some library was updated somewhere, and no one fixes them for months while the alerts are simply ignored as a "known issue".

The way we do monitoring

In my view, monitoring has to be hyper-focused on two very precise things. First, you need to monitor important metrics. The word important is key, because I've seen an endless number of dashboards that show things like swap utilization, long database queries, failed authentication attempts, and so on. While each of these metrics could be useful, when you're monitoring hundreds of them, multiplied by however many apps you're taking care of, soon enough you're drowning in alerts.

Instead, what I really care about is this: is the web app up and responsive, and is it about to fail because of the most common types of issues (low disk space, CPU resources or memory)? Then, did last night's backup complete correctly, and have all system updates been applied? You wouldn't believe the number of servers I've seen in various environments where backups hadn't been processed in months, or security updates simply weren't being done, but you could be sure things like inodes were being monitored! This is what I would call a disconnect between technology and common sense.
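To make that concrete, here's roughly the kind of check I have in mind: a minimal Python sketch covering the three common failure modes (disk, CPU load, memory), with made-up thresholds you'd tune per environment. This is an illustration of the idea, not code from any particular tool:

```python
import os
import shutil

# Hypothetical thresholds -- tune these per environment.
DISK_MIN_FREE_PCT = 10    # alert if less than 10% disk space free
LOAD_MAX_PER_CPU = 2.0    # alert if 1-min load exceeds 2x CPU count
MEM_MIN_FREE_MB = 256     # alert if less than 256 MB available

def check_host(path="/"):
    """Return a list of alert strings for the most common failure modes."""
    alerts = []

    # Disk space on the given mount point.
    usage = shutil.disk_usage(path)
    free_pct = usage.free / usage.total * 100
    if free_pct < DISK_MIN_FREE_PCT:
        alerts.append(f"low disk: {free_pct:.1f}% free on {path}")

    # 1-minute load average relative to the CPU count (Unix only).
    load1, _, _ = os.getloadavg()
    cpus = os.cpu_count() or 1
    if load1 > LOAD_MAX_PER_CPU * cpus:
        alerts.append(f"high load: {load1:.2f} on {cpus} CPUs")

    # Available memory, from /proc/meminfo (Linux 3.14+, values in kB).
    with open("/proc/meminfo") as f:
        meminfo = dict(line.split(":", 1) for line in f)
    avail_mb = int(meminfo["MemAvailable"].strip().split()[0]) // 1024
    if avail_mb < MEM_MIN_FREE_MB:
        alerts.append(f"low memory: {avail_mb} MB available")

    return alerts
```

An empty list means the host is healthy; anything else is worth an email or SMS. The point is that a handful of checks like these catch the failures that actually take apps down.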

So in order to get these metrics, we use a simple Python-based app I created a few years back called Healthstone (pictured above), which runs a simple dashboard, with one agent on each host, and lets me know by email/SMS if something goes wrong. No database needed, no additional libraries or parsing software: a simple solution that never breaks.
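The agent/dashboard model itself is simple enough to sketch in a few lines. This is not Healthstone's actual protocol, just a hedged illustration of the pattern: each agent builds a small status payload and POSTs it to a dashboard, which can then flag any host that reports alerts or misses its check-in window. The endpoint URL here is made up:

```python
import json
import time
import urllib.request

# Hypothetical dashboard endpoint -- replace with your own.
DASHBOARD_URL = "http://dashboard.example.com/checkin"

def build_checkin(hostname, alerts):
    """Build the status payload an agent reports on each run."""
    return {"host": hostname, "alerts": alerts,
            "ok": not alerts, "ts": int(time.time())}

def send_checkin(payload):
    """POST the payload to the dashboard; the dashboard side flags
    hosts that report alerts or stop checking in entirely."""
    req = urllib.request.Request(
        DASHBOARD_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status == 200

# Typical agent loop: run the checks, then report, e.g.
#   send_checkin(build_checkin("web1", check_results))
```

Because the agent pushes a heartbeat on every run, a dead host shows up as a missed check-in rather than silence, which is exactly the failure mode you most want to catch.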

Then, on top of the metrics that alert us when something needs to be taken care of ASAP, we also collect logs in CloudWatch for later review. I'm not a big believer in automated log parsing. In my view, you spend far more time tweaking filters than the marginally useful data you get back is worth. Logs are there to diagnose a problem after the metrics told you about the problem. While it's nice to get alerted that your PHP app wrote a warning in its log, chances are none of us have time to investigate it when we have to focus on making sure the app is actually up and running.
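Shipping logs this way is straightforward: CloudWatch Logs accepts batches of events, each a timestamp in milliseconds plus a message. Here's a small sketch of the formatting step, with the actual upload shown as a comment since it needs AWS credentials and boto3 (the AWS SDK for Python); the log group and stream names are made up:

```python
import time

def to_cloudwatch_events(lines):
    """Convert raw log lines into the event format CloudWatch Logs
    expects: a list of {timestamp (ms), message} dicts."""
    now_ms = int(time.time() * 1000)
    return [{"timestamp": now_ms, "message": line.rstrip("\n")}
            for line in lines]

# With events built, shipping is one boto3 call (credentials assumed):
#
#   import boto3
#   logs = boto3.client("logs")
#   with open("/var/log/app.log") as f:
#       logs.put_log_events(logGroupName="/app/web",
#                           logStreamName="web1",
#                           logEvents=to_cloudwatch_events(f))
```

No parsing, no filters: the raw lines land in a stream per server, ready to be read when a metric alert tells you something is actually wrong.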

This is what the log streams look like for each server:
So far, this system is working great. We've had cloud servers stay up for years, fully patched and with backups that never failed once. When something does happen, whether it's an update gone wrong or a hardware failure, we're alerted very quickly, and the alert isn't lost in a sea of false positives. And because everything we deploy is infrastructure as code, all new deployments are automatically added to our dashboards. Peace of mind, and low administrative overhead.