One of the most requested additions to Observium is, predictably, Nagios-style up/down alerting. Most requesters assume this is a simple addition, and on the face of it, it seems to be, but it isn’t!
Not only is it one of the most requested additions, it’s also the feature that we most want to add to Observium. Indeed, we originally began developing Observium because of our experiences with Nagios and Cacti. We’ve mostly managed to replace Cacti, but we haven’t come close to allowing people to replace Nagios.
In April last year I was looking after a network that was using Munin for its graphing and alerting. I quickly replaced those with the standard Observium + Icinga setup that I use, but the hassle of configuring the two systems pushed me to finally sit down and try to come up with a plan to add alerting to Observium. This is an attempt at documenting what I’ve come up with so far.
Firstly, a few basic assumptions about how an Observium-style alerting system should work:
- Use the existing Observium database for host and entity information (an entity being a port, a drive, a sensor, etc.)
- Use the existing Observium pollers to collect metrics; no separate poller
- Follow the spirit of Observium’s automation ethos and require minimal configuration
These assumptions raise a number of challenges that are unique amongst alerting systems:
- No other alerting system treats different “types” of entity the way we do. Most have a single list of entities to check; we have a dozen different database tables in different formats. How do we store rules and thresholds for all of these: in per-type tables, or in one central table?
- It would arguably be easier to create a separate alerting poller, but that would cause performance problems. Reusing the existing pollers raises its own question, though: if a host is omitted from a polling cycle (for example, because only pollers 0-2 ran when the system is configured for 4), how do we detect and alert on that? One possible approach is sketched after this list.
- This is what makes the whole thing worth doing, and also what makes it the most difficult. We need to know what to monitor and to have sane defaults: everything someone would need to monitor should be monitored automatically, out of the box.
- We need some method of easily defining general conditions that apply across an entire network of similar devices
- We need to be able to override these general conditions both per-device and per-entity (one possible precedence scheme is sketched below)
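As a rough illustration of the skipped-host problem mentioned above, detection could be as simple as comparing each device’s last-polled timestamp against the polling interval. This is only a sketch: dbFetchRows() is assumed to behave like Observium’s existing database helpers, and alert_raise() and the column names are hypothetical:

```php
<?php
// Flag any enabled device that hasn't been polled within two polling
// intervals; such a device was presumably skipped by the pollers.
$poller_interval = 300; // seconds between polling runs (assumed setting)
$cutoff = date('Y-m-d H:i:s', time() - $poller_interval * 2);

$stale = dbFetchRows(
  "SELECT `device_id`, `hostname` FROM `devices`
   WHERE `disabled` = 0 AND `last_polled` < ?",
  array($cutoff)
);

foreach ($stale as $device) {
  // alert_raise() is a hypothetical function for generating an alert.
  alert_raise($device['device_id'], 'device_unpolled',
              'Device missed its last polling cycle');
}
```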
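Similarly, per-device and per-entity overrides suggest a “most specific match wins” lookup. The alert_thresholds table below is pure assumption, used only to illustrate the precedence order:

```php
<?php
// Resolve the threshold for a metric: a per-entity row beats a per-device
// row, which beats a network-wide default (NULL device_id and entity_id).
function threshold_for($metric, $device_id, $entity_id)
{
  $scopes = array(
    array("`device_id` = ? AND `entity_id` = ?",        array($metric, $device_id, $entity_id)),
    array("`device_id` = ? AND `entity_id` IS NULL",    array($metric, $device_id)),
    array("`device_id` IS NULL AND `entity_id` IS NULL", array($metric)),
  );

  foreach ($scopes as $scope) {
    $row = dbFetchRow("SELECT * FROM `alert_thresholds`
                       WHERE `metric` = ? AND " . $scope[0], $scope[1]);
    if ($row) { return $row; } // first (most specific) match wins
  }
  return NULL; // no threshold defined for this metric
}
```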
Defining the Schema
The plan is that each poller module calls an alert processing function for each “entity” it polls, passing an array of that entity’s values and properties. The alert processing function builds a list of checks to run for that entity, then works through the list using the array of values it received from the poller. This allows arbitrary metrics to be passed from poller modules without any code needing to be written to handle them in the alert processor. Conditions marked as ‘mandatory’ would be triggered if their metric isn’t present in the passed array.
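As a rough sketch of that flow (every function name and field here is a placeholder, not a final design):

```php
<?php
// A minimal sketch of the per-entity alert processor described above.
// get_checks_for(), alert_trigger() and test_condition() are hypothetical
// helpers, and the check fields are assumptions.

function check_entity($entity_type, $entity, $metrics)
{
  // Gather the checks that apply to this entity: global rules plus any
  // per-device or per-entity overrides.
  $checks = get_checks_for($entity_type, $entity);

  foreach ($checks as $check) {
    if (!isset($metrics[$check['metric']])) {
      // A 'mandatory' condition triggers if its metric was never passed in.
      if ($check['mandatory']) {
        alert_trigger($entity, $check, 'mandatory metric missing');
      }
      continue;
    }

    // Evaluate the condition against the polled value (condition types are
    // listed further down).
    if (test_condition($check['condition'], $metrics[$check['metric']], $check['value'])) {
      alert_trigger($entity, $check, 'condition matched');
    }
  }
}
```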
Passing the data
Below are some examples of metrics that might be passed to the alert processing function by various poller modules:
- Bits/sec in/out
- Bits/sec in/out as percentage of ifSpeed
- Errors/sec in/out
- Unicast/nonunicast/broadcast packets in/out
- ADSL SNR/noise margin/sync speed
- Bytes free
- Percentage free
- Inodes free
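To make that concrete, here is roughly what the ports poller module might hand to the alert processor for a single port. The key names and check_entity() itself are illustrative assumptions:

```php
<?php
// A hypothetical metrics array for one port entity; real key names may differ.
$metrics = array(
  'ifInOctets_rate'  => 90000000, // bits/sec in (90% of a 100 Mbit ifSpeed)
  'ifOutOctets_rate' => 10000000, // bits/sec out
  'ifInOctets_perc'  => 90,       // in as a percentage of ifSpeed
  'ifOutOctets_perc' => 10,       // out as a percentage of ifSpeed
  'ifInErrors_rate'  => 3,        // errors/sec in
  'ifOutErrors_rate' => 0,        // errors/sec out
);

// The ports poller would call the alert processor once per port:
check_entity('port', $port, $metrics);
```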
Below is an example of possible condition types:
- preg_match, !preg_match
- >, <, =, !=
- str_match, !str_match
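Evaluating these could be a simple dispatch on the condition type, as in the sketch below. Here str_match is assumed to be a case-insensitive substring test, which may not be the final semantics:

```php
<?php
// A minimal sketch of condition evaluation for the types listed above.
function test_condition($condition, $value, $expected)
{
  switch ($condition) {
    case '>':           return $value >  $expected;
    case '<':           return $value <  $expected;
    case '=':           return $value == $expected; // loose compare: values
    case '!=':          return $value != $expected; // may arrive as strings
    case 'preg_match':  return preg_match($expected, $value) === 1;
    case '!preg_match': return preg_match($expected, $value) !== 1;
    case 'str_match':   return stripos($value, $expected) !== FALSE;
    case '!str_match':  return stripos($value, $expected) === FALSE;
    default:            return FALSE; // unknown condition type
  }
}
```

With something like this in place, a rule such as “inbound utilisation above 85%” reduces to test_condition('>', $metrics['ifInOctets_perc'], 85).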
Generating the alerts