memetic dot org

Building a usable alerting system

Posted on Jan. 20, 2013, under Cisco, Linux, Observium, PHP, SNMP

One of the most requested additions to Observium is, predictably, Nagios-style up/down alerting. Most requesters assume this is a simple addition, and on the face of it, it seems to be, but it isn’t!

Not only is it one of the most requested additions, it’s also the feature that we most want to add to Observium. Indeed, we originally began developing Observium because of our experiences with Nagios and Cacti. We’ve mostly managed to replace Cacti, but we haven’t come close to allowing people to replace Nagios.

In April last year I was looking after a network which was using Munin for its graphing and alerting. I quickly replaced it with the standard Observium + Icinga setup that I use, but the hassle of configuring the two systems pushed me to finally sit down and try to come up with a plan to add alerting to Observium. This is an attempt at documenting what I’ve come up with so far.

Uniquely Observium

First, a few basic assumptions about how an Observium-style alerting system should work:

  1. Use the existing Observium database for host and entity information (an entity is a port, a drive, a sensor, etc)
  2. Use the existing Observium pollers to collect metrics, no separate poller
  3. Follow the spirit of Observium’s automation ethos and require minimum configuration

These brought up a number of challenges, unique amongst alerting systems:

  1. No other alerting system treats different “types” of entity in the way we do. Most have a single list of entities that they check; we have a dozen different database tables in different formats. How do we store rules and thresholds for all of these? In a table per entity type, or in one central table?
  2. It would arguably be easier to create a separate poller, but that would cause performance problems. Reusing the existing pollers has a problem of its own, though: if a host is omitted from a polling cycle (for example, only pollers 0-2 are run when the system is set for 4 pollers), how do we detect that and alert on it?
  3. This is what makes the whole thing worth doing, and what makes it the most difficult. We need to know what to monitor and have sane defaults. We need to monitor everything someone would need to monitor, automatically and out of the box.
    • We need to have some method of easily defining general conditions that apply to an entire network of similar devices
    • We need to be able to override these general conditions both per-device and per-entity

I came up with a fairly simple plan based on passing the numbers we already collect via the various poller modules to a function which would check the values against a set of conditions. When a condition fails, an entry is inserted into a queue table which is scanned by a separate cron’d process, so that generating alerts and making external connections doesn’t slow down the poller process.
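
To make the split between the poller and the cron’d process concrete, here is a minimal sketch. It is not Observium code: it is Python rather than PHP, it uses an in-memory SQLite table as a stand-in for the queue table, and the names `alert_queue`, `check_and_queue` and `drain_queue` are invented for illustration. The single "greater than threshold" test is also an assumption made to keep the example short.

```python
import sqlite3

def check_and_queue(db, device, entity, metric, value, threshold):
    """Poller side: compare one value and, on failure, only write a queue row."""
    if value > threshold:
        db.execute(
            "INSERT INTO alert_queue (device, entity, metric, value) "
            "VALUES (?, ?, ?, ?)",
            (device, entity, metric, value),
        )

def drain_queue(db, notify):
    """Cron side: read queued failures, send them, then delete the sent rows."""
    rows = db.execute(
        "SELECT id, device, entity, metric, value FROM alert_queue ORDER BY id"
    ).fetchall()
    for row_id, device, entity, metric, value in rows:
        notify(f"{device} {entity}: {metric} = {value}")  # slow work happens here
        db.execute("DELETE FROM alert_queue WHERE id = ?", (row_id,))
    return len(rows)

# Hypothetical queue table; the real schema is not described in this post.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE alert_queue ("
    "id INTEGER PRIMARY KEY, device TEXT, entity TEXT, metric TEXT, value REAL)"
)

# The poller checks two ports; only the first one breaches the threshold.
check_and_queue(db, "router1", "Gi0/1", "in_errors_sec", 12.0, threshold=5.0)
check_and_queue(db, "router1", "Gi0/2", "in_errors_sec", 0.0, threshold=5.0)

# Later, the separate process delivers whatever has been queued.
sent = []
drained = drain_queue(db, sent.append)
```

The point of the design is that the poller only ever does a cheap local insert; anything that can block, such as sending mail or calling an external service, lives in `drain_queue`.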

Block diagram of the alerting system.

Defining the Schema

Alert conditions should be definable per-host, per-entity or globally with per-host and per-entity overrides. The table below shows a set of example global conditions. Per-host and per-entity conditions would also require device and entity columns.

An example global rule set.
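
One way the override behaviour could work is "most specific rule wins". The sketch below assumes that precedence (entity over device over global), which the plan above implies but does not spell out. It is illustrative Python, not Observium code, and the field names (`device`, `entity`, `entity_type`, `metric`, `op`, `value`) and the function `effective_conditions` are hypothetical.

```python
def effective_conditions(rules, device, entity):
    """Pick the most specific rule for each (entity_type, metric) pair."""
    chosen = {}
    for rule in rules:
        # Skip rules scoped to some other device or some other entity.
        if rule["device"] not in (None, device):
            continue
        if rule["entity"] not in (None, entity):
            continue
        # Specificity: 0 = global, 1 = per-device, 2 = per-entity.
        rank = (rule["device"] is not None) + (rule["entity"] is not None)
        key = (rule["entity_type"], rule["metric"])
        if key not in chosen or rank > chosen[key][0]:
            chosen[key] = (rank, rule)
    return {key: rule for key, (rank, rule) in chosen.items()}

# A global default plus two overrides; all names and thresholds are made up.
rules = [
    {"device": None, "entity": None, "entity_type": "storage",
     "metric": "percent_free", "op": "<", "value": 10},
    {"device": "nas1", "entity": None, "entity_type": "storage",
     "metric": "percent_free", "op": "<", "value": 20},
    {"device": "nas1", "entity": "/var", "entity_type": "storage",
     "metric": "percent_free", "op": "<", "value": 5},
]

key = ("storage", "percent_free")
var_rule = effective_conditions(rules, "nas1", "/var")[key]    # per-entity rule
home_rule = effective_conditions(rules, "nas1", "/home")[key]  # per-device rule
web_rule = effective_conditions(rules, "web1", "/")[key]       # global rule
```

With this shape a network of similar devices needs only the global row, and the device and entity columns are filled in only where an exception is wanted.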


The plan is that each poller module calls an alert processing function for each “entity” that it polls, passing an array of values and properties of that entity. The alert processing function builds a list of checks it needs to do for that entity. It then runs through this list of checks, using the array of values it received from the poller. This allows arbitrary metrics to be passed from poller modules without needing any code to be written to handle them in the alert processor. Conditions marked as ‘mandatory’ would be triggered if the metric isn’t present in the passed array.

A basic block diagram of the check definition tables
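
A rough sketch of such an alert processing function follows, in Python rather than Observium’s PHP. The function name `process_alerts`, the check fields, and the rule that a check fires when its comparison is true (for example `percent_free < 10`) are all assumptions for illustration. Only two comparison operators are wired in to keep it short.

```python
import operator

# Only two operators here; the full set of condition types is listed further on.
OPS = {">": operator.gt, "<": operator.lt}

def process_alerts(entity_type, values, checks):
    """Run every check that applies to this entity type against the
    values the poller module passed in, and return the failures."""
    failures = []
    for check in checks:
        if check["entity_type"] != entity_type:
            continue  # not part of this entity's check list
        metric = check["metric"]
        if metric not in values:
            # A missing metric only matters when the check is mandatory.
            if check.get("mandatory"):
                failures.append((metric, "missing"))
            continue
        if OPS[check["op"]](values[metric], check["value"]):
            failures.append((metric, values[metric]))
    return failures

# Hypothetical checks; thresholds are invented.
checks = [
    {"entity_type": "storage", "metric": "percent_free",
     "op": "<", "value": 10, "mandatory": True},
    {"entity_type": "storage", "metric": "inodes_free",
     "op": "<", "value": 1000, "mandatory": False},
    {"entity_type": "port", "metric": "in_errors_sec",
     "op": ">", "value": 5, "mandatory": False},
]

low_disk = process_alerts("storage", {"percent_free": 4}, checks)
no_metric = process_alerts("storage", {"bytes_free": 123}, checks)
healthy = process_alerts("storage", {"percent_free": 50, "inodes_free": 5000}, checks)
```

Because the function looks metrics up by name in whatever array it is handed, a poller module can start passing a new metric without any change to the processor; it only takes a new row in the checks.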

Passing the data

Below is an example of some metrics that might be passed to the alert processing function by some poller modules:

  • Port
    • Bits/sec in/out
    • Bits/sec in/out as percentage of ifSpeed
    • Errors/sec in/out
    • Unicast/nonunicast/broadcast packets in/out
    • ADSL SNR/noise margin/sync speed
    • MTU
    • ifSpeed
    • Duplex
    • Promiscuous
  • Storage 
    • Bytes free
    • Percentage free
    • Inodes free
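
Some of these metrics are derived rather than read directly. As an illustration, the bits/sec and percentage-of-ifSpeed values for a port can be computed from two samples of the interface octet counters. The sketch below is Python, not Observium code; the function name `port_metrics` and the dictionary keys are hypothetical, and it ignores counter wrap and resets for brevity.

```python
def port_metrics(prev, curr, interval_s, if_speed_bps):
    """Turn two octet-counter samples into rates and utilisation.

    prev/curr hold cumulative octet counts; interval_s is the time
    between the two samples in seconds; if_speed_bps is ifSpeed.
    """
    in_bps = (curr["in_octets"] - prev["in_octets"]) * 8 / interval_s
    out_bps = (curr["out_octets"] - prev["out_octets"]) * 8 / interval_s
    return {
        "in_bps": in_bps,
        "out_bps": out_bps,
        "in_pct": 100 * in_bps / if_speed_bps,
        "out_pct": 100 * out_bps / if_speed_bps,
        "ifSpeed": if_speed_bps,
    }

# Two samples taken 300 seconds apart on a 100 Mbit/s port (made-up numbers).
prev = {"in_octets": 0, "out_octets": 0}
curr = {"in_octets": 37_500_000, "out_octets": 375_000_000}
metrics = port_metrics(prev, curr, interval_s=300, if_speed_bps=100_000_000)
```

Here the port moved 37.5 MB in and 375 MB out over five minutes, which works out to 1 Mbit/s (1% of ifSpeed) inbound and 10 Mbit/s (10%) outbound.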

Below is an example of possible condition types:

  • preg_match, !preg_match
  • >, <, =, !=
  • str_match, !str_match
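
These condition types could be dispatched from a small lookup table. The sketch below is Python, not Observium code, and makes two assumptions that the list above does not settle: `str_match` is treated as a plain substring test, and `preg_match` is approximated with Python’s `re.search`, so patterns are written without the delimiters PHP’s `preg_match` expects. The name `evaluate` is hypothetical.

```python
import operator
import re

COMPARATORS = {
    ">": operator.gt,
    "<": operator.lt,
    "=": operator.eq,
    "!=": operator.ne,
    # Assumption: regular expression search anywhere in the value.
    "preg_match": lambda actual, pattern: re.search(pattern, str(actual)) is not None,
    # Assumption: simple substring test.
    "str_match": lambda actual, needle: str(needle) in str(actual),
}

def evaluate(op, actual, expected):
    """Return True when the condition `actual <op> expected` holds."""
    # "!preg_match" and "!str_match" are the negated forms of a known type;
    # "!=" is looked up directly because it is its own table entry.
    if op not in COMPARATORS and op.startswith("!"):
        return not COMPARATORS[op[1:]](actual, expected)
    return COMPARATORS[op](actual, expected)

results = [
    evaluate(">", 95, 90),
    evaluate("!=", "full", "half"),
    evaluate("preg_match", "GigabitEthernet0/1", r"^Gigabit"),
    evaluate("!preg_match", "GigabitEthernet0/1", r"^Gigabit"),
    evaluate("str_match", "uplink to core", "core"),
    evaluate("!str_match", "uplink to core", "backup"),
]
```

An unknown condition type raises `KeyError` here rather than silently passing, which seems the safer default for something that decides whether an alert is sent.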

Generating the alerts

 


4 Comments for this entry

  • Nic B

    I’ve been using Observium for the past 2-3 months. Really love it and it’s proved useful for graphing and I’ve even turned on email alerts(after ignoring most all ports). Can’t wait for alerts so I can pass data directly to spiceworks!

  • Rob

    I’ve been using Observium now for 2 days and I was simply blown away by the simplicity!

    An (more advanced) alerting system would be awesome!

  • Dude

    Observium is a great Monitoring tool. I can’t wait for release the alerting system. Great work and thank you really much.

    Greetz from germany ;)

  • monx

    Observium is my new favorite monitoring system! I would be very happy to get an alerting system! Is there any news to this topic? By the way, dependencies on the alerting would also be nice to avoid allot of not necessary alerts.
