tl;dr

Performance data gathered by the Nagios check_mk agent can be transported to and stored in an ElasticSearch database, preserving the original time resolution that would otherwise be reduced by the round-robin database mechanism.

Problem

The term monitoring usually refers to a concept better described as availability monitoring, i.e. monitoring whether some service or resource provided by your computing systems is available to your users. Examples include checking whether some port is open or whether you have sufficient free space on your block devices, commonly known as hard disks.

For operations accustomed to rather constant use or, perhaps better phrased, a constant rate of service consumption, it is sufficient to monitor the state of the service you supply, and the case is closed.

Another kind of operations is accustomed to varying rates of service use or consumption. A typical example is some kind of web service (a website, a portal, a blog, you name it). The use of such a service typically varies over time, which necessitates scaling the machines supplying different parts of that web service up and down according to customer demand.

To be able to do so, it is first necessary to have at least some idea of the load your machines or services are currently subjected to, which in turn leads to the necessity of monitoring load.

Solution

Availability Monitoring

Various monitoring systems focussing on availability monitoring exist. Nagios and, more recently, Icinga and Shinken are well-known members of that type, and numerous others exist besides. Operations teams usually have some surveillance of that kind already in place; I myself and most of my colleagues tend to be rather picky about the availability of the services in our charge and prefer not to be told about outages by our bosses.

Load Monitoring

Other monitoring systems such as collectd or Cacti are geared toward measuring, reporting and graphing load. (The division between systems is not sharp, functionality usually overlaps and capabilities converge.)

Convergence Zone

Unless strictly necessary, I advise against operating (and maintaining) separate software systems for more or less the same purpose. Rising complexity is usually the death of maintainability.

Most availability monitoring systems already monitor system load so that they can alert the admin on call when load thresholds (CPU, memory, etc.) are reached. Some systems have evolved further and sport capabilities to store and graph time-dependent performance data. As a rather easy setup, I have decided to use check_mk multisite from OMD, the Open Monitoring Distribution.

Round-Robin Databases

The graphing capabilities of OMD are sufficient, but they lack high resolution for past monitoring data. This is due to the design of round-robin databases.

Round-robin databases are fixed in size to counter the problem of time-series databases growing boundlessly over time, which would eventually require enormous amounts of storage and at some point become very costly to search. Older time-series data is thus reduced in resolution to save space.
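The resolution loss can be illustrated with a toy consolidation step, assuming RRD-style averaging of several primary data points into one (the numbers here are made up for illustration):

```python
# Toy illustration of RRD-style consolidation: average every 5
# one-minute samples into a single five-minute data point.
samples = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
step = 5
consolidated = [sum(samples[i:i + step]) / step
                for i in range(0, len(samples), step)]
print(consolidated)  # [3.0, 8.0]
```

After consolidation, the individual one-minute values are gone for good; only the averages remain.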

Forking Data in an ElasticSearch Pipeline

I now have a case where I need high-resolution monitoring data, at a sampling rate of about 1/min, over reasonably long retention times in the area of several months.

Investigating the modus operandi of OMD's Check_MK multisite reveals that the storage and graphing plugin pnp4nagios is used to pipe performance data gathered by the check_mk agent into a round-robin database. Along the way, the data is spooled in a flat-file log format to /omd/sites/<the_site>/var/pnp4nagios/spool/perfdata.<unix_timestamp>, parsed, written to the round-robin database and then deleted.
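The spool format itself is a simple sequence of KEY::value pairs, which makes it straightforward to parse. A short sketch (the sample line below is hypothetical, modeled on the grok patterns used further down):

```python
import re

# Hypothetical spool line modeled on the pnp4nagios perfdata format;
# fields are KEY::value pairs separated by tabs.
line = ("DATATYPE::SERVICEPERFDATA\tTIMET::1400000000\t"
        "HOSTNAME::fw01\tSERVICEDESC::CPU load\t"
        "SERVICEPERFDATA::load1=0.42;;;0; load5=0.36;;;0; load15=0.30;;;0;\t"
        "SERVICECHECKCOMMAND::check_mk-cpu.loads")

# Split the line into KEY -> value pairs on the KEY:: markers.
fields = dict(re.findall(r"(\w+)::([^\t]*)", line))
print(fields["HOSTNAME"])  # fw01
print(fields["TIMET"])     # 1400000000
```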

Exploiting this fact, a Logstash instance on the monitoring machine can watch that directory/file pattern to discover new spool files as they appear.

input
{
  file
  {
    discover_interval => 5
    path => [ "/omd/sites/<...>/var/pnp4nagios/spool/perfdata.*" ]
    exclude => [ "*-PID-*" ] 
    sincedb_path => "/opt/logstash/etc/logstash/sincedbs"
    sincedb_write_interval => 5 
    start_position => "beginning"
    tags => [ "performance_monitoring" ]
    type => "performance_monitoring" 
  }
}

Operating thus, performance data can be grokked; here, performance data for CPU and memory usage as well as the current size of the TCP connection-tracking table is extracted on Linux firewall machines:

filter
{

  if [message] =~ "cpu.loads"
  {
    grok 
    {
      match => { "message" => "DATATYPE::%{WORD:datatype}%{SPACE}TIMET::%{INT:unixt}%{SPACE}HOSTNAME::%{HOSTNAME:hostname}%{SPACE}SERVICEDESC::%{GREEDYDATA:check_type}%{SPACE}SERVICEPERFDATA::load1=%{NUMBER:load1:float}.*load5=%{NUMBER:load5:float}.*load15=%{NUMBER:load15:float}.*SERVICECHECKCOMMAND::%{NOTSPACE:sourcecommand}%{GREEDYDATA}" }
      add_tag => [ "cpu_load" ]
    }
  }

  else if [message] =~ "kernel.util"
  {
    grok
    {
      match => { "message" => "DATATYPE::%{WORD:datatype}%{SPACE}TIMET::%{INT:unixt}%{SPACE}HOSTNAME::%{HOSTNAME:hostname}%{SPACE}SERVICEDESC::%{GREEDYDATA:check_type}SERVICEPERFDATA::user=%{NUMBER:user}.*system=%{NUMBER:system}.*wait=%{NUMBER:wait}.*SERVICECHECKCOMMAND::%{NOTSPACE:sourcecommand}%{GREEDYDATA}" }
      add_tag => [ "cpu_utilization" ]
    }
  }

  else if [message] =~ "Memory"
  {
    grok
    {
      match => { "message" => "DATATYPE::%{WORD:datatype}%{SPACE}TIMET::%{INT:unixt}%{SPACE}HOSTNAME::%{HOSTNAME:hostname}%{SPACE}SERVICEDESC::%{GREEDYDATA:check_type}SERVICEPERFDATA::ramused=%{NUMBER:ramused:float}.*swapused=%{NUMBER:swapused:float}.*memused=%{NUMBER:memused:float}.*mapped=%{NUMBER:mapped:float}.*committed_as=%{NUMBER:committed:float}.*pagetables=%{NUMBER:pagetables:float}.*shared=%{NUMBER:shared:float}.*SERVICECHECKCOMMAND::%{NOTSPACE:sourcecommand}%{GREEDYDATA}" }
      add_tag => [ "memory_used" ]
    }
  }

  else if [message] =~ "conntrack_count"
  {
    grok
    {
      match => { "message" => "DATATYPE::%{WORD:datatype}%{SPACE}TIMET::%{INT:unixt}%{SPACE}HOSTNAME::%{HOSTNAME:hostname}%{SPACE}SERVICEDESC::%{GREEDYDATA:check_type}SERVICEPERFDATA::net.netfilter.nf_conntrack_count=%{NUMBER:conntrack_counter:float}.*SERVICECHECKCOMMAND::%{NOTSPACE:sourcecommand}%{GREEDYDATA}" }
      add_tag => [ "conntrack_count" ]
    }
  }

  else
  {
    grok
    {
      match => { "message" => "DATATYPE::%{WORD:datatype}%{SPACE}TIMET::%{INT:unixt}%{SPACE}HOSTNAME::%{HOSTNAME:hostname}%{GREEDYDATA}" }
      add_tag => [ "catchall" ]
    }
  }

  date
  {
    add_tag => [ "timestamp from unixt" ]
    match => [ "unixt", "UNIX" ]
  }

}
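As a sanity check, the cpu.loads grok rule above can be approximated with a plain regular expression; the sample line here is hypothetical, built to match the pattern:

```python
import re

# Hypothetical sample line; the regex mirrors the grok rule for cpu.loads
# (\S+ for NOTSPACE, [\d.]+ for NUMBER, lazy .*? for GREEDYDATA).
line = ("DATATYPE::SERVICEPERFDATA TIMET::1400000000 HOSTNAME::fw01 "
        "SERVICEDESC::CPU load SERVICEPERFDATA::load1=0.42;;;0; "
        "load5=0.36;;;0; load15=0.30;;;0; "
        "SERVICECHECKCOMMAND::check_mk-cpu.loads")

pattern = (r"DATATYPE::(?P<datatype>\w+)\s+TIMET::(?P<unixt>\d+)\s+"
           r"HOSTNAME::(?P<hostname>\S+)\s+SERVICEDESC::(?P<check_type>.*?)"
           r"SERVICEPERFDATA::load1=(?P<load1>[\d.]+).*load5=(?P<load5>[\d.]+)"
           r".*load15=(?P<load15>[\d.]+).*SERVICECHECKCOMMAND::(?P<cmd>\S+)")

m = re.search(pattern, line)
print(m.group("load1"), m.group("load5"), m.group("load15"))  # 0.42 0.36 0.30
```

Feeding such sample lines through a quick script like this is a cheap way to debug grok patterns before deploying them.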

and written to ElasticSearch:

output
{
  elasticsearch
  {
    cluster => "<loggingcluster>"
    idle_flush_time => 10
    node_name => "<fqdn>"
    index => "monitoring-%{+YYYY.MM.dd}"
    protocol => "node"
  }
}
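The %{+YYYY.MM.dd} sprintf reference in the index name resolves against each event's timestamp (set by the date filter from the TIMET field), so events end up in one index per day. In Python terms, a sketch of what that amounts to (not Logstash internals):

```python
from datetime import datetime, timezone

# The date filter sets @timestamp from the TIMET unix epoch field;
# the daily index name is then formatted from that timestamp.
event_time = datetime.fromtimestamp(1400000000, tz=timezone.utc)
index = "monitoring-" + event_time.strftime("%Y.%m.%d")
print(index)  # monitoring-2014.05.13
```

Daily indices keep individual indices small and make it easy to drop old data wholesale once the retention period of several months is over.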

Using Kibana, it is rather easy to graph the time series stored in ElasticSearch. This solves the problem of load data being reduced in resolution over time, makes load data available for further analysis and allows load data to be correlated against e.g. logs to analyse the impact of service usage on system load.

No adaptation of the monitoring system already in place is necessary, and re-using proven logging methods ensures maintainability while allowing the setup to scale efficiently.