Showing posts with label Nagios. Show all posts

Wednesday, January 6, 2016

Part I. Stackstorm: from Nagios integration to Openstack automation

Recently I've been playing around with Stackstorm - a platform for integration and automation of day-to-day tasks, monitoring events, existing scripts and deployment tools.

I am going to explain how easy it is to wrap your daily tasks into Stackstorm actions and workflows and how to provide a simple way of execution for complex tasks.

Initially I was going to cover everything in one blog post, but after thinking about it for a while that didn't seem like a good idea. I'd rather break it into multiple posts, no matter how many there end up being. That should make it more practical and easier to read and understand.

Nagios

Let's start with integrating Nagios alerts into Stackstorm. I assume that you already have Stackstorm version 1.2.0+ installed, configured and running, and that Nagios is running somewhere else and is capable of processing alerts and handling events. If not, please head to www.stackstorm.com and www.nagios.org first (installation and initial configuration of these two tools are beyond the scope of this post). Both tools are open source, free to use, and have extensive documentation on installation and basic configuration.

First of all, I was happy to find an existing Nagios integration pack at https://github.com/StackStorm/st2contrib/tree/master/packs/nagios, but my joy ended really quickly when I found that it wasn't working. Secondly, this blog post helped me to get started: https://stackstorm.com/2015/10/05/auto-remediation-out-of-disk-space/

Although it talks about Sensu and VictorOps (monitoring and paging tools), you can easily apply the same logic to Nagios. With help from the Stackstorm support team (they are really cool guys and they hang out on Slack, see https://stackstorm.com/community/ for details) I was able to patch the st2 service handler Python script to make it work (see the diff here: https://www.diffchecker.com/vzsiskbg). Don't worry, later I'll post a link to the GitHub repository with all the files you need.

Also, don't forget to apply for a trial of the Enterprise Stackstorm edition! You will get a very cool Flow Visual Editor that lets you create really nice workflows by dragging things around with your mouse! I highly recommend at least trying it. It won't hurt, I promise.

Deploying the pack

Let's go ahead and deploy our example nagios pack (the repository itself is located at https://github.com/emptywee/e_nagios):

st2 run packs.install packs=e_nagios register=all repo_url=https://github.com/emptywee/e_nagios.git

Don't worry if reloading the rules throws exceptions about non-existing triggers. That's all right at this point: the trigger will be created once you run the st2 service handler script from the Nagios server. This assumes that the Nagios server can connect to your Stackstorm server over ports 9100 and 9101 and that the username-password pair is correct. (The latest Stackstorm version also supports accessing the auth and API endpoints on port 443, but I haven't tried this approach yet.) In practice you shouldn't see these exceptions at all, since all rules in the pack are disabled by default. Let's go ahead and enable the nagios_service_chat.yaml rule. Simply edit it with your favorite editor in /opt/stackstorm/packs/e_nagios/rules/ and switch enabled to true.

Adjusting rules

Let's take a brief look at the rule itself (nagios_service_chat.yaml):

---
name: notify_chat
pack: e_nagios
description: Post to chat when nagios service state changes
enabled: true
trigger:
  type: e_nagios.service_state_change
criteria:
  trigger.attempt:
    pattern: "2"
    type: gt
action:
  ref: chatops.post_message
  parameters:
    message: NAGIOS {{trigger.service}} ID:{{trigger.event_id}} STATE:{{trigger.state}}[{{ trigger.state_id }}]/{{trigger.state_type}}
      {{trigger.msg}}
    channel: '563b5f7f21f7a36d7bd5baaf'

trigger: - the name of the trigger that makes this rule fire the action when the criteria are met. In our case we post a message to our chatops (be it Slack, Lets-chat, HipChat or any other chat supported by Hubot). Plugging in chatops has become a simple task since Stackstorm v1.2.0 was released. So make sure you have chatops up and running to fully utilize all the features, comfort, flexibility, and other bells and whistles of the Stackstorm platform.

criteria: - the conditions under which the action is executed. Only AND logic is supported so far, so all of them have to match. In this example the rule matches when the check attempt number is greater than 2, which means the bot reports to chatops whenever any service or host goes into a HARD state (usually after 3 consecutive checks with the same result).

action: - what is executed when the criteria are met. In our case we just post a message to a specific channel. If you are using Slack, the channel would be a more readable and meaningful name. Don't hesitate to alter it to your needs.
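To make the AND logic concrete, here is a minimal Python sketch of how a set of criteria could be checked against a trigger payload. It is not Stackstorm's rule engine: the function name criteria_met and the OPERATORS table are made up for illustration, and only the gt and matchregex operators used in this post are modeled. Payload values arrive from the handler script as strings, so the sketch converts both sides to numbers for gt.

```python
import re

# Hypothetical operator table; only the two operators used in this post.
OPERATORS = {
    "gt": lambda value, pattern: float(value) > float(pattern),
    "matchregex": lambda value, pattern: re.search(pattern, str(value)) is not None,
}


def criteria_met(criteria, payload):
    """Return True only if every criterion matches (AND logic)."""
    for key, rule in criteria.items():
        # Rule keys look like "trigger.attempt"; drop the "trigger." prefix.
        field = key.split(".", 1)[1]
        if field not in payload:
            return False
        if not OPERATORS[rule["type"]](payload[field], rule["pattern"]):
            return False
    return True


criteria = {"trigger.attempt": {"pattern": "2", "type": "gt"}}
print(criteria_met(criteria, {"attempt": "3"}))  # True
print(criteria_met(criteria, {"attempt": "1"}))  # False
```

A single failing criterion is enough to keep the action from firing, which is all that "AND logic" means here.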

Don't forget to reload the rules:

st2ctl reload --register-rules

There's another way to temporarily enable rules:

st2 rule enable e_nagios.notify_chat

But the rule will be disabled again after the next reload of the rules unless you modify the respective YAML file with the rule definition (the so-called meta-data).

Setting up Nagios

If you take a look at the st2 service handler script, you will notice that it is you who decides what to pass from Nagios to Stackstorm rules: it's up to you what goes into the payload that the script sends to Stackstorm. All these trigger.service, trigger.msg and the like are just parts of a payload that is formed on the Nagios host. With the many macros available in Nagios, you can choose those that fit your needs. Here's a link to the standard Nagios macros: https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/macrolist.html

Here's what you should do on the Nagios host. First of all, upload the st2service_handler.py script and st2service_handler.conf to the Nagios host and place them wherever you like; just make sure that Nagios can execute the script from that location, and make the script executable. Secondly, define a command that passes the macros you picked from the link just above. In my case I uploaded the script into the /opt/nagios/libexec/ directory and have Nagios set up with NRPE, so I define it in the master nrpe.cfg file:

command[st2nagios]=/opt/nagios/libexec/st2service_handler.py /opt/nagios/libexec/st2service_handler.conf $SERVICEEVENTID$ "$SERVICEDESC$" $SERVICESTATE$ $SERVICESTATEID$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $HOSTNAME$

Then apply this command to global_service_event_handler in nagios.cfg:

global_service_event_handler=st2nagios

That is it! We are almost done with the Nagios part. It's a good idea to run the command manually as the nagios user first:

$ whoami
nagios
$ /opt/nagios/libexec/st2service_handler.py /opt/nagios/libexec/st2service_handler.conf 123456 "Disk /var/log" WARNING 1 HARD 3 remote_host_name
Registered trigger type with st2.
POST: url: https://st2.example.com:9101/webhooks/st2/, body: {'trigger': 'e_nagios.service_state_change', 'payload': {'attempt': '3', 'service': 'Disk /var/log', 'event_id': '123456', 'state': 'WARNING', 'state_type': 'HARD', 'host': 'remote_host_name', 'msg': '[WARNING] Service/Host warning alert!', 'state_id': '1'}}
Sent nagios event to st2. HTTP_CODE: 202

This will register a trigger. After that you can safely reload rules that rely on the e_nagios.service_state_change trigger. Also, running it manually will let you test your rules and actions without actually forcing Nagios to generate real alerts. That is a good thing, isn't it?

So, in short, we are basically filling in our own payload and passing it from Nagios to Stackstorm. Just make sure that st2service_handler.py script has all the fields defined and in the correct order if you are about to add or remove Nagios macros from the event command.

The relevant part about it in st2service_handler.py is:

def _get_payload(host, service, event_id, state, state_id, state_type, attempt):
    payload = {}
    payload['host'] = host
    payload['service'] = service
    payload['event_id'] = event_id
    payload['state'] = state
    payload['state_id'] = state_id
    payload['state_type'] = state_type
    payload['attempt'] = attempt
    payload['msg'] = STATE_MESSAGE.get(state, 'Undefined state.')
    return payload


def main(args):
    event_id = args[1]
    service = args[2]
    state = args[3]
    state_id = args[4]
    state_type = args[5]
    attempt = args[6]
    host = args[7]
    body = {}
    body['trigger'] = ST2_TRIGGERTYPE_REF
    body['payload'] = _get_payload(host, service, event_id, state, state_id, state_type, attempt)
    _post_event_to_st2(_get_st2_webhooks_url(), body)

The order is defined by the index into args inside the main function, so it has to match the order of the macros in the Nagios command definition. This is very important.
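One way to make that ordering harder to get wrong is to keep it in a single place. The sketch below is not part of the pack: FIELD_ORDER and parse_event_args are hypothetical names, and it assumes the list passed in has the same shape as args in main above, i.e. the config file path at index 0 followed by the event fields.

```python
# Field order must mirror the macro order in the Nagios command definition.
FIELD_ORDER = ("event_id", "service", "state", "state_id",
               "state_type", "attempt", "host")


def parse_event_args(args):
    """Map the positional event fields (after the config path) to names."""
    values = list(args[1:])
    if len(values) != len(FIELD_ORDER):
        raise ValueError(
            "expected %d event fields, got %d" % (len(FIELD_ORDER), len(values))
        )
    return dict(zip(FIELD_ORDER, values))


example_args = [
    "/opt/nagios/libexec/st2service_handler.conf",
    "123456", "Disk /var/log", "WARNING", "1", "HARD", "3", "remote_host_name",
]
fields = parse_event_args(example_args)
print(fields["service"], fields["host"])
```

With this layout, adding or removing a macro means editing FIELD_ORDER only, and a mismatch between Nagios and the script fails loudly instead of silently shifting fields.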

Verifying in chatops

The best way to verify that it's working is to run the command manually from the Nagios host. You should see your bot reporting to the channel set in the rule.

Simple, eh? Now with what we have achieved so far, we can move forward and enhance our alert handling service.

Enhancing alert handling

Although it's very exciting to receive alerts in the chat room, it doesn't make you much happier than you already are, and it certainly doesn't relieve you from manually checking what exactly triggered the alert and taking remedial action.

Your use cases and real-world scenarios might be slightly different, but disk space auto-remediation is a common task that everyone runs into during day-to-day operations. You don't really want to be woken up by a call early in the morning just to log in remotely and clean up some log files that filled up the whole disk. So it makes a really good example to deal with.

That's where the nagios_service_disk.yaml rule comes in handy. Let's take a look at it:

---
name: check_disk
pack: e_nagios
description: Check disk usage and trigger remediation
enabled: true
trigger:
  type: e_nagios.service_state_change
criteria:
  trigger.service:
    pattern: "^Disk"
    type: matchregex
  trigger.state_type:
    pattern: "HARD"
    type: matchregex
  trigger.state_id:
    pattern: "0"
    type: gt
action:
  ref: e_nagios.remediate_disk_workflow
  parameters:
    hostname: "{{ trigger.host }}"
    directory: "{{ trigger.service | regex_replace('^Disk\\s*', '') }}" 

Pretty similar to the one above that just posts messages to the chat room, right? Don't forget to enable it, as it comes disabled by default. Although we use the same trigger, the criteria here are slightly different. We match the service description against a regex pattern, explicitly match the HARD state type of the alert, and catch only state IDs greater than zero (in Nagios 0 means OK, which is what a recovery reports, 1 means WARNING and 2 means CRITICAL). We do not want to automatically fire an action on recovery alerts, right? At least in this case. One more thing is important here, and I once spent a lot of time figuring out why my rule didn't work properly: you really should wrap your patterns in double quotes when you define your criteria, even if the pattern is an integer (see the trigger.state_id criterion as an example).

That being said, once we receive an alert that matches the criteria, Stackstorm executes the defined action and does some magic with the parameters we pass to it. The action is called e_nagios.remediate_disk_workflow and is defined under the actions/ directory of the pack. We pass the hostname that triggered the alert and strip the Disk part out of the service description, leaving only the directory of the mounted partition itself (this assumes that the disk monitoring service has an appropriate service description in the Nagios config; don't hesitate to adjust it to your own environment). Yes, Stackstorm supports Jinja2 filters in rule definitions when you pass parameters to actions!
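If you want to check what the directory parameter will look like before wiring up the rule, the same substitution can be tried in plain Python. This is only a sketch that assumes the regex_replace filter behaves like re.sub for this pattern; directory_from_service is a made-up helper name.

```python
import re


def directory_from_service(service_description):
    """Strip the leading "Disk" label, leaving only the mount point."""
    return re.sub(r"^Disk\s*", "", service_description)


print(directory_from_service("Disk /var/log"))  # /var/log
print(directory_from_service("Disk /"))         # /
```

If your Nagios service descriptions follow a different naming scheme, this is the pattern to adjust.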

Disk Space Remediation Action

It's time to design our auto-remediation action! Here's how it looks (and it does really look nice) in the Visual Flow tool that comes in the Enterprise Stackstorm edition:

The workflow itself is pretty simple and consists of the following steps:

  1. Report in the chatops that we received the task to check the disk space;
  2. Run the disk check action that confirms the alert from Nagios;
  3. If the disk usage is above the defined threshold, run an auto-remediation action, else report in the chat room that it was a false positive alert from Nagios;
  4. If the auto-remediation action completes with no errors try to check the disk space usage again, else report about the error in the chat room;
  5. If the disk space usage comes below the defined threshold assume that auto-remediation succeeded and report about it in the chatops, else report in the chat room that the auto-remediation failed.
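The branching in these steps can be sketched as a plain Python function. It only illustrates the control flow and is not how Stackstorm runs workflows; remediate_disk and its three callables are hypothetical stand-ins for the chat, check and remediation actions.

```python
def remediate_disk(check, remediate, notify):
    """Walk through the five steps above.

    check() returns True when disk usage is below the threshold,
    remediate() returns True when the clean-up finished without errors,
    notify(text) posts a message to the chat room.
    """
    notify("working on the disk space issue")           # step 1
    if check():                                         # steps 2 and 3
        notify("false positive alert")
        return "false_positive"
    if not remediate():                                 # step 4
        notify("auto-remediation failed")
        return "failed"
    if check():                                         # step 5
        notify("disk space is back below the threshold")
        return "remediated"
    notify("auto-remediation failed")
    return "failed"


# Dry run with stubs: the first check fails, the second one passes.
checks = iter([False, True])
messages = []
result = remediate_disk(lambda: next(checks), lambda: True, messages.append)
print(result)  # remediated
```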

Suffice it to say that reporting to the channel can be replaced or extended with reporting via email or any other means of paging you and asking for manual intervention. The good thing here is that the scripts that do all the work can be written in any language you like; just make sure they can be executed remotely on the host that is being checked.

Before we can design our workflows in the Visual Flow, we need to define two basic actions for our needs:

  1. check_dir_size
  2. disk_remediate

The meta-data for the check_dir_size action is defined in YAML and looks like this:

---
description: 'Check the total percentage of disk taken up by a specified directory'
enabled: true
entry_point: check_dir_size.py
name: check_dir_size
parameters:
  action:
    description: "Run as an action.  (Outputs structured data)"
    default: true
    immutable: true
    type: boolean
  directory:
    description: "The directory to check"
    required: true
    type: string
  threshold:
    description: "Maximum percentage of disk space that can be consumed by the directory."
    default: 80
    type: integer
  debug:
    description: "Turn on debug output"
    default: false
    type: boolean
  sudo:
    default: true
    immutable: true
runner_type: remote-shell-script

Three things to pay attention to: entry_point, parameters and runner_type. Since it uses the remote-shell-script runner, there's an implied hosts parameter that this action requires (see https://docs.stackstorm.com/runners.html#remote-script-runner-remote-shell-script for details, for instance in case you need to provide password authentication). entry_point is the name of the script, which should reside in the same directory. parameters declares all the parameters that are passed to the script, their types and other options. As homework you may want to translate the alert level (Warning or Critical) coming from Nagios into a threshold level for the script. That should be done in the workflow depicted earlier, when we talked about the Visual Flow tool.
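The check script itself is not listed in this post, so here is a rough Python sketch of what such a script might do. It is not the pack's check_dir_size.py. It makes two assumptions: parameters arrive as --name=value options (as the GetOptions call in the Perl script later in this post suggests), and, as a simplification, it measures the usage of the whole filesystem that holds the directory rather than the size of the directory itself. A non-zero exit code is what lets the workflow take its on-error branch.

```python
import argparse
import shutil


def usage_percent(directory):
    """Percentage of the filesystem holding `directory` that is in use."""
    usage = shutil.disk_usage(directory)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def exit_code_for(percent, threshold):
    """0 when usage is within the threshold, 1 when it is exceeded."""
    return 1 if percent > threshold else 0


def main(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--directory", required=True)
    parser.add_argument("--threshold", type=int, default=80)
    # Ignore any other options that may be passed along.
    options, _unknown = parser.parse_known_args(argv)
    percent = usage_percent(options.directory)
    print("%s is at %.1f%% (threshold %d%%)"
          % (options.directory, percent, options.threshold))
    return exit_code_for(percent, options.threshold)


# Example run against the current directory; a real script would end with
# sys.exit(main(sys.argv[1:])) instead.
code = main(["--directory=.", "--threshold=100"])
print("exit code:", code)
```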

And the most interesting action is the disk_remediate action. Let's take a look at the meta-data of the action:

---
description: 'Try to remediate disk space issues'
enabled: true
entry_point: disk_remediate.pl
name: disk_remediate
parameters:
  action:
    description: "Run as an action.  (Outputs structured data)"
    default: true
    immutable: true
    type: boolean
  directory:
    description: "The directory to check"
    required: true
    type: string
  debug:
    description: "Turn on debug output"
    default: false
    type: boolean
  sudo:
    default: true
    immutable: true
runner_type: remote-shell-script

Basically it looks very similar to the first one. And here's where your imagination comes into play. The dummy auto-remediation script may look something like this:

#!/usr/bin/perl
use strict;
use Getopt::Long;
#use JSON;

my $directory;
my $debug;
my %output;

GetOptions(
    "directory=s" => \$directory,
    "debug"      => \$debug
);

if ( !defined( $directory ) )
{
    $output{ 'result' } = 'fail';
    $output{ 'reason' } = "Directory is not provided!";
    finish( 1 );
}

if($directory eq '/var/log')
{
# do something with /var/log
}
elsif ($directory eq '/var')
{
# do something with /var
}
elsif ($directory eq '/home')
{
# do something with /home
}
elsif ($directory eq '/opt')
{
# do something with /opt
}

$output{'result'} = 'success';
finish(0);

sub finish
{
    my $exit_code = shift || 0;
    #my $json = encode_json \%output;
    #print "$json\n";
    exit( $exit_code );
}

Who writes in Perl nowadays, you'd ask? I don't know. Some old farts like me, probably. But you may go ahead and use your favorite Bash, Python or Ruby. All that matters is that the script can be executed remotely on the host where the disk issue appeared and was reported by Nagios. You may want to compress logs, upload them, move them, just delete them, enable compression in the logrotate configuration or even try to extend logical volumes if you have some spare space left when such a need arises. It's completely up to you what to do. I have disabled JSON output in the dummy script since the JSON module is not installed by default on many Linux distributions. In general it's a good idea to produce the outcome in JSON format, since it can then be easily picked up and published by actions in a workflow.

In the end, once you put everything together, the whole workflow looks like this (it is the same workflow shown in the picture at the beginning of this section, only in its YAML form rather than the Visual Flow representation):
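For those who prefer Python, the same dummy could be sketched roughly as follows, this time with the JSON output switched on, since the json module ships with Python. The handler for /var/log is a placeholder that does nothing, and the names HANDLERS, clean_var_log and remediate are made up for this example.

```python
import argparse
import json


def clean_var_log():
    # Placeholder: compress, move or delete logs here.
    return "nothing done, this is a dummy handler"


# Hypothetical per-directory handlers; add your own as needed.
HANDLERS = {"/var/log": clean_var_log}


def remediate(directory):
    """Return (exit_code, output) for the given directory."""
    if not directory:
        return 1, {"result": "fail", "reason": "Directory is not provided!"}
    output = {"result": "success"}
    handler = HANDLERS.get(directory)
    if handler is not None:
        output["details"] = handler()
    return 0, output


def main(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--directory")
    # Ignore any other options that may be passed along.
    options, _unknown = parser.parse_known_args(argv)
    exit_code, output = remediate(options.directory)
    print(json.dumps(output, sort_keys=True))
    return exit_code


# Example run; a real script would end with sys.exit(main(sys.argv[1:])).
print("exit code:", main(["--directory=/var/log"]))
```

Like the Perl version, it reports success for directories it has no special handling for, so decide for yourself whether that should count as a failure in your environment.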

---
version: '2.0'

e_nagios.remediate_disk_workflow:
  type: direct
  input:
    - hostname
    - directory
    - threshold
    - channel
  tasks:
    lets_work:
      # [466, 27]
      action: chatops.post_message
      input:
        channel: <% $.channel %>
        message: "epsibot is trying to take care of the disk space issue on <% $.hostname %> in <% $.directory %>"
      on-success:
        - check_dir_size
    check_dir_size:
      # [289, 149]
      action: e_nagios.check_dir_size
      input:
        hosts: <% $.hostname %>
        directory: <% $.directory %>
        threshold: <% $.threshold %>
      on-success:
        - hubot_error
      on-error:
        - remediate
    hubot_report:
      # [485, 568]
      action: chatops.post_message
      input:
        channel: <% $.channel %>
        message: "epsibot has cleared <% $.directory %> on <% $.hostname %> and it is now less than <% $.threshold %> percent!"
    hubot_error:
      # [114, 274]
      action: chatops.post_message
      input:
        channel: <% $.channel %>
        message: "Alert from Nagios was false positive for <% $.directory %> on <% $.hostname %>!"
    remediate:
      # [489, 233]
      action: e_nagios.disk_remediate
      input:
        hosts: <% $.hostname %>
        directory: <% $.directory %>
      on-success:
        - check_dir_size2
      on-error:
        - hubot_rem_fail
    check_dir_size2:
      # [485, 410]
      action: e_nagios.check_dir_size
      input:
        hosts: <% $.hostname %>
        directory: <% $.directory %>
        threshold: <% $.threshold %>
      on-success:
        - hubot_report
      on-error:
        - hubot_rem_fail
    hubot_rem_fail:
      # [82, 464]
      action: chatops.post_message
      input:
        channel: <% $.channel %>
        message: "Auto-remediation failed for <% $.directory %> on <% $.hostname %>. Please check manually."

As I mentioned earlier, we can also pass along how critical the alert was (just a warning or a critical situation) and act accordingly by altering the threshold or telling our script to be more aggressive.

At last, let's look at the workflow's metadata, as it contains parameters that tie it to the rule we started from:

---
  name: "remediate_disk_workflow"
  runner_type: mistral-v2
  description: "Remediation workflow for diskspace alerts"
  enabled: true
  entry_point: "workflows/remediate_disk_workflow.yaml"
  parameters:
    hostname:
      type: "string"
      description: "Host to remediate disk space on"
    directory:
      type: "string"
      description: "Directory to prune if over the threshold"
    threshold:
      type: "integer"
      description: "threshold for check diskspace action. percentage"
      default: 75
    channel:
      type: "string"
      default: "563b5f7f21f7a36d7bd5baaf"
      description: "Channel to post messages to"
    context:
      default: {}
      immutable: true
      type: object
    task:
      default: null
      immutable: true
      type: string

For example, we can define an array in the metadata with threshold levels for critical and warning alerts and use it to pass different numbers to the disk check script later in the workflow. Think about it on your own and try to implement it.

I hope this helps you get started with auto-remediation and guards your sleep at night. There are a few other rules in the pack worth looking at, e.g. ones triggering actions on "proc" and "load" Nagios service alerts. For instance, you may want to restart processes when Nagios reports them down.
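As a starting point for that homework, the lookup itself is tiny. The sketch below shows it in Python only to illustrate the idea; in the pack it would live in the workflow and its metadata instead. The percentages in THRESHOLDS are invented for this example, and the fallback is the default of 75 from the workflow metadata above.

```python
# Hypothetical thresholds in percent; tune them to your environment.
THRESHOLDS = {"WARNING": 80, "CRITICAL": 90}
DEFAULT_THRESHOLD = 75  # same default as in the workflow metadata


def threshold_for(state):
    """Pick the disk check threshold for a Nagios state name."""
    return THRESHOLDS.get(state.upper(), DEFAULT_THRESHOLD)


print(threshold_for("WARNING"))
print(threshold_for("CRITICAL"))
print(threshold_for("UNKNOWN"))
```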

We will talk about Stackstorm and Openstack integration in the next series of my posts.