Thursday, January 21, 2016

Stackstorm and Chatops Actions with confirmation

Before moving on to Openstack integration, I’d like to post a short article about a highly demanded feature that is going to be implemented and supported natively by Stackstorm one day: Chatops Action Confirmation.

In short, some actions requested from chatops may indeed be dangerous, and typos or incorrectly entered values may harm your system or lead to unexpected and undesirable results. That being said, it would be really nice to ask the user who issued the command to confirm his or her intention to execute it.

So for now we have to do it on our own. And I’ll tell you what: it is not really difficult. We will examine two chatops aliases, and I will explain what happens under the hood when these aliases are triggered.

Let’s begin and design our confirmation action and wrap it into the appropriate action-alias. If you want to quickly deploy the pack with all the actions and aliases right away, you can do so by running the following command:

st2 run packs.install packs=st2chat_confirm register=all repo_url=https://github.com/emptywee/st2chat_confirm.git

confirm_exec.meta.yaml (metadata file)

When I was experimenting with it, I tried different approaches, and initially it was an action-chain. Perhaps there’s a better way to execute the st2.kv.set action directly from the alias, but I haven’t found it yet. Either it’s impossible to do, or it’s poorly documented. All we need to do is pass the username of the person who executes the action (triggers the alias). So, we will use a simple action-chain with only one action designed to construct a proper key for the Stackstorm data store.

---
# Action definition metadata
name: "confirm_exec"
description: "Confirm action execution"
runner_type: "action-chain"
enabled: true
entry_point: "workflows/confirm_exec.yaml"
parameters:
  exec_id:
    type: string
    required: true
    description: "Action execution to confirm"
  skip_notify:
    default:
      - save_key

We will pass one parameter to the action-chain, which in turn will pick up our chat username and combine the two into a key. We need to do that because we do not want somebody else to confirm actions that were fired by you.

confirm_exec.yaml (action-chain)

The action chain itself is pretty simple:

---
chain:
    -
        name: "save_key"
        ref: "st2.kv.set"
        params:
            key: "{{action_context.parent.api_user}}_{{exec_id}}"
            value: "confirmed"
            ttl: 60

That is it for now. The action chain will set a key in the data store for 60 seconds. Now let’s wrap it up with an alias.
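To make the key-plus-TTL idea concrete before moving on, here is a minimal Python sketch of what the chain relies on. It is not Stackstorm code: TTLStore and confirmation_key are hypothetical names, and the in-memory dictionary only stands in for the real data store.

```python
import time


class TTLStore:
    """Tiny in-memory stand-in for a key-value store with expiring keys."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expiry timestamp)

    def set(self, key, value, ttl):
        # Remember the value together with the moment it stops being valid.
        self._data[key] = (value, self._clock() + ttl)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # The key has outlived its TTL, so treat it as missing.
            del self._data[key]
            return None
        return value


def confirmation_key(api_user, exec_id):
    # Same shape as the chain's "{{api_user}}_{{exec_id}}" template.
    return "{}_{}".format(api_user, exec_id)
```

Because the username is part of the key, a confirmation sent by a different user lands under a different key and is never seen by the workflow that waits for yours.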

confirm_exec.yaml (alias)

The alias definition is also very simple.

---
name: test.confirm_exec
enabled: true
action_ref: st2chat_confirm.confirm_exec
description: Confirm potentially dangerous execution
formats:
  - display: "confirm <execution id>"
    representation:
      - "confirm {{exec_id}}"
ack:
  format: "Confirming action!"
  append_url: false
result:
  enabled: false

Feel free to adjust to your own needs here, don’t forget it’s just an example. This alias will trigger the action-chain once you give a command similar to ! confirm 56a01f468e326f6c51a3d4a9. Of course you can go ahead and replace execution id with some random number or magic word. It doesn’t really matter.

Now, let’s design our potentially dangerous action! I will use a mistral workflow as an example. The same approach might work for action-chains too, although a simple action-chain doesn’t really have a mechanism for waiting on user actions. But this is up to you to explore.

wf_with_confirm.meta.yaml (metadata)

Here’s our metadata for the potentially dangerous action!

---
  description: "test wf with confirm from chatops"
  runner_type: "mistral-v2"
  tags: []
  enabled: true
  pack: "st2chat_confirm"
  entry_point: "workflows/wf_with_confirm.yaml"
  uid: "action:st2chat_confirm:wf_with_confirm"
  parameters: 
    hostlist: 
      required: true
      type: "string"
      description: "a list of hosts"
    param1: 
      default: ""
      type: "string"
      description: "Some parameter"
  ref: "st2chat_confirm.wf_with_confirm"
  name: "wf_with_confirm"

In this example we are doing something (literally doing something!) to a list of hosts. Therefore, we will need to confirm it! We will pass a list of host names as hostlist and some arbitrary parameter param1.

The workflow itself is represented on the diagram below.

Flow Diagram of the test workflow

Let’s go step by step over the workflow.
1. The first step publishes a few variables that we’ll refer to later. This step is optional and is placed here only for convenience. We publish the chat_user, source_channel and exec_id variables here. You will see why later.
2. The second step posts a message into the channel asking the user to confirm the action execution.
3. Next, we wait for about 30 seconds for the action to get confirmed. If it’s confirmed, we take the execution one way; if it’s not, the other way.

Yes, it’s that simple. This workflow can be used as a starting point for every dangerous action you design. I think that we can even pass the name of the desired workflow to be executed after confirmation. That way we won’t have to copy and paste the same code into each such action. Code re-use is a really good thing to always keep in mind.

wf_with_confirm.yaml (workflow)

The mistral workflow itself is quite simple as well:

---
version: '2.0'

st2chat_confirm.wf_with_confirm:
  type: direct
  input:
    - hostlist
    - param1
  tasks:
    publish_data:
      # [297, 28]
      action: core.noop
      publish:
        chat_user: <% env().get('__actions').get('st2.action').st2_context.parent.api_user %>
        source_channel: <% env().get('__actions').get('st2.action').st2_context.parent.source_channel %>
        exec_id: <% env().get('__actions').get('st2.action').st2_context.parent.execution_id %> 
      on-success:
        - post_confirm_message
    post_confirm_message:
      # [286, 163]
      action: chatops.post_message
      input:
        channel: '<% $.source_channel %>'
        message: '@<% $.chat_user %>, the action you have requested is dangerous. Please, confirm by issuing "! confirm <% $.exec_id %>" command. You have 30 seconds to confirm it.'

      on-success:
        - wait_for_confirmation
    wait_for_confirmation:
      # [286, 304]
      action: st2.kv.get
      input:
        key: '<% $.chat_user %>_<% $.exec_id %>'

      retry:
        count: 10
        delay: 3

      on-error:
        - post_not_confirmed
      on-success:
        - post_confirmed
    post_not_confirmed:
      # [456, 434]
      action: chatops.post_message
      input:
        channel: '<% $.source_channel %>'
        message: '@<% $.chat_user %>, I have not received confirmation from you within 30 seconds. The execution has been aborted.'

    post_confirmed:
      # [97, 445]
      action: chatops.post_message
      input:
        channel: '<% $.source_channel %>'
        message: '@<% $.chat_user %>, The action is confirmed. Proceeding...'

Take a look at the first task there. Notice the long path to the variables we need. Perhaps, there’s a better way to get to them and store them, but I couldn’t figure it out yet. If you did, please, share in the comments section below.

The key aspect here (and the reason we actually use a mistral workflow) is the retry section of the wait_for_confirmation task. Mistral allows you to retry a task a set number of times. Thus, setting 10 attempts with a 3-second delay gives us about 30 seconds to confirm the action.
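The polling behaviour of that retry section can be sketched in plain Python. This is only an illustration of the idea, not how Mistral implements it: wait_for_confirmation and its lookup argument are hypothetical, and I am assuming the first attempt is not counted among the retries.

```python
import time


def wait_for_confirmation(lookup, key, count=10, delay=3, sleep=time.sleep):
    """Return True once lookup(key) yields a value, False when all attempts fail."""
    # One initial attempt plus `count` retries, pausing `delay` seconds in between.
    for attempt in range(count + 1):
        if lookup(key) is not None:
            return True
        if attempt < count:
            sleep(delay)
    return False
```

With the defaults, the total waiting time is 10 pauses of 3 seconds each, which is where the "about 30 seconds" comes from. The sleep argument is injectable only so that the loop can be exercised without actually waiting.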

The last touch is wrapping it up in an action-alias.

test.yaml (alias)

---
name: st2chat_confirm.wf_with_confirm
enabled: true
action_ref: st2chat_confirm.wf_with_confirm
description: Test workflow with confirm. Starting point.
formats:
  - display: "do_something with <hostlist> <param1>"
    representation:
      - "do_something with {{hostlist}} {{param1}}"
result:
  format: |
    Execution ID {{ execution.id }} complete.

In the end

Now let’s reload everything and fire up the potentially dangerous action we have just created!
To reload the metadata of actions and aliases, simply issue the following command:

st2ctl reload --register-all

Executing a potentially dangerous action

Ta-dam!

GitHub repository is located here: https://github.com/emptywee/st2chat_confirm

Feel free to ask questions if you have any. As always, you are welcome to join the friendly and super-fast responding Stackstorm community at https://stackstorm.com/community/

Also, I’d recommend trying the Stackstorm Enterprise Edition. It gives you that beautiful visual workflow editor and support from the Stackstorm core team.

Wednesday, January 6, 2016

Part I. Stackstorm: from Nagios integration to Openstack automation

Recently I've been playing around with Stackstorm, a platform for integration and automation of day-to-day tasks, monitoring events, existing scripts and deployment tools.

I am going to explain how easy it is to wrap your daily tasks into Stackstorm actions and workflows and how to provide a simple way of execution for complex tasks.

I've been thinking about it for a while, and initially I was going to put everything in one blog post. Later, though, that didn't seem like a good idea, so I'd rather break it into multiple posts, no matter how many there will be. It seems to me that this will be more practical and easier to read and understand.

Nagios

Let's start with integrating Nagios alerts into Stackstorm. I assume that you already have Stackstorm version 1.2.0+ installed, configured and running, and that there is a Nagios instance running somewhere that is capable of processing alerts and handling events. If not, please proceed to www.stackstorm.com and www.nagios.org (installation and initial configuration of these two tools is beyond the scope of this post). Both tools are open source and free to use, and both have extensive documentation on installation and basic configuration.

First of all, I was happy to find an existing Nagios integration pack at https://github.com/StackStorm/st2contrib/tree/master/packs/nagios, but my joy ended really quickly as I found it not working. Secondly, this blog post helped me to get started: https://stackstorm.com/2015/10/05/auto-remediation-out-of-disk-space/

Although it talks about sensu and victorops (monitoring and paging tools), you can easily apply the logic to Nagios. With help from the Stackstorm support team (they are really cool guys and they reside on Slack, see https://stackstorm.com/community/ for details) I was able to patch the st2 service handler python script to make it work (see diff here https://www.diffchecker.com/vzsiskbg). Don't worry, later I'll post a link to the github repository with all the files you need.

Also, don't forget to apply for a trial of the Enterprise Stackstorm edition! You will get a very cool Flow Visual Editor that will let you create really nice workflows with a drag of your mouse! I highly recommend at least trying it. It won't hurt, I promise.

Deploying the pack

Let's go ahead and deploy our example nagios pack (the repository itself is located at https://github.com/emptywee/e_nagios):

st2 run packs.install packs=e_nagios register=all repo_url=https://github.com/emptywee/e_nagios.git

Don't worry if you see that reloading the rules throws some exceptions due to non-existing triggers. It's all right at this point. The trigger will be created once you run the st2 service handler script from the Nagios server, assuming that it can connect to your Stackstorm server over ports 9100 and 9101 and that the username-password pair is correct. (The latest Stackstorm version supports accessing the auth and API endpoints on port 443 as well, but I haven't tried this approach yet.) The exceptions shouldn't happen anyway, since all rules in the pack are disabled by default. Let's go ahead and enable the nagios_service_chat.yaml rule. Simply edit it with your favorite editor in /opt/stackstorm/packs/e_nagios/rules/ and switch enabled to true.

Adjusting rules

Let's take a brief look at the rule itself (nagios_service_chat.yaml):

---
name: notify_chat
pack: e_nagios
description: Post to chat when nagios service state changes
enabled: true
trigger:
  type: e_nagios.service_state_change
criteria:
  trigger.attempt:
    pattern: "2"
    type: gt
action:
  ref: chatops.post_message
  parameters:
    message: NAGIOS {{trigger.service}} ID:{{trigger.event_id}} STATE:{{trigger.state}}[{{ trigger.state_id }}]/{{trigger.state_type}}
      {{trigger.msg}}
    channel: '563b5f7f21f7a36d7bd5baaf'

trigger: - is the trigger name that will make this rule fire up the action when certain criteria are met. In our case here we will post a message to our chatops (be it Slack, Lets-chat, HipChat or any other that is supported by Hubot). Plugging in chatops has become a simple task since v1.2.0 of Stackstorm has been released. So make sure you have chatops up and running to fully utilize all features, comfort, flexibility, and other bells and whistles of Stackstorm platform.

criteria: - with only AND logic so far you can specify when exactly you'd like the action to be executed. In this example we want the bot to report to chatops whenever any service or host changes its state into HARD state (usually after 3 consecutive checks with the same result).

action: - what should be executed when the criteria are met. In our case here we will just post a message to the specific channel. If you are using Slack, the channel should be a more readable and meaningful name. Don't hesitate to alter it to your needs.
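If the AND logic of the criteria section feels abstract, the following Python sketch shows the idea. It is not Stackstorm's rule engine: rule_matches and the OPERATORS table are hypothetical, and the exact semantics of gt and matchregex here are my assumption.

```python
import re

# Hypothetical operator table; only the two operators used in this post.
OPERATORS = {
    "gt": lambda value, pattern: float(value) > float(pattern),
    "matchregex": lambda value, pattern: re.search(pattern, str(value)) is not None,
}


def rule_matches(criteria, trigger):
    """Return True only if every criterion holds for the trigger payload."""
    for field, check in criteria.items():
        name = field.split(".", 1)[1]  # "trigger.attempt" -> "attempt"
        if name not in trigger:
            return False
        if not OPERATORS[check["type"]](trigger[name], check["pattern"]):
            return False
    return True
```

For the rule above, rule_matches({"trigger.attempt": {"type": "gt", "pattern": "2"}}, payload) holds for a payload whose attempt is "3" and fails when it is "2". Adding more entries to the criteria can only narrow the match, never widen it.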

Don't forget to reload the rules:

st2ctl reload --register-rules

There's another way to temporarily enable rules:

st2 rule enable e_nagios.notify_chat

But it will be disabled again after the next reload of the rules unless you modify the respective YAML file with the rule definition (the so-called meta-data).

Setting up Nagios

If you take a look at the st2 service handler script, you may notice that it is you who decides what to pass from Nagios to Stackstorm rules, because it's up to you to put anything you want into the payload that the script will send to Stackstorm. All these trigger.service, trigger.msg and the like are just parts of a payload that is formed on the Nagios host. With dozens, if not hundreds, of macros available in Nagios you may choose those that fit your needs. Here's a link to the standard Nagios macros: https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/macrolist.html

Here's what you should do on the Nagios host. First of all, you need to upload the st2service_handler.py script and st2service_handler.conf to the Nagios host and place them wherever you like; just make sure that Nagios can execute the script from that location. Make the script executable. Secondly, you need to define a command that uses macros from the link just above. In my case I uploaded the script into the /opt/nagios/libexec/ directory and have Nagios set up with NRPE, so I define it in the master nrpe.cfg file:

command[st2nagios]=/opt/nagios/libexec/st2service_handler.py /opt/nagios/libexec/st2service_handler.conf $SERVICEEVENTID$ "$SERVICEDESC$" $SERVICESTATE$ $SERVICESTATEID$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $HOSTNAME$

Then apply this command to global_service_event_handler in nagios.cfg:

global_service_event_handler=st2nagios

That is it! We are almost done with the Nagios part. It is a good idea to run the command manually as the nagios user:

$ whoami
nagios
$ /opt/nagios/libexec/st2service_handler.py /opt/nagios/libexec/st2service_handler.conf 123456 "Disk /var/log" WARNING 1 HARD 3 remote_host_name
Registered trigger type with st2.
POST: url: https://st2.example.com:9101/webhooks/st2/, body: {'trigger': 'e_nagios.service_state_change', 'payload': {'attempt': '3', 'service': 'Disk /var/log', 'event_id': '123456', 'state': 'WARNING', 'state_type': 'HARD', 'host': 'remote_host_name', 'msg': '[WARNING] Service/Host warning alert!', 'state_id': '1'}}
Sent nagios event to st2. HTTP_CODE: 202

This will register a trigger. After that you can safely reload rules that rely on the e_nagios.service_state_change trigger. Also, running it manually will let you test your rules and actions without actually forcing Nagios to generate real alerts. That is a good thing, isn't it?

So, in short, we are basically filling in our own payload and passing it from Nagios to Stackstorm. Just make sure that st2service_handler.py script has all the fields defined and in the correct order if you are about to add or remove Nagios macros from the event command.

The relevant part about it in st2service_handler.py is:

# STATE_MESSAGE, ST2_TRIGGERTYPE_REF, _post_event_to_st2 and
# _get_st2_webhooks_url are defined elsewhere in the script.

def _get_payload(host, service, event_id, state, state_id, state_type, attempt):
    # Everything placed in this dict is available to rules as trigger.<key>.
    payload = {}
    payload['host'] = host
    payload['service'] = service
    payload['event_id'] = event_id
    payload['state'] = state
    payload['state_id'] = state_id
    payload['state_type'] = state_type
    payload['attempt'] = attempt
    payload['msg'] = STATE_MESSAGE.get(state, 'Undefined state.')
    return payload


def main(args):
    # Positional arguments, in the same order as the macros in the Nagios command.
    event_id = args[1]
    service = args[2]
    state = args[3]
    state_id = args[4]
    state_type = args[5]
    attempt = args[6]
    host = args[7]
    body = {}
    body['trigger'] = ST2_TRIGGERTYPE_REF
    body['payload'] = _get_payload(host, service, event_id, state, state_id, state_type, attempt)
    _post_event_to_st2(_get_st2_webhooks_url(), body)

The order is defined by the array index of args passed to the main function. This is very important.
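Since a silently shifted argument is easy to miss, one possible hardening is to name the positions once and check the argument count. The sketch below is not part of the pack: FIELDS and build_payload are hypothetical, and I am assuming that args[0] holds the config file path, as in the manual invocation shown earlier.

```python
# Field names in the same order as the macros on the Nagios command line.
FIELDS = ("event_id", "service", "state", "state_id", "state_type", "attempt", "host")


def build_payload(args):
    """Turn the handler's positional arguments into a named payload."""
    values = args[1:]  # args[0] is assumed to be the config file path
    if len(values) != len(FIELDS):
        raise ValueError(
            "expected {} values, got {}".format(len(FIELDS), len(values))
        )
    return dict(zip(FIELDS, values))
```

With this shape, adding or removing a Nagios macro means editing a single tuple, and a mismatch between the command definition and the script fails loudly instead of producing a scrambled payload.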

Verifying in chatops

The best way to verify it's working is to run the command manually from the Nagios host. You should see your bot reporting to the channel set in the rule.

Simple, eh? Now with what we have achieved so far, we can move forward and enhance our alert handling service.

Enhancing alert handling

Although it's very exciting to receive alerts in the chat room, it doesn't make you much happier than you already are, and it certainly doesn't relieve you from manually going and checking what exactly triggered the alert and taking remedial actions.

Your use cases and real-world scenarios might be slightly different, but disk space auto-remediation is a common task that everyone runs into during day-to-day operations. You don't really want to be awakened by a call early in the morning just to log in remotely and clean up some log files that filled up the whole disk. So it makes a really good example to deal with.

That's where the nagios_service_disk.yaml rule comes in handy. Let's take a look at it:

---
name: check_disk
pack: e_nagios
description: Check disk usage and trigger remediation
enabled: true
trigger:
  type: e_nagios.service_state_change
criteria:
  trigger.service:
    pattern: "^Disk"
    type: matchregex
  trigger.state_type:
    pattern: "HARD"
    type: matchregex
  trigger.state_id:
    pattern: "0"
    type: gt
action:
  ref: e_nagios.remediate_disk_workflow
  parameters:
    hostname: "{{ trigger.host }}"
    directory: "{{ trigger.service | regex_replace('^Disk\\s*', '') }}" 

Pretty similar to the one above that just posts messages to the chat room, right? Don't forget to enable it, as it comes disabled by default. Although we utilize the same trigger, the criteria here are slightly different. We are matching the service description against a regex pattern, explicitly matching the HARD state type of the alert, and catching state IDs greater than zero (in Nagios, 0 means OK, which is what a recovery reports, 1 means WARNING and 2 means CRITICAL). We do not want to automatically fire an action on recovery alerts, right? At least in this case. One more thing is important here, and I once spent a lot of time figuring out why my rule didn't work properly: you really should wrap your patterns in double quotes when you define your criteria, even if the pattern is an integer (see the trigger.state_id criterion as an example).

That being said, once we receive an alert matching the criteria, Stackstorm will execute the defined action and do some magic with the parameters we are passing to that action. Namely, the action is called e_nagios.remediate_disk_workflow and is defined under the actions/ directory of the pack. We pass the hostname that triggered the alert and strip the Disk part out of the service description, leaving only the directory of the mounted partition itself (this assumes that the disk monitoring service has an appropriate service description defined in the Nagios config; don't hesitate to adjust it to your own environment). Yes, Stackstorm supports Jinja2 filters in rule definitions when you pass parameters to actions!
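To see what the regex_replace filter in the rule does to a service description, here is the equivalent substitution in plain Python. The function name strip_disk_prefix is made up for this illustration; only the pattern comes from the rule.

```python
import re


def strip_disk_prefix(service_description):
    # Same pattern as the rule's regex_replace('^Disk\\s*', '') filter:
    # drop a leading "Disk" and any whitespace that follows it.
    return re.sub(r"^Disk\s*", "", service_description)
```

A service description of "Disk /var/log" becomes "/var/log", which is what the workflow receives as its directory parameter. A description that does not start with Disk passes through unchanged, but the rule's criteria would not have matched it in the first place.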

Disk Space Remediation Action

It's time to design our auto-remediation action! Here's how it looks (and it does really look nice) in the Visual Flow tool that comes in the Enterprise Stackstorm edition:

The workflow itself is pretty simple and consists of the following steps:

  1. Report in the chatops that we received the task to check the disk space;
  2. Run the disk check action that confirms the alert from Nagios;
  3. If the disk usage is above the defined threshold, run an auto-remediation action, else report in the chat room that it was a false positive alert from Nagios;
  4. If the auto-remediation action completes with no errors try to check the disk space usage again, else report about the error in the chat room;
  5. If the disk space usage comes below the defined threshold assume that auto-remediation succeeded and report about it in the chatops, else report in the chat room that the auto-remediation failed.
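The five steps above boil down to a small decision tree. The following Python sketch mirrors it with plain callables; remediate_disk and its three arguments are hypothetical stand-ins for the chat and remote actions, and check is assumed to return True when usage is within the threshold.

```python
def remediate_disk(check, remediate, report):
    """Walk through the workflow steps and return the final outcome."""
    # Step 1: announce that we picked up the task.
    report("trying to take care of the disk space issue")
    # Steps 2 and 3: confirm the alert before touching anything.
    if check():
        report("alert was a false positive")
        return "false_positive"
    # Step 4: run the remediation and stop if it errors out.
    if not remediate():
        report("auto-remediation failed")
        return "failed"
    # Step 5: check again to see whether the remediation actually helped.
    if check():
        report("disk space is back under the threshold")
        return "remediated"
    report("auto-remediation failed")
    return "failed"
```

Keeping the check separate from the remediation means the same check can be reused before and after the clean-up.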

Suffice it to say that reporting to the channel can be substituted or extended with reporting via email or any other means to page you and ask for manual intervention. The good thing here is that the scripts that do all the work can be written in any language you like; just make sure they can be executed remotely on the host that is being checked.

Before we can design our workflows in the Visual Flow, we need to define two basic actions for our needs:

  1. check_dir_size
  2. disk_remediate

The meta-data for the check_dir_size action is defined in YAML and looks like this:

---
description: 'Check the total percentage of disk taken up by a specified directory'
enabled: true
entry_point: check_dir_size.py
name: check_dir_size
parameters:
  action:
    description: "Run as an action.  (Outputs structured data)"
    default: true
    immutable: true
    type: boolean
  directory:
    description: "The directory to check"
    required: true
    type: string
  threshold:
    description: "Maximum percentage of disk space that can be consumed by the directory."
    default: 80
    type: integer
  debug:
    description: "Turn on debug output"
    default: false
    type: boolean
  sudo:
    default: true
    immutable: true
runner_type: remote-shell-script

Three things to pay attention to: entry_point, parameters and runner_type. Since it's a remote-shell-script runner, there's an implied parameter hosts that this action will require (see https://docs.stackstorm.com/runners.html#remote-script-runner-remote-shell-script for details, for instance in case you need to provide password authentication). entry_point points to the script name, which should reside in the same directory. parameters declares all parameters that will be passed to the script, along with their types and other options. As homework, you may want to transform the alert level (Warning or Critical) coming from Nagios into a threshold level for the script. But that should be done in the workflow depicted earlier, when we talked about the Visual Flow tool.
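The check script itself lives in the repository and is not reproduced here. To give an idea of what such a check could look like, here is a minimal sketch under my own assumptions: it measures usage of the whole filesystem that holds the directory (via shutil.disk_usage) rather than the size of the directory itself, and usage_percent and check_disk are hypothetical names.

```python
import shutil


def usage_percent(used_bytes, total_bytes):
    """Return used space as a percentage of the total."""
    if total_bytes == 0:
        return 0.0
    return 100.0 * used_bytes / total_bytes


def check_disk(directory, threshold=80):
    """Return (ok, percent) for the filesystem that holds `directory`."""
    usage = shutil.disk_usage(directory)
    percent = usage_percent(usage.used, usage.total)
    # ok is True while usage stays at or below the threshold.
    return percent <= threshold, percent
```

A real script would turn a False result into a non-zero exit code, so that a workflow can send the failed check down its on-error branch.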

And the most interesting action is the disk_remediate action. Let's take a look at the meta-data of the action:

---
description: 'Try to remediate disk space issues'
enabled: true
entry_point: disk_remediate.pl
name: disk_remediate
parameters:
  action:
    description: "Run as an action.  (Outputs structured data)"
    default: true
    immutable: true
    type: boolean
  directory:
    description: "The directory to check"
    required: true
    type: string
  debug:
    description: "Turn on debug output"
    default: false
    type: boolean
  sudo:
    default: true
    immutable: true
runner_type: remote-shell-script

Basically it looks very similar to the first one. And here's where your imagination comes forward. The dummy auto-remediation script may look something like this:

#!/usr/bin/perl
use strict;
use Getopt::Long;
#use JSON;

my $directory;
my $debug;
my %output;

GetOptions(
    "directory=s" => \$directory,
    "debug"      => \$debug
);

if ( !defined( $directory ) )
{
    $output{ 'result' } = 'fail';
    $output{ 'reason' } = "Directory is not provided!";
    finish( 1 );
}

if($directory eq '/var/log')
{
# do something with /var/log
}
elsif ($directory eq '/var')
{
# do something with /var
}
elsif ($directory eq '/home')
{
# do something with /home
}
elsif ($directory eq '/opt')
{
# do something with /opt
}

$output{'result'} = 'success';
finish(0);

sub finish
{
    my $exit_code = shift || 0;
    #my $json = encode_json \%output;
    #print "$json\n";
    exit( $exit_code );
}

Who writes in Perl nowadays, you'd ask? I don't know. Some old farts like me, probably. But you may go ahead and use your favorite Bash, Python or Ruby. All that matters is that it can be executed remotely on the host where the disk issue appeared and was reported by Nagios. You may want to compress logs, upload them, move them, just delete them, enable compression in the logrotate configuration, or even try to extend logical volumes if you have some spare space left when such a need arises. It's completely up to you what to do. I have disabled JSON output in the dummy script since the JSON module is not installed by default on every Linux distribution. In general it's a good idea to produce output in JSON format, since it can then be easily consumed and published by actions in a workflow.
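For those who prefer Python, the same dummy skeleton with the JSON output switched on could look roughly like this. It is a sketch rather than a drop-in replacement: the clean-up logic is still a placeholder, and I am assuming the parameters arrive as --directory and --debug options, as they do in the Perl version.

```python
import argparse
import json


def remediate(directory):
    """Return a result dict; the per-directory clean-up is left as a placeholder."""
    if not directory:
        return {"result": "fail", "reason": "Directory is not provided!"}
    # Placeholder: real clean-up for /var/log, /var, /home or /opt goes here.
    return {"result": "success", "directory": directory}


def main(argv=None):
    parser = argparse.ArgumentParser(description="Dummy disk remediation")
    parser.add_argument("--directory")
    parser.add_argument("--debug", action="store_true")
    args = parser.parse_args(argv)
    output = remediate(args.directory)
    # Structured output is easy for a workflow to parse and publish.
    print(json.dumps(output))
    return 0 if output["result"] == "success" else 1
```

The return value of main is meant to become the process exit code, so wire it up with sys.exit(main()) in a real script. Python's json module ships with the standard library, so the missing-module problem from the Perl version does not apply here.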

In the end, after you put everything together, the whole workflow looks like this (it is the same workflow shown in the picture at the beginning of this section, but there in the Visual Flow representation):

---
version: '2.0'

e_nagios.remediate_disk_workflow:
  type: direct
  input:
    - hostname
    - directory
    - threshold
    - channel
  tasks:
    lets_work:
      # [466, 27]
      action: chatops.post_message
      input:
        channel: <% $.channel %>
        message: "epsibot is trying to take care of the disk space issue on <% $.hostname %> in <% $.directory %>"
      on-success:
        - check_dir_size
    check_dir_size:
      # [289, 149]
      action: e_nagios.check_dir_size
      input:
        hosts: <% $.hostname %>
        directory: <% $.directory %>
        threshold: <% $.threshold %>
      on-success:
        - hubot_error
      on-error:
        - remediate
    hubot_report:
      # [485, 568]
      action: chatops.post_message
      input:
        channel: <% $.channel %>
        message: "epsibot has cleared <% $.directory %> on <% $.hostname %> and it is now less than <% $.threshold %> percent!"
    hubot_error:
      # [114, 274]
      action: chatops.post_message
      input:
        channel: <% $.channel %>
        message: "Alert from Nagios was false positive for <% $.directory %> on <% $.hostname %>!"
    remediate:
      # [489, 233]
      action: e_nagios.disk_remediate
      input:
        hosts: <% $.hostname %>
        directory: <% $.directory %>
      on-success:
        - check_dir_size2
      on-error:
        - hubot_rem_fail
    check_dir_size2:
      # [485, 410]
      action: e_nagios.check_dir_size
      input:
        hosts: <% $.hostname %>
        directory: <% $.directory %>
        threshold: <% $.threshold %>
      on-success:
        - hubot_report
      on-error:
        - hubot_rem_fail
    hubot_rem_fail:
      # [82, 464]
      action: chatops.post_message
      input:
        channel: <% $.channel %>
        message: "Auto-remediation failed for <% $.directory %> on <% $.hostname %>. Please check manually."

As I mentioned earlier, we can actually pass how critical the alert was (just a warning, or a critical situation) and act accordingly by altering the threshold or telling our script to be more aggressive.

At last, let's look at the workflow's metadata, as it contains parameters that tie it to the rule we started from:

---
  name: "remediate_disk_workflow"
  runner_type: mistral-v2
  description: "Remediation workflow for diskspace alerts"
  enabled: true
  entry_point: "workflows/remediate_disk_workflow.yaml"
  parameters:
    hostname:
      type: "string"
      description: "Host to remediate disk space on"
    directory:
      type: "string"
      description: "Directory to prune if over the threshold"
    threshold:
      type: "integer"
      description: "threshold for check diskspace action. percentage"
      default: 75
    channel:
      type: "string"
      default: "563b5f7f21f7a36d7bd5baaf"
      description: "Channel to post messages to"
    context:
      default: {}
      immutable: true
      type: object
    task:
      default: null
      immutable: true
      type: string

For example, we can define an array in the metadata with threshold levels for critical and warning alerts and use it to pass different numbers to disk check scripts later in the workflow. Think about it on your own and try to implement.
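As a starting point for that exercise, the mapping itself is tiny. The sketch below shows it in Python; THRESHOLDS, threshold_for and the numbers in the table are all hypothetical, so pick values that make sense for your hosts.

```python
# Hypothetical thresholds (percent of disk used) per Nagios state.
THRESHOLDS = {"WARNING": 75, "CRITICAL": 60}


def threshold_for(state, default=75):
    """Pick the disk usage threshold to enforce for a given alert state."""
    # A critical alert gets a lower threshold, so the clean-up has to free more space.
    return THRESHOLDS.get(state.upper(), default)
```

In the workflow, the result of such a lookup would be passed to the disk check action as its threshold input instead of the fixed default.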

I hope this helps you get started with auto-remediation and guards your sleep at night. There are a few other rules in the pack worth looking at, e.g. triggering actions on "proc" and "load" Nagios service alerts. For instance, you may want to restart processes when Nagios reports them down.

We will talk about Stackstorm and Openstack integration in the next series of my posts.