Monitoring

google:munin monit cacti nagios

problems in the dayabay context

  1. do not want the monitoring node to hold the ssh keys to every node being monitored

nagios

  1. nagios is venerable, apache + perl based, a pain to configure, with a big community

http://www.macworld.com/article/1134079/nagios.html

macports has nagios at 3.2.3

others

  1. Monit, God, Supervisord, Upstart
    1. focus on starting/restarting daemons and services
  2. Munin, Cacti
    1. focus on visualization of RRDTool data
  3. Collectd
    1. focus on collecting and publishing data

fabric-cuisine-watchdog/daemonwatch

Python based and flexible, more bare-bones, so better suited to simple monitoring

http://www.slideshare.net/confoo/server-administration-in-python-with-fabric-cuisine-and-watchdog

  1. google:fabric cuisine watchdog

  2. fabric : python based ssh access to remote nodes, low level

    1. cuisine : simple function extensions using fabric primitives to add file/dir/text/user/group/sudo ops

    2. daemonwatch : (formerly watchdog)

      1. https://github.com/sebastien/daemonwatch
      2. https://github.com/sebastien/daemonwatch/blob/master/Sources/daemonwatch.py
      3. service is a collection of rules, with a frequency associated
      4. rules can succeed or fail and have output
      5. actions are bound to rule, triggered on success or fail
  3. I don't see any integration between daemonwatch and the other two; daemonwatch looks to run entirely on the local node
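
The service / rule / action model listed above can be sketched in plain Python. This is only an illustration of the idea, not daemonwatch's implementation: the `Rule` and `Service` classes, the `tick` method and the `record` action are all hypothetical names, and time is passed in explicitly so the sketch can be run without waiting.

```python
class Rule:
    """A check that succeeds or fails and has output (hypothetical class)."""

    def __init__(self, name, check, freq, on_success=(), on_fail=()):
        self.name = name
        self.check = check              # callable returning (ok, output)
        self.freq = freq                # seconds between two runs
        self.on_success = list(on_success)
        self.on_fail = list(on_fail)
        self.next_run = 0.0             # the rule is due immediately

    def run(self, now):
        ok, output = self.check()
        # actions are bound to the rule and triggered on success or on fail
        for action in (self.on_success if ok else self.on_fail):
            action(self.name, output)
        self.next_run = now + self.freq
        return ok


class Service:
    """A named collection of rules, each with its own frequency."""

    def __init__(self, name, rules):
        self.name = name
        self.rules = list(rules)

    def tick(self, now):
        """Run every rule that is due at time `now`; return {rule name: ok}."""
        results = {}
        for rule in self.rules:
            if now >= rule.next_run:
                results[rule.name] = rule.run(now)
        return results


log = []

def record(rule_name, output):
    # stand-in action; a real one would send mail, restart a daemon, ...
    log.append((rule_name, output))

# a rule that always fails, checked every 10 seconds
disk = Rule("disk", lambda: (False, "95% full"), freq=10, on_fail=[record])
service = Service("demo", [disk])

first = service.tick(now=0)     # due: the rule runs and fails
second = service.tick(now=5)    # not due yet: nothing runs
third = service.tick(now=10)    # due again
```

Everything here happens on the node that runs the script, which matches the local-node impression noted above.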

#!/usr/bin/env python
# Example from the slides; the import uses the project's former name, watchdog.
from watchdog import *

# Notification actions; the arguments are placeholders.
send_email = Email("name@whereever", "Subj", "config....")
send_xmpp = XMPP("name@jabber", "Subj", "config....")

Monitor(    # the "main"
    Service(    # a Service monitors its rules
        name="...",
        monitor=(
            HTTP(    # the HTTP rule tests a URL
                GET="http://...",
                freq=Time.s(1),
                timeout=Time.ms(80),
                # simple alternative: run these actions on every failure
                # fail=[Print("Failed..."), send_email, send_xmpp],
                # Incident is a "smart action": it checks whether something
                # is happening repeatedly within a time window
                fail=[
                    Incident(errors=5, during=Time.s(10),
                             actions=[send_email, send_xmpp])
                ]
            )
        )
    )
).run()
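
The Incident idea in the script above (act only when failures repeat within a time window) can be sketched with a sliding window of timestamps. This is a guess at the behaviour, not daemonwatch's code: the class below only borrows the `errors`, `during` and `actions` parameter names, `report_failure` is a hypothetical method, and clearing the window after firing is an assumption made so that one burst raises one alert.

```python
from collections import deque


class Incident:
    """Fire actions when `errors` failures occur within `during` seconds."""

    def __init__(self, errors, during, actions):
        self.errors = errors
        self.during = during
        self.actions = list(actions)
        self.times = deque()            # timestamps of recent failures

    def report_failure(self, now):
        self.times.append(now)
        # forget failures that fell out of the time window
        while self.times and now - self.times[0] > self.during:
            self.times.popleft()
        if len(self.times) >= self.errors:
            for action in self.actions:
                action(len(self.times))
            self.times.clear()          # assumption: start counting afresh
            return True
        return False


alerts = []
incident = Incident(errors=3, during=10.0, actions=[alerts.append])

# failures at t=0 and t=4 are too old by t=20, so only the later burst fires
fired = [incident.report_failure(t) for t in (0, 4, 20, 22, 25)]
```

Compared with attaching the email and XMPP actions directly to `fail`, this avoids a notification for every single failed check.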