SNIPS Operations Guide

Version 1.0
Last Updated: Feb 2001

CONTENTS

1. Running SNIPS

2. User Interfaces

3. Notifications & Reports

4. Configurations for Large Networks

5. Appendix


You must read the Installation document prior to reading this Operations guide.

1.  Running SNIPS

File Locations

The main directory where snips gets installed is specified at compile time (default is set to /usr/local/snips). The following sub-directories exist under this main directory:

bin/ All monitors and utility scripts are in this directory.
data/ The raw data collected by the monitors.
etc/ All configuration files, and the snmp MIB file.
msgs/ All files in this directory are displayed in the 'snipstv' msgs subwindow.
run/ The PID files for all the monitors (used to ensure only one copy of a monitor runs at a time), and the error files for runtime errors.
device-help/ Contains help files specific to a device (and optionally a variable), which are displayed when a user clicks on the HELP button in snipsweb.
init.d/ A SysV style 'init' directory which contains scripts to start/stop/restart the various processes.

Configuration Files

There is a global config file for all the C monitors named snips.conf, typically stored under /etc or /usr/local/snips/etc (the software automatically searches for the file in both of these locations). The snips loghost is read from this file, and this file also allows changing other global directory settings. A common config file for all the perl monitors is located at /usr/local/snips/etc/snipsperl.conf.

The configuration file for each individual monitor is located in the /usr/local/snips/etc/ directory. There are sample configuration files located in the etc/samples subdirectory. Using these sample files as templates, you should create configuration files for each monitor that you want to use. Note that in most monitors, the 'name' of the device is not really used by the monitor but is basically an operator friendly name for the device.

It is recommended that these configuration files be stored using RCS (or some other revision control system) to prevent multiple operators from editing a file at the same time and to keep old revisions automatically.

Starting the Monitors

There are two ways to start the monitors: you can start a particular monitor manually using the corresponding init script in the init.d/ directory, or automatically from crontab using the keepalive_monitors.pl script. This script is run periodically from crontab and ensures that all the desired monitors are running.

You must list all desired monitors in the keepalive_monitors.pl script (edit the @{$snipshost} variable). Ensure that you have set up email aliases for the operations staff and also created a 'snips' user to run all these programs (all these steps are listed in the Installation document). Any error messages from the monitors are written to the run/xxxx.error file. This file is mailed to the OPSMAIL email address when a monitor is restarted.

After you have created the config files and edited keepalive_monitors.pl, ensure that the contents of bin/crontab.snips are loaded into cron (typically done using crontab bin/crontab.snips).

When changes are made to a config file, you can reload these changes by sending the respective monitor a HUP signal (or using init.d/xxx.init hup). Alternatively, you can get the monitors to automatically reload their config files when they detect a change by starting the monitors with the '-a' flag for auto-reload. If you use this flag, ensure that your changes are completely written out to the config file so that the monitor does not load a half-edited, unusable file.

There might be a slight delay in reloading the config files on receiving a HUP signal, since the monitor finishes its current polling cycle before reloading the file.
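As a rough illustration of the manual reload, the sketch below reads a monitor's PID from a file and sends that process a HUP signal. It is only a sketch under assumptions: this guide states that PID files live under run/ but not how they are named, so the path is supplied by the caller, and the helper names read_pid and reload_monitor are hypothetical rather than part of SNIPS.

```python
import os
import signal


def read_pid(pid_file):
    """Return the process ID stored on the first line of pid_file."""
    with open(pid_file) as handle:
        return int(handle.readline().strip())


def reload_monitor(pid_file):
    """Send HUP to the process whose PID is recorded in pid_file.

    The PID file location is an assumption supplied by the caller.
    The monitor finishes its current polling cycle before rereading
    its config file, so the reload is not instantaneous.
    """
    pid = read_pid(pid_file)
    os.kill(pid, signal.SIGHUP)
    return pid
```

Calling reload_monitor() with the path of a monitor's PID file under run/ would be roughly equivalent to running the matching init.d script with the 'hup' argument.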

Generally the monitors do not need any command line arguments: the default names and locations of the configuration file and the data directory are compiled into the monitors. However, you can always specify an alternate config file or output data file using the '-c' or the '-o' command line options respectively. All monitors also accept the '-d' flag to indicate debug mode, in which case they write debug messages to stderr. You can send a USR1 signal to any monitor to increase the level of debugging (the level increases with each USR1 signal up to 2 and then resets to 0).
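The USR1 behaviour described above amounts to a small wrap-around counter. The following sketch models only that counting rule; it is not the monitors' actual code, and the starting level of 0 and the function names are assumptions made for illustration.

```python
MAX_DEBUG_LEVEL = 2  # highest level reached before the counter wraps to 0


def next_debug_level(level):
    """Return the debug level in effect after one more USR1 signal."""
    return 0 if level >= MAX_DEBUG_LEVEL else level + 1


def levels_after(signals, start=0):
    """List the debug level after each of `signals` consecutive USR1s.

    `start` is the assumed level before the first signal arrives.
    """
    levels = []
    level = start
    for _ in range(signals):
        level = next_debug_level(level)
        levels.append(level)
    return levels
```

Under these assumptions, four consecutive USR1 signals take a monitor through levels 1, 2, 0 and 1.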

The keepalive_monitors.pl script starts the logging daemon (snipslogd) first so that the monitors can log to this process (see next section for additional information on snipslogd).

snipslogd - the Logging Daemon

The snipslogd daemon listens on port 5354 of the logging host for any events sent by the monitors. The name of the host where snipslogd runs is set in the global snips.conf config file.

The snipslogd process is similar to the Unix 'syslog' daemon, and its configuration file allows piping the logged events to any external process. To prevent arbitrary hosts from sending it messages, snipslogd accepts events only from the IP addresses listed in its configuration file.

Since this process can run external programs, it can be used to run the pager notification scripts, log messages to a database, send emails, etc.

It should be noted that an 'event' in snips is generated only when a value crosses a threshold in a polling interval. Hence, normally you will not see any logging activity in snipslogd, but when a device variable changes its state, an event will be logged. This means that a monitor sends an event to snipslogd both when a device goes down (e.g. from info level to warning level) and when it comes back up (e.g. warning level to info level). The loglevel is the worse of the current level and the previous level (hence, when a device goes back from Critical to Info level, the event will be logged at loglevel Critical).
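The loglevel rule can be sketched as a small function. The severity ranking used here, including the 'Error' level between Critical and Warning, is an assumption made for this illustration (the text above names only Info, Warning and Critical), and event_loglevel is a hypothetical helper, not a SNIPS function.

```python
# Severity ranking assumed for this sketch: a lower number is worse.
SEVERITY = {"Critical": 1, "Error": 2, "Warning": 3, "Info": 4}


def event_loglevel(previous, current):
    """Return the level at which a state change is logged, or None.

    No event is generated while the level stays the same; otherwise
    the event is logged at the worse of the previous and current levels.
    """
    if previous == current:
        return None
    return min(previous, current, key=SEVERITY.__getitem__)
```

For example, a device recovering from Critical to Info yields an event logged at Critical, while a device that stays at Info from one poll to the next produces no event at all.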

Messages Directory

Each of the displays has a 'messages' section where the contents of the files in the 'MSGDIR' are displayed. You can create any text file in this directory (preferably one line messages), and these are displayed in the 'Messages' subwindow.

Routine Maintenance

Routine admin tasks in SNIPS consist of ensuring that all the monitors are running (done by running keepalive_monitors.pl from cron), and rotating all the log files maintained by snipslogd (done by running log-maint.pl periodically from crontab). The log-maint.pl script also runs the logstats.pl reporting tool, which mails its report to the OPSMAIL email address. See the file bin/crontab.snips where all these maintenance tasks are listed.


2.  User Interfaces

All the monitors store the current state of the devices in raw data format (in the /usr/local/snips/data directory). There are three different user interfaces to view and interpret this data.

Note that none of these interfaces displays historical data from 'snipslogd'; they all work directly on the data being collected by the monitors, which represents the current state of the network.

snipstv

snipstv (snips TextView) is a non-graphical, text-based 'curses' tool for displaying the raw data being collected by the monitors. Any user on the system where the monitors are running can run this tool. Pressing the 'e' key displays different fields (since not all the fields fit in the limited 80 character display). It is possible to filter events, etc.; enter 'h' to get detailed help on this tool.

SnipsWeb

The Web interface for displaying snips data is divided into three scripts. genweb.cgi reads all the data files and generates HTML with hyperlinks to snipsweb.cgi. snipsweb.cgi in turn invokes rrdgraph.cgi, which generates RRD graphs for the device. All these programs read the etc/snipsweb-cfg.pl configuration file on startup, and this file should be edited to match your site settings.

genweb.cgi can be run periodically from crontab to generate 4 web pages (one for each severity level) or directly as a CGI program. When run as a CGI, it allows sorting, filtering, etc. In CGI mode, the script reads the data and generates HTML in real time, so many users accessing this CGI simultaneously could generate additional load on your server. You should protect this script using standard htaccess style authentication to restrict access to it.

snipsweb.cgi is the complement to genweb.cgi and gives added functionality such as historical graphs, device-specific troubleshooting help, adding notes for an event, hiding a known event, etc. You should definitely protect this script using htaccess web authentication, even though the script also has its own built-in access control as an alternative.

rrdgraph.cgi generates graphs for a device and all its monitored variables. It is invoked by snipsweb.cgi, and restricts access by allowing only the CGIs listed in the @OK_REFERER variable to run this script. This variable is customized in the snipsweb-cfg.pl file mentioned above. rrdgraph.cgi generates the images on the fly, and also caches them on disk (in the rrdimg-cache/ directory) for efficiency.

You should also create an etc/users file which lists the access level of each user (which commands they are allowed to run in snipsweb.cgi). Additionally, you can create help files in the device-help/ directory which are named based on the device and/or variable name. When a snipsweb user clicks on help for a device, the program looks for a matching help file in the following order:

<devicename>:<deviceaddr>
<devicename>:<variable>
<devicename>:<sender>
default

where any of these can be the keyword 'default'.
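The lookup order above can be sketched as follows. This is only an illustration of the ordering, not the code in snipsweb.cgi: the function names are hypothetical, the file names are assumed to be the literal strings shown in the list, and the substitution of the keyword 'default' for an individual component is not modelled.

```python
def help_candidates(devicename, deviceaddr, variable, sender):
    """Return the help file names to try, in the documented order."""
    return [
        f"{devicename}:{deviceaddr}",
        f"{devicename}:{variable}",
        f"{devicename}:{sender}",
        "default",
    ]


def find_help_file(available, devicename, deviceaddr, variable, sender):
    """Return the first candidate present in `available`, else None.

    `available` is the set of file names found in the device-help/
    directory, for example set(os.listdir(help_dir)).
    """
    for name in help_candidates(devicename, deviceaddr, variable, sender):
        if name in available:
            return name
    return None
```

With only 'default' and 'rtr1:ICMP-ping' present, a help request for the ICMP-ping variable of rtr1 would resolve to 'rtr1:ICMP-ping', and a request for any other device would fall back to 'default'.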

All the CGI scripts print error messages on stderr, which get logged in the web server's logfile when running in CGI mode. Look in these log files for errors in case of trouble.

tkSnips

This is a Tcl/Tk based monitor using a client-server design. A simple daemon (called 'ndaemon') runs on the SNIPS monitoring server listening on TCP port 5005, and it periodically sends the raw event data to all connected tksnips clients. The tksnips clients then parse, format and display this raw data. ndaemon has no access control at this time, so it is important to use a firewall to restrict unauthorized access to ndaemon's TCP port.


3.  Notifications & Reports

A very flexible notification script called 'notifier.pl' is provided with SNIPS. Its configuration file lists each type of event and the required action. Currently the possible actions are mail and page. A minimum and maximum age of the event can be defined, indicating that the action (paging or email) should be taken only if the age of the event lies between these two values (in seconds). An option exists to allow 'repeat' notification (once every hour) until the maximum age is exceeded.

This program should be run from crontab every 5 minutes (set the value of $crontime accordingly in the script if it is run at a different interval). This program should also be run from snipslogd.conf, so that it can send a notification as soon as an event occurs. When run from crontab, this program only parses Critical events and events that are down (i.e. it sends no notification when a device comes back up). However, when run from snipslogd, it reads the log lines from stdin, and sends messages both when a device goes down and when it comes back up. The event time is set to -1 second when running as a filter from snipslogd, so the notifier-confg file entry should be set accordingly.
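The age window can be illustrated with a short sketch. It is not the logic of notifier.pl itself: it assumes that a cron-driven run should fire once when an event first falls inside the window and then, if 'repeat' is set, roughly once an hour afterwards, using the cron interval to decide whether the current run is the one that crosses each boundary. The function and parameter names are hypothetical.

```python
REPEAT_INTERVAL = 3600  # "once every hour", expressed in seconds


def should_notify(age, min_age, max_age, crontime=300, repeat=False):
    """Decide whether one notifier run should act on an event.

    age, min_age, max_age and crontime are all in seconds. An event
    outside the [min_age, max_age] window never triggers an action.
    """
    if not (min_age <= age <= max_age):
        return False
    elapsed = age - min_age
    if elapsed < crontime:
        # First run that sees the event inside the window.
        return True
    # Later runs act only when 'repeat' is set and an hour boundary
    # has been crossed since the previous run (an assumption).
    return repeat and elapsed % REPEAT_INTERVAL < crontime
```

Under these assumptions, an entry with a minimum age of -1 matches events piped in from snipslogd (whose event time is -1), while an entry with a minimum age of 300 is first acted upon by the cron run that sees the event at least 5 minutes old.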

It is possible to write additional 'event' driven notification systems using snipslogd. Any event can be piped to an external script by snipslogd, so a page or email can be sent as soon as an event occurs and is logged to snipslogd. As another example, the 'utility/beep_oncall' script uses the sendpage program (available from ftp://ftp.net.ohio-state.edu/pub/pagers). Other (untested) alternatives to sendpage are SNPP and YAPS.

Currently the only reporting tool for historical analysis is 'logstats' which parses the historical snipslogd event logs and generates a summary report. This is run by the 'log-maint' script which in turn is run periodically from Unix cron.


4.  Configurations for Large Networks

Currently SNIPS is being used to monitor networks with close to 2000 devices. The monitors which usually have a large number of devices are:

ippingmon - using ICMP echo messages (typically used for router interfaces)
portmon - for TCP sockets (typically used for web, mail, pop, imap ports)
hostmon - for Unix host performance (disk space, load, memory)
snmpmon - for querying SNMP data

All these monitors except for snmpmon are designed to monitor very large numbers of devices in parallel very efficiently. As an example, ippingmon can monitor 500 devices in a little over 2 minutes, and hostmon can poll 64 ports per minute. However, if the number of devices is still larger, you can split your devices into multiple configuration files and then either use the '-x' flag to a monitor or create a symlink to the monitor to read these alternate config files.

As an example, suppose you divide all your ippingmon devices between two configuration files named ippingmon-A-confg and ippingmon-B-confg. You can then use either of the following methods:

ln -s ippingmon bin/ippingmon-A
ln -s ippingmon bin/ippingmon-B

OR

ippingmon -x A
ippingmon -x B

The monitor will automatically look for and load the respective config files based on its own name (or '-x' extension) by appending '-confg' to it.
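The naming rule can be sketched as below. The function name config_file_name is hypothetical, and the sketch covers only how the file name is derived, not the directory in which a monitor searches for it.

```python
import os


def config_file_name(program_path, extension=None):
    """Derive the config file name a monitor would look for.

    program_path is the name the monitor was invoked as (possibly a
    symlink); extension is the value given to the '-x' flag, if any.
    """
    name = os.path.basename(program_path)
    if extension:
        name = f"{name}-{extension}"
    return f"{name}-confg"
```

Both config_file_name("bin/ippingmon-A") and config_file_name("bin/ippingmon", "A") give 'ippingmon-A-confg', which is why the symlink and the '-x' flag are interchangeable here.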

Remember to update the keepalive_monitors.pl script with these new names or flags if you use either of these methods.


5.  Appendix

Monitors

These are quick reference tables for the various field values in each of the current SNIPS monitors. The 'address' field is typically used by the monitor to query the device. The 'device name' field is usually a 'common' or 'alias' name for the device being monitored. If there is a sub-device or sub-element (such as an interface, file partition or domain name) being monitored, this is prefixed to the device name with a '+' as a separator.

Monitor     Device Name                 Address    Var Name                 Var Values       Var Units
----------  --------------------------  ---------  -----------------------  ---------------  -----------
etherload   interface-type + any name   any name   Bandwidth or PktsPerSec  0-100 or u_long  %age or pps
            (eth0-Ethernet+fileserver)
nsmon       domainName + any name       addr/fqdn  named-status             0 or 1           SOA
ntpmon      any name                    addr/fqdn  ntp                      1 - 16           Stratum
ippingmon   any name                    addr/fqdn  ICMP-ping                0 - 10           Pkts Rcvd
rpcpingmon  any name                    addr/fqdn  Portmapper               0 or 1           Status
portmon     any name                    addr/fqdn  NewsPort, WebPort, etc.  0 or 1           Port
radiusmon   any name                    addr/fqdn  radius                   0 or 1           Status
tpmon       any name                    addr/fqdn  Thruput                  0 - u_long       Kbps
trapmon     any name                    addr/fqdn  trapname                 0                Trap

Perl Monitors:

Monitor     Hostname          Address              Var Name               Var Values   Var Units
----------  ----------------  -------------------  ---------------------  -----------  -------------------------
apcmon      any name          any name             (from config file)     as measured  (from config file)
                                                   online, battery, temp               volts, hertz
armon       zone name         Net number           Reg-ATalkRoute or      0 or 1       Entry
                                                   Unreg_ATalkRoute
bgpmon      Peer Hostname     Peer IP              BGP-routername         0 or 1       State
bpmon       any name          server IP            Bootp_Server           0 or 1       Entry
ciscomon    any name          addr/fqdn            CPUusage, Airflow,     as measured  Percent, Deg, mVolt, etc.
                                                   Inlet, +12V, etc.
hostmon     (from data file)  (from data file)     (from config file)     as measured  (from data file)
            Subdevice+addr    Device addr/fqdn     NFStimeout, Diskfree%               MB, drops
novellmon   Server Name       Service Type         IPX_Server             0 or 1       Entry
nrmon       Next Hop          Network              Reg_NovellRoute or     0 or 1       Entry
                                                   Unreg_NovellRoute
smbmon      Server + service  address              SMBserver              0 or 1       Entry
snmpmon     (from data file)  (from data file)     (from config file)     measured     (from data file)
            Subdevice+addr    device address/fqdn  ifErrors, PktRate                   Mbps
sqlmon      any name          any name             SQLserver              0 or 1       Status
syslogmon   Hostname          reg expr from        (from config file)     as measured  LogMest
                              config file          DiskErr, MemParity
upsmon      any name          any name             AC_Power               0 or 1       Avail

Vikas Aggarwal