Core Operational Elements

Part of Hints for Operating High Quality Services by Mark Verber
Very Very Early Draft 0.2 -- November 11, 2005

[Note: Each core element will be split into its own document soon.]

Configuration Management

There is now a separate document about Configuration Management.

Software Distribution

Software distribution should be completely integrated with the configuration management system. There should be a single source of information which is used by the software distribution system. Historically, software distribution has used a machine (or possibly several machines holding mirrors) which holds the "gold" copies. From the gold master machines you would push software to new machines, or sometimes the new machines would pull the software.

If you are going to need to regularly update many machines at once, you might want to consider using a multicast distribution protocol which uses forward error correction to minimize the load on your scarce resources (the gold master's CPU and network IO). Another innovative approach would be to use a P2P protocol like BitTorrent: since you have lots of machines running identical software, you should be able to make use of those machines to share the workload.
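As a rough illustration of the pull model, here is a minimal sketch of a client pulling a software tree from a gold master and verifying it against a published checksum manifest. The host name, paths, and MANIFEST format are assumptions for the example, not part of any particular system; rsync is used because it makes repeated pulls incremental.

    #!/usr/bin/env python
    """Minimal sketch of a client pulling software from a gold master.

    Assumptions (not from the text above): the gold master is reachable
    over ssh as 'gold.example.com', exports /dist/myservice via rsync,
    and publishes a MANIFEST of "sha256-hex  relative/path" lines."""
    import hashlib
    import subprocess
    import sys
    from pathlib import Path

    GOLD_MASTER = "gold.example.com"      # hypothetical gold master host
    REMOTE_TREE = "/dist/myservice/"      # hypothetical software tree
    LOCAL_TREE = Path("/opt/myservice")

    def pull_tree():
        # rsync transfers only what changed, so repeated pulls are cheap
        subprocess.run(
            ["rsync", "-a", "--delete",
             f"{GOLD_MASTER}:{REMOTE_TREE}", str(LOCAL_TREE) + "/"],
            check=True)

    def verify_tree():
        # compare every file against the checksums published by the master
        manifest = LOCAL_TREE / "MANIFEST"
        for line in manifest.read_text().splitlines():
            want, rel = line.split(None, 1)
            digest = hashlib.sha256((LOCAL_TREE / rel).read_bytes()).hexdigest()
            if digest != want:
                sys.exit(f"checksum mismatch: {rel}")

    if __name__ == "__main__":
        pull_tree()
        verify_tree()
        print("software tree is consistent with the gold master")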

From iron to basic OS

Managing software on top of OS

Managing data (or content) for applications

Patching - hard to do well

Alert Management System - Manage Problems

Visualization is important

Event System to Alerting System Gateway - The event system will be a firehose of information.  It is critical to extract the information which can be useful for monitoring and alarming.  We should have a tool which consumes the event log and possibly other data sources.  This tool would be configured to look for events which match a specification.  For events which contain performance data, the tool would save a copy of the data, along with a timestamp, into a buffer.  The tool will also have a primitive to count the number of events seen (think of this as a heartbeat monitor) and record a timestamp for the last event seen.  This tool should be able to raise an alert if performance data moves out of a range, if the count moves out of a range, or if we fail to see events.  Some alert management systems integrate health monitoring and/or trending systems.
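A bare-bones sketch of such a gateway rule might look like the following. The EventSpec class, the alert() stub, and the example log format are all illustrative rather than drawn from any particular product; the point is the three checks described above: performance data out of range, event count out of range, and silence.

    import re
    import time
    from collections import deque

    def alert(msg):
        # stand-in for whatever pages a human or opens a ticket
        print("ALERT:", msg)

    def out_of_range(value, bounds):
        return bounds is not None and not (bounds[0] <= value <= bounds[1])

    class EventSpec:
        """One gateway rule: which events to match and what looks healthy."""
        def __init__(self, name, pattern, value_range=None,
                     count_range=None, max_silence=None):
            self.name = name
            self.pattern = re.compile(pattern)   # events to look for
            self.value_range = value_range       # (low, high) for perf data
            self.count_range = count_range       # (low, high) events per period
            self.max_silence = max_silence       # seconds we tolerate silence
            self.samples = deque(maxlen=1000)    # (timestamp, value) buffer
            self.count = 0
            self.last_seen = time.time()

        def consume(self, line):
            m = self.pattern.search(line)
            if not m:
                return
            self.count += 1
            self.last_seen = time.time()
            if m.groups():                       # event carries performance data
                value = float(m.group(1))
                self.samples.append((self.last_seen, value))
                if out_of_range(value, self.value_range):
                    alert(f"{self.name}: value {value} outside {self.value_range}")

        def check(self):
            # called periodically: heartbeat and event-rate checks
            if self.max_silence and time.time() - self.last_seen > self.max_silence:
                alert(f"{self.name}: no matching events for {self.max_silence}s")
            if out_of_range(self.count, self.count_range):
                alert(f"{self.name}: {self.count} events, expected {self.count_range}")
            self.count = 0

    # Example rule (illustrative log format): latency samples logged as
    # "req done in 0.123s" should stay under half a second, and we expect
    # to see at least one matching event per minute.
    latency = EventSpec("frontend latency", r"req done in ([0-9.]+)s",
                        value_range=(0.0, 0.5), max_silence=60)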

Many people expect alerting systems to draw cute graphs (a la HP OpenView).  In most cases this is management eye candy which is not that useful.

Example alert management systems include

Health Monitoring - Detect Breakage

Plan for test and validation whenever building a new system.  Effective monitoring helps you identify what is broken and, almost as important, what isn't broken, so you know to look elsewhere for a problem.

Remember: make sure you aren't dependent on something you are monitoring.

If you run a lights-out / NOC-less operation, it is good to run a "parole" process which runs completely separate from your production infrastructure. The parole process's sole purpose is to listen for periodic heartbeats from your monitoring infrastructure. Failure to hear a heartbeat results in a notification that the monitoring system (or maybe all of production) is down.
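A minimal sketch of such a parole process, assuming the monitoring system sends it a UDP datagram as a heartbeat every minute or so (the port, the interval, and the notify() stub are all illustrative):

    import socket
    import time

    HEARTBEAT_PORT = 5140    # illustrative port the monitoring system sends to
    MAX_SILENCE = 180        # alarm after roughly three missed heartbeats

    def notify(msg):
        # stand-in for paging a human over a path that does not
        # depend on the production infrastructure
        print("PAGE:", msg)

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", HEARTBEAT_PORT))
    sock.settimeout(5)

    last_heard = time.time()
    while True:
        try:
            sock.recv(1024)              # any datagram counts as a heartbeat
            last_heard = time.time()
        except socket.timeout:
            pass
        if time.time() - last_heard > MAX_SILENCE:
            notify("monitoring heartbeat missing; monitoring (or production) may be down")
            last_heard = time.time()     # avoid paging again every few seconds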

Error reporting:  Make sure that all errors / warnings / etc. are logged or sent as traps, and that those traps/logs are processed.

Basic "alive" testing: connect to the machine or service and verify that you get a basic response (a minimal sketch appears below).

Heavy-weight end-to-end functional testing: You should be able to run a test which validates that the service / machine is actually able to fully perform whatever function you ask of it.

Passive monitoring: services should be built which log the work they are performing.  Make sure that you can see the service actually getting work done; for example, that the web server logs are growing at an expected rate. One great example of this aspect of monitoring was described in the paper Performance Assertion Checking.
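As a sketch of the basic "alive" test mentioned above (the host, port, and URL are illustrative, and the HTTP check assumes the service speaks HTTP), something this small is often enough to distinguish "up and answering" from "dead":

    import socket
    import urllib.request

    def tcp_alive(host, port, timeout=5):
        """Can we even open a connection to the service's port?"""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def http_alive(url, timeout=5):
        """Does the service give a sane response, not just an open port?"""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except OSError:
            return False

    if __name__ == "__main__":
        print(tcp_alive("www.example.com", 80))
        print(http_alive("http://www.example.com/"))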

Examples include:

Trending System - Prevent problems by watching for trends

You should collect data for every resource which is consumed, along with any "interesting" counters which will give you insight into the behavior of the system you are managing. You want to be able to see these values in real time while debugging, as well as in time-series graphs to see current behavior in historical context.

One of the most important aspects of graphs is to clearly show how consumed resources compare to available resources. A common way to display this sort of information is for graphs to show the percent of a resource consumed. Another method is to have the top of the axis be the maximum value for that resource.

If you are using load sharing with failover, make sure you can run the service with a machine missing.  So if you have a pair of machines, you should be concerned if you see any resource consumed more than 40-50%, depending on how linearly resource consumption scales with the amount of work and whether there is a knee in the performance curve.
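As a back-of-the-envelope check, assuming load redistributes evenly when a machine fails and resource use scales linearly with work (both assumptions), the per-machine ceiling is the single-machine ceiling scaled by (N-1)/N; with the conventional 80% single-machine ceiling, a pair of machines should stay under 40%:

    def safe_utilization(n_machines, single_machine_ceiling=0.80):
        """Rough per-machine utilization ceiling so the service survives
        losing one machine; assumes even load redistribution and linear
        scaling of resource use with work."""
        return single_machine_ceiling * (n_machines - 1) / n_machines

    print(safe_utilization(2))   # 0.40 -- the 40-50% rule of thumb above
    print(safe_utilization(4))   # 0.60 -- more machines leave more headroom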

It is important to notice when trended data is exceeding a "safe" or "expected" level. People often assign thresholds which result in triggers or messages when a threshold has been exceeded. Sometimes these thresholds are expressed as fixed values related to the data. For example, conventional wisdom is that exceeding 80% utilization of any resource can result in unpredictable behavior, so 80% of the maximum value would be a good threshold. Another approach is to use some sort of statistics-based check which notices changes from historical trends. Statistical analysis can be quite tricky.
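Both styles of check are easy to sketch. The following compares a sample against a fixed 80% threshold and against a crude historical band of mean plus a few standard deviations; the minimum history length and the three-sigma band are illustrative choices, and a real statistical check needs considerably more care (seasonality, trends, and so on):

    from statistics import mean, stdev

    def fixed_threshold_breach(sample, threshold=0.80):
        """Conventional-wisdom check: flag anything past 80% utilization."""
        return sample > threshold

    def historical_breach(history, sample, sigmas=3.0):
        """Crude statistical check: flag samples well outside the recent trend."""
        if len(history) < 10:          # not enough history to say anything
            return False
        mu, sd = mean(history), stdev(history)
        return abs(sample - mu) > sigmas * sd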

It is also important to track trends in derived data. For example, it is very important to track transaction rate as compared to resource consumption. This is one of the best tools for detecting performance problems, as well as hard-to-diagnose systemic problems in a complex system.

You should characterize services in terms of their use of CPU, disk IO, network IO, and memory per transaction or some other unit of work.  You should profile your service each time you release a new version to ensure that you understand resources consumed per unit of work.

Once you have this data you can have a trending system which warns you if you are consuming more resources than you should for the work being performed.
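A sketch of that kind of derived-data check, assuming you have counters for CPU seconds and transactions over some interval and a per-transaction budget from profiling the current release (the 25% slack is an illustrative choice):

    def cpu_per_transaction(cpu_seconds, transactions):
        """Derived metric: resource consumed per unit of work."""
        if transactions == 0:
            return 0.0
        return cpu_seconds / transactions

    def over_budget(cpu_seconds, transactions, budget, slack=1.25):
        """Warn if we are spending noticeably more CPU per transaction
        than the profile of this release says we should."""
        return cpu_per_transaction(cpu_seconds, transactions) > budget * slack

    # Example: profiling said 12 ms of CPU per transaction; this interval we
    # used 950 CPU seconds for 50,000 transactions (19 ms each), so warn.
    print(over_budget(950.0, 50_000, budget=0.012))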

Tools which provide monitoring functionality include

Workflow & Trouble Ticketing System

Automatically generate tickets

Automatically resolve issues, and close tickets

Need to remember what you are doing

Aggregate command and control

Scriptable or command line

Every large system I have dealt with has needed batch jobs. [Don't forget to cover this.]

Ad Hoc Change Tool - A tool which lets the operations team run arbitrary commands on the machines deployed in the service (aka netexec at WebTV).  This tool should understand the configuration system so that it is possible to specify, in a declarative fashion, which machines the work should be performed on.
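A stripped-down sketch of such a tool follows. The inventory file format, the role/datacenter selectors, and the use of ssh are all assumptions for the example; netexec itself was an internal WebTV tool and is not being described here.

    #!/usr/bin/env python
    """Run a command on every machine matching a declarative selector.

    Assumed inventory format (illustrative): one "hostname role datacenter"
    entry per line in /etc/service/hosts, kept by the configuration system."""
    import subprocess
    import sys

    INVENTORY = "/etc/service/hosts"     # hypothetical configuration source

    def select_hosts(role=None, datacenter=None):
        hosts = []
        for line in open(INVENTORY):
            name, host_role, host_dc = line.split()
            if role and host_role != role:
                continue
            if datacenter and host_dc != datacenter:
                continue
            hosts.append(name)
        return hosts

    def run_everywhere(hosts, command):
        for host in hosts:
            # ssh in batch mode so a dead machine fails fast instead of hanging
            result = subprocess.run(
                ["ssh", "-o", "BatchMode=yes", host, command],
                capture_output=True, text=True)
            print(f"=== {host} (exit {result.returncode}) ===")
            print(result.stdout, end="")

    if __name__ == "__main__":
        # e.g.: adhoc.py frontend "uptime"
        run_everywhere(select_hosts(role=sys.argv[1]), sys.argv[2])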

References

Four Star Network Management – Jeff Allen, David Williamson.  Slides from Invited Talk at USENIX LISA

Applying Logic Programming to Convergent System Management Processes.

Path-Based Failure and Evolution Management

Mark Verber's List of Tools for System Management

Building a Network Monitoring System

Bootstrapping Infrastructure