Hints to System Designers from Ops

Draft 0.2
October 5, 1999

Note: The following are notes I have made to myself, and they are not even a raw draft. A lot of work is needed before this page would be ready for prime time. I have abandoned working on this page. I am starting on Hints for Operating High Quality Services, which will reuse much of this content.

Learn from the Past

"Motherhood" principles tend not to be followed, or correctly applied.
- Peter Neumann

There has been a lot of good work in the systems space. Anyone who is designing a system should be familiar with landmark systems papers and books. If you are starting a major project and you are not familiar with what has been published in that space, do a literature search. See what other folks have thought about. Too often people go off to solve problems which have been solved before, or worse, naively think they are going to solve problems which are quite a bit more difficult than they might seem on the surface. I have been amazed at the number of people who think building automatic diagnosis and repair systems is easy, or that they can solve the halting problem.

Caveat: There are a number of academic disciplines which study systems. Likewise, there are a large number of industries which build systems. Unfortunately, the vocabularies used by the different communities vary. This makes it difficult for one community to learn from another community's experience, to recognize common insights, and to recognize what unique discoveries have been made in other disciplines. What has been clearly understood by one discipline for ten years might be a new discovery to someone in a related discipline.

Cross Functional Teams Are Key!

The most effective large scale services are driven by cross-functional teams. The team tends to include people from marketing, engineering, QA, and operations working together to create and review specifications and set priorities for medium term planning.

There needs to be a particularly tight connection between the people developing the system and the people who are going to be running that system day-to-day. I have found that developers are bad at predicting what operations folks are likely to do. They also don't know all the tradeoffs. Work closely with operations folks to define what is best done in code, and what is best done through operational practices. One approach is for the operations group to document its requirements and acceptance criteria; the developers build something that meets the documented requirements, and the operations group verifies that it passes the acceptance criteria. In practice this doesn't work very well, because getting specifications fully correct is very difficult. Much more effective is to have regular communication between the developers and operational staff. Before significant code is written there should be an architecture and design review which has representatives from both the engineering and operational teams.

Architecture & Design

Insert text about what an architect does

Relationships among the subsystems are what give systems their added value.

Good interfaces are key to effective architectures

How people can stand on your shoulders
Consistency is good
Hide the hard work so everyone can use it. For example, threads are hard. Hide threads in higher level constructs which people use.
Interfaces are only as good as the documentation - Build components to reveal their interfaces by introspection or by some sort of system which enforces documentation.
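
One possible reading of the introspection hint, as a minimal sketch: Python's standard inspect module can list a component's public methods with their signatures and docstrings, and a check can flag anything left undocumented. The QueueService class and both helper functions are hypothetical names made up for illustration.

```python
import inspect


class QueueService:
    """Hypothetical component used only for illustration."""

    def enqueue(self, item: str) -> None:
        """Add an item to the end of the queue."""

    def size(self) -> int:
        """Return the number of queued items."""
        return 0

    def purge(self):  # deliberately left undocumented
        pass


def describe_interface(cls) -> dict:
    """Map each public method name to its (signature, docstring) pair."""
    interface = {}
    for name, member in inspect.getmembers(cls, inspect.isfunction):
        if name.startswith("_"):
            continue  # private helpers are not part of the interface
        interface[name] = (str(inspect.signature(member)), inspect.getdoc(member))
    return interface


def undocumented(cls) -> list:
    """Return the public methods that have no docstring, sorted by name."""
    return sorted(name for name, (_, doc) in describe_interface(cls).items() if not doc)
```

A build or review step could refuse any component for which undocumented() returns a non-empty list, which is one way of having the system enforce documentation.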

Partition the system into scalable modules

Choose subsystems so that they are as independent as possible
Keep complexity and/or high rates of information exchange inside subsystems
State exists inside subsystems -- explicitly track it.
Fault isolate - protect subsystem, isolate unsafe subsystem
Enable side by side versioning if not simultaneous running of services, e.g. don't use globals

Design a management API

Everything is done through a well designed API.
Have your GUI use the API
Make a command line UI as well which is scripting friendly
Long running servers should have a control / status port
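
A minimal sketch of the shape this suggests, assuming a made-up ServiceManager API and a made-up svcctl command name: the command line front end only parses arguments and calls the API, and it emits JSON so scripts can consume the result. A GUI would call the same three methods.

```python
import argparse
import json


class ServiceManager:
    """Hypothetical management API; every front end calls these methods."""

    def __init__(self):
        self._running = False

    def start(self) -> dict:
        self._running = True
        return self.status()

    def stop(self) -> dict:
        self._running = False
        return self.status()

    def status(self) -> dict:
        return {"running": self._running}


def run_cli(manager: ServiceManager, argv: list) -> str:
    """Thin command line wrapper: parse, call the API, return JSON text."""
    parser = argparse.ArgumentParser(prog="svcctl")
    parser.add_argument("command", choices=["start", "stop", "status"])
    args = parser.parse_args(argv)
    # The command names map one-to-one onto API methods, so the CLI
    # contains no logic of its own.
    result = getattr(manager, args.command)()
    return json.dumps(result, sort_keys=True)
```

Because the CLI holds no logic of its own, anything the GUI can do is also scriptable, and vice versa.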

Address security at the beginning - it's very expensive after the fact. You must address:

trusted source
trusted server
trustworthy communications path

Think end-to-end

Design
Monitoring
Performance
Technology (make use of modern network fabric...)

Simple is Best

Small is beautiful, KISS, Occam's razor. Unfortunately this is not truly possible, but virtual simplicity can be had by using a modular / abstraction / information hiding design.
Build tools, not monolithic systems. It is easier to update small tools, and it is also possible to tie tools together.

Other issues

Make easy things easy, and hard things possible
Don't hide the power
Push to the edge
Translate unknown failures into known failures
Design for the future, implement to the minimum requirements
system should continue to operate under a certain set of failure conditions
Know the upper bound of capacity that a particular unit is required to support

Expect Change

Fred Brooks in The Mythical Man Month has suggested that you should build a prototype system, and then build the real system. While in theory this is a great idea, time-to-market pressures will almost always make this impossible. If you believe that your system really needs to be prototyped, and then built from scratch, build the prototype on a platform, or in a language or environment, which can't be used for the final product. Otherwise, there will be too great a temptation to reuse the prototype code rather than learning the lessons.

In most cases, you won't have the luxury of building a prototype system, and then the real system. Therefore, it is critical to design for evolution. Your first system will almost always be wrong; hopefully you can fix it.

If you have a rapid release cycle, don't try to get it right the first time. Implement your best guess and then learn from the experience. In the next release you can replace what was a bad idea, and improve what was good. Also go after "low hanging" fruit.

Do not put knowledge into the program. Write little languages, or use extensive configuration. Both can kill you if they are too complex or inconsistent.
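
As a minimal sketch of keeping knowledge out of the program: the alarm thresholds below live in a small configuration table rather than in code, and the loader rejects inconsistent entries so the configuration cannot quietly become the problem. The metric names, the JSON layout, and the function names are all assumptions made up for this example.

```python
import json

# Assumed configuration format; in practice this would live in a file
# that operations staff can edit without touching the program.
CONFIG_TEXT = """
{
  "disk_full_percent": {"warn": 80, "critical": 95},
  "queue_depth": {"warn": 1000, "critical": 5000}
}
"""


def load_thresholds(text: str) -> dict:
    """Parse and validate the threshold table, failing loudly on bad input."""
    thresholds = json.loads(text)
    for metric, levels in thresholds.items():
        if set(levels) != {"warn", "critical"}:
            raise ValueError(f"{metric}: expected exactly 'warn' and 'critical'")
        if levels["warn"] >= levels["critical"]:
            raise ValueError(f"{metric}: warn must be below critical")
    return thresholds


def classify(thresholds: dict, metric: str, value: float) -> str:
    """Return 'ok', 'warn', or 'critical' for a measured value."""
    levels = thresholds[metric]
    if value >= levels["critical"]:
        return "critical"
    if value >= levels["warn"]:
        return "warn"
    return "ok"
```

Adding a metric or changing a limit is then a configuration change, not a code release.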

Avoid bulk updates

Renumbering Happens, Mergers Happen

Users Will Want Something You Didn't Expect

Blessed are the Paranoid

Murphy was an optimist.

Most of the time you can't use the law of large numbers; you have to use the law of medium numbers: e.g. you will see all exceptions

Perfect Storm / Titanic Coincidence: Most accidents in well-planned systems involve two or more events of low probability occurring in the worst possible combination.

Make sure your testing is valid. Don't put it into place without testing it in the environment you are going to run it in. Beware of code which has special cases for testing. <Hint by mail>

Dependencies Will Get You

Don't trust the data you are handed
Test all failure conditions

If you must fail, do it quickly. Come back quickly.

manage external dependencies - Give the best possible degraded service
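
A minimal sketch of one way to combine "fail quickly" with "best possible degraded service": wrap the call to an external dependency, remember the last good answer, and after a failure skip the dependency entirely until a retry time has passed. DegradingClient, FlakyBackend, and the retry_after parameter are hypothetical names, and the wrapped callable is assumed to enforce its own timeout.

```python
import time


class DegradingClient:
    """Wrap a dependency call; serve the last good answer when it fails."""

    def __init__(self, fetch, retry_after: float = 30.0, clock=time.monotonic):
        self._fetch = fetch              # callable that may raise
        self._retry_after = retry_after  # seconds to leave the dependency alone
        self._clock = clock
        self._last_good = None
        self._down_until = 0.0

    def get(self, default=None):
        """Return (value, degraded); degraded is True for fallback data."""
        now = self._clock()
        if now >= self._down_until:
            try:
                self._last_good = self._fetch()
                return self._last_good, False
            except Exception:
                # Fail quickly: do not call the dependency again until
                # the retry time has passed.
                self._down_until = now + self._retry_after
        fallback = self._last_good if self._last_good is not None else default
        return fallback, True


class FlakyBackend:
    """Hypothetical dependency used to exercise the wrapper."""

    def __init__(self):
        self.up = True
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if not self.up:
            raise ConnectionError("backend unavailable")
        return "fresh"
```

The caller always gets an answer plus a flag saying whether it is stale, so the degraded state can be shown to the user or logged rather than hidden.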

Make things atomic

Be self healing... but only if the recovery process is deterministic

Leave a path home

Route around failure

Don't Shoot Yourself in the Foot

History and programming styles - buffer overflows, type mismatches.

Role of language design, programming discipline, and good architecture

Always build a test suite so you can do regression testing in the future. Make it possible for your tests to be used operationally in your production environment.

Provide the justifications, so that people can reuse the concepts, if not the code.

NP completeness - regex performance graph of group matches

Provide a way back.

Don't remove things: don't replace vendor code unless you can't help it, save backup copies of things before replacing them
Ehrlich's Rule: The first rule of intelligent tinkering is to save all the parts.
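
A minimal sketch of saving the parts before tinkering, assuming a ".orig" suffix convention and function names made up for this example. The first saved copy is never overwritten, so the vendor original survives repeated edits, and the new content is written to a temporary file and renamed into place so a crash never leaves a half-written file.

```python
import os
import shutil
import tempfile


def replace_with_backup(path: str, new_content: str, backup_suffix: str = ".orig") -> str:
    """Replace a file's content, keeping a copy of the previous version.

    Returns the backup path. An existing backup is never overwritten,
    so the first saved copy (for example the vendor original) survives.
    """
    backup_path = path + backup_suffix
    if os.path.exists(path) and not os.path.exists(backup_path):
        shutil.copy2(path, backup_path)  # copy2 also keeps timestamps and mode

    # Write to a temporary file in the same directory, then rename it
    # over the target so readers never see a partial file.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        tmp.write(new_content)
    os.replace(tmp_path, path)
    return backup_path


def restore_backup(path: str, backup_suffix: str = ".orig") -> None:
    """The way back: put the saved copy in place again."""
    shutil.copy2(path + backup_suffix, path)
```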

Know Your Customer

Services often need consistency rather than additional features

Don't guess what the users want; find out. If possible, watch what they do, not what they say.

There's more than one right way to do it: Some arguments are religious, and it doesn't matter which side you're on. Sites differ; what's right for one site may be stupid at another.

Look for price discontinuities

If you can't solve the problem, change it <Unistrokes>

Real Data

Metrics - Real data for analysis

Repeatable, unbiased testing (scientific method - lab testing) Compare apples to apples

Don't believe the documentation; check the code, stub it out and see if it works.

Don't believe benchmarks; test things yourself with your load. Don't assume that what people tell you is true for your machines. (It may not be true at all.)

incremental release

Common Building Blocks

Most web oriented infrastructures use:

stateless front ends
partitioning around back-ends (especially state stores)
state stores: often replicated data stores with single keys which aren't fully ACID
price / performance / manageability sweet points
service unit -vs- n-tier model
Almost everything has a database as the critical core - that's why DBAs are always grumpy. Better think about how you are going to scale/distribute your database.
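
A minimal sketch of partitioning a state store on a single key, assuming a fixed list of back-end names made up for this example. Every stateless front end computes the same mapping, so none of them needs to remember where a key lives.

```python
import hashlib


def partition_for(key: str, partitions: list) -> str:
    """Pick the back-end partition that owns a key.

    A stable hash (not Python's per-process hash()) keeps the mapping
    identical on every front end and across restarts.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(partitions)
    return partitions[index]
```

Note that plain modulo hashing remaps most keys when the number of partitions changes, so this is only a starting point; a scheme such as consistent hashing or a lookup directory would be needed to grow the back-end without moving most of the data.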

To keep all of this going there are many things you will need to address. You might want to take a look at Four Star Network Management.

Event Logging Systems

Effective system monitoring is a must. It must be possible to communicate system state to external agents. These agents might be software or human beings. As events happen, it must be possible to take an automatic action (invoke a script), set an alarm, forward the event to another system, and/or save the data for future trend analysis.

The event logging system should run on a streaming model. Don't try to make batch oriented harvesting work. Stream the data up to an aggregating host. Make it easy for the aggregating host to forward the data on as well.

The logging system should support a rich variety of information sources and values. Most likely each event should have an id, and a collection of tags and corresponding values.

The system needs low overhead with fast propagation of information, or it won't be used.

Take action based on the tags and values of an event

rewrite events
discard events. A counter should be kept for each matching rule. It should be possible to set threshold alarms if the counter increases too quickly.
generate events if a counter exceeds a threshold, if heart beats don't appear from a system, or if an event's content matches a rule set.
commit an event to a standard database
execute a procedure if an event matches a rule. It should be possible to use the contents of the log event as parameters to the script.
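
The actions above can be sketched as a small rule engine. This is only an illustration under stated assumptions: the Event, Rule, and RuleEngine names are made up, an in-memory list stands in for the event database, and a simple count threshold stands in for a real "increasing too quickly" rate alarm. Heart beat detection is left out.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Event:
    event_id: str
    tags: dict = field(default_factory=dict)  # tag name -> value


@dataclass
class Rule:
    name: str
    match: dict                # every listed tag must have this value
    action: str                # "discard", "rewrite", or "execute"
    rewrite: dict = field(default_factory=dict)  # tags to set on rewrite
    procedure: object = None   # callable(event), used by "execute"


class RuleEngine:
    def __init__(self, rules, discard_alarm_threshold=3):
        self.rules = rules
        self.counters = Counter()  # one counter per matching rule
        self.discard_alarm_threshold = discard_alarm_threshold
        self.committed = []        # stands in for the event database
        self.alarms = []

    def process(self, event):
        for rule in self.rules:
            if any(event.tags.get(k) != v for k, v in rule.match.items()):
                continue
            self.counters[rule.name] += 1
            if rule.action == "discard":
                # Raise the alarm once, when the counter reaches the threshold.
                if self.counters[rule.name] == self.discard_alarm_threshold:
                    self.alarms.append(f"rule {rule.name} matched "
                                       f"{self.discard_alarm_threshold} times")
                return             # discarded events are never committed
            if rule.action == "rewrite":
                event.tags.update(rule.rewrite)
            elif rule.action == "execute":
                rule.procedure(event)  # the event supplies the parameters
        self.committed.append(event)
```

Rules are applied in order, so a rewrite early in the list can add tags that a later rule matches on.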

Health Monitoring - Detect Breakage

Plan for test and validation whenever building a new system.

Error reporting: Make sure that all errors / warnings / etc are logged, and those logs are processed.

Basic "alive" testing: connect to the machine or service and verify that you get a basic response
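
A minimal sketch of the simplest possible "alive" test, assuming the service listens on a TCP port. The is_alive name and the default timeout are made up for this example; a real check would usually also read a banner or send a trivial request.

```python
import socket


def is_alive(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name lookup failures.
        return False
```

This only proves that something is accepting connections, which is why the heavier end-to-end test is still needed.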

Heavy-weight end-to-end functional testing: You should be able to run a test which validates that the service / machine is actually able to fully perform whatever function you ask of it.

Passive monitoring: services should be built to log the work they are performing. Make sure that the logs show the service is actually getting work done. For example, check that the web server logs are growing at the expected rate.
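
A minimal sketch of the log growth check, assuming a made-up LogGrowthMonitor class that is called on a fixed schedule and a made-up minimum number of bytes expected between checks.

```python
import os


class LogGrowthMonitor:
    """Flag a service whose log file has stopped growing."""

    def __init__(self, path: str, min_bytes_per_check: int = 1):
        self.path = path
        self.min_bytes_per_check = min_bytes_per_check
        self._last_size = os.path.getsize(path)

    def check(self) -> bool:
        """Return True if the log grew by at least the expected amount
        since the previous check."""
        size = os.path.getsize(self.path)
        grew = size - self._last_size
        self._last_size = size
        # A shrinking file usually means rotation; treat it as healthy
        # and start measuring from the new size.
        if grew < 0:
            return True
        return grew >= self.min_bytes_per_check
```

The expected rate depends on the time of day and the traffic mix, so a fixed minimum like this one is only a first approximation.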

Make sure you aren't dependent on something you are monitoring

Alert Management System - Manage Problems

Visualization is important

Workflow & Trouble Ticketing System - Remember what you're doing

Automatically generate tickets

Automatically resolve issues, and close tickets

Configuration Management

Change one place... result everywhere

side by side versioning

encapsulate

provide a way back

Base level software distribution could be part of configuration management or separately addressed.

Aggregate command and control

Scriptable or command line

Every large system I have dealt with has needed batch jobs; don't forget to cover this.