Hints to System Designers from Ops

Mark Verber

Draft 0.2
October 5, 1999

Note: The following are notes I have made to myself, which are not even a rough draft. A lot of work is needed before this page would be ready for prime time. I have abandoned working on this page. I am starting on Hints for Operating High Quality Services, which will reuse much of this content.

Learn from the Past

"Motherhood" principles tend not to be followed, or correctly applied.
- Peter Neumann

There has been a lot of good work in the systems space. Anyone who is designing a system should be familiar with landmark systems papers and books. If you are starting a major project and you are not familiar with what has been published in that space, do a literature search. See what other folks have thought about. Too often people go off to solve problems which have been solved before, or worse, naively think they are going to solve problems which are quite a bit more difficult than they might seem on the surface. I have been amazed at the number of people who think building automatic diagnosis and repair systems is easy, or that they can solve the halting problem.

Caveat: There are a number of academic disciplines which study systems. Likewise, there are a large number of industries which build systems. Unfortunately, the vocabularies used by the different communities vary. This makes it difficult for one community to learn from another community's experience, to recognize common insights, and to recognize what unique discoveries have been made in other disciplines. What has been clearly understood by one discipline for ten years might be a new discovery to someone in a related discipline.

Cross Functional Teams Are Key!

The most effective large scale services are driven by cross-functional teams. The team tends to include people from marketing, engineering, QA, and operations working together to create and review specifications and set priorities for medium term planning.

There needs to be a particularly tight connection between the people developing the system and the people who are going to be running that system day-to-day. I have found that developers are bad at predicting what operations folks are likely to do. They also don't know all the tradeoffs. Work closely with operational folks to define what is best done in code, and what is best done through operational practices. One approach is for the operations group to document their requirements and acceptance criteria; the developers build something that meets the documented requirements, and the operations group verifies that it passes the acceptance criteria. In practice this doesn't work very well, because getting specifications fully correct is very difficult. Much more effective is to have regular communication between the developers and operational staff. Before significant code is written there should be an architecture and design review which has representatives from both the engineering and operational teams.

Architecture & Design

Insert text about what an architect does

Relationships among the subsystems are what give systems their added value.

Good interfaces are key to effective architectures

Partition the system into scalable modules

Design a management API

Address security at the beginning - it's very expensive after the fact.  You must address:

Think end-to-end

Simple is Best

Other issues

Expect Change

Fred Brooks in The Mythical Man-Month has suggested that you should build a prototype system, and then build the real system. While in theory this is a great idea, time-to-market pressures will almost always make this impossible. If you believe that your system really needs to be prototyped, and then built from scratch, build the prototype on a platform, in a language or environment which can't be used for the final product. Otherwise, there will be too great a temptation to reuse the prototype code rather than learning the lessons.

In most cases, you won't have the luxury of building a prototype system, and then the real system. Therefore, it is critical to design for evolution. Your first system will almost always be wrong; hopefully you can fix it.

If you have a rapid release cycle, don't try to get it right the first time. Implement your best guess and then learn from the experience. In the next release you can replace what was a bad idea, and improve what was good. Also go after "low hanging" fruit.

Do not put knowledge into the program. Write little languages, or use extensive configuration.  Both can kill you if they are too complex or inconsistent.

Avoid bulk updates

Renumbering Happens, Mergers Happen

Users Will Want Something You Didn't Expect

Blessed are the Paranoid

Murphy was an optimist.

Most of the time you can't use the law of large numbers; you have to use the law of medium numbers, e.g. you will see all exceptions.

Perfect Storm / Titanic Coincidence: Most accidents in well-planned systems involve two or more events of low probability occurring in the worst possible combination.

Make sure your testing is valid. Don't put a system into place without testing it in the environment you are going to run it in. Beware of code which has special cases for testing. <Hint by mail>

Dependencies Will Get You

If you must fail, do it quickly. Come back quickly.

Manage external dependencies: give the best possible degraded service

Make things atomic

Be self healing... but only if the recovery process is deterministic

Leave a path home

Route around failure
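The advice above about failing quickly and giving the best possible degraded service can be sketched roughly as follows. This is only an illustration, not a recommended implementation: the function name call_with_fallback, the timeout_s parameter, and the idea of returning a cached value as the degraded answer are all assumptions made for the example.

```python
import concurrent.futures


def call_with_fallback(fetch, fallback, timeout_s=0.5):
    """Call fetch() with a deadline.

    Returns (value, degraded). If fetch() raises or does not finish
    within timeout_s seconds, the fallback value is returned and
    degraded is True, so the caller fails fast instead of hanging.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fetch)
    try:
        return future.result(timeout=timeout_s), False
    except Exception:
        # Covers both a timeout and an error raised by fetch().
        return fallback, True
    finally:
        # Do not wait for a slow call; note that the worker thread is
        # not killed, it simply finishes in the background.
        pool.shutdown(wait=False)


if __name__ == "__main__":
    def broken_dependency():
        raise ConnectionError("dependency is down")

    print(call_with_fallback(broken_dependency, "cached answer"))
```

A real service would also want to record each degraded response as an event, so that operations can see how often the dependency is failing.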

Don't Shoot Yourself in the Foot

History and programming styles - buffer overflows, type mismatches.

Role of language design, programming discipline, and good architecture

Always build a test suite so you can do regression testing in the future.  Make it possible for your tests to be used operationally in your production environment.

Provide the justifications, so that people can reuse the concepts, if not the code.

NP completeness - regex performance graph of group matches

Provide a way back.

Know Your Customer

Services often need consistency rather than additional features

Don't guess what the users want; find out. If possible, watch what they do, not what they say.

There's more than one right way to do it: Some arguments are religious, and it doesn't matter which side you're on. Sites differ; what's right for one site may be stupid at another.

Look for price discontinuities

If you can't solve the problem, change it <Unistrokes>

Real Data

Metrics - Real data for analysis

Repeatable, unbiased testing (scientific method - lab testing) Compare apples to apples

Don't believe the documentation; check the code, stub it out and see if it works.

Don't believe benchmarks; test things yourself with your load.  Don't assume that what people tell you is true for your machines. (It may not be true at all.)

incremental release

Common Building Blocks

Most web oriented infrastructures use:

To keep this all going there are many things you will need to address. You might want to take a look at Four Star Network Management.

Event Logging Systems

Effective system monitoring is a must. It must be possible to communicate system state to external agents. These agents might be software or human beings. As events happen, it must be possible to take an automatic action (invoke a script), set an alarm, forward the event to another system, and/or save the data for future trend analysis.

The event logging system should run on a streaming model. Don't try to make batch-oriented harvesting work. Stream the data up to an aggregating host. Make it easy for the aggregating host to forward the data on as well.

The logging system should support a rich variety of information sources and values. Most likely each event should have an id, and a collection of tags and corresponding values.

The system needs low overhead and fast propagation of information, or it won't be used.

Take action based on the tags and values of an event
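A minimal sketch of the event model described above: each event carries an id plus a collection of tags and values, and actions fire based on those tags. The names Event, EventRouter, add_rule and publish are hypothetical, chosen only for this illustration; a real system would stream events to an aggregating host rather than dispatch them in-process.

```python
import itertools
import time
from dataclasses import dataclass, field

_next_id = itertools.count(1)  # simple process-local id source (assumption)


@dataclass
class Event:
    tags: dict  # tag name -> value, e.g. {"host": "web1", "severity": "critical"}
    id: int = field(default_factory=lambda: next(_next_id))
    timestamp: float = field(default_factory=time.time)


class EventRouter:
    """Runs every action whose required tags all match the event."""

    def __init__(self):
        self.rules = []  # list of (required_tags, action) pairs

    def add_rule(self, required_tags, action):
        self.rules.append((required_tags, action))

    def publish(self, event):
        fired = 0
        for required, action in self.rules:
            if all(event.tags.get(k) == v for k, v in required.items()):
                action(event)
                fired += 1
        return fired  # number of actions taken for this event


if __name__ == "__main__":
    alarms, archive = [], []
    router = EventRouter()
    router.add_rule({"severity": "critical"}, alarms.append)  # set an alarm
    router.add_rule({}, archive.append)  # empty rule matches all: keep for trends
    router.publish(Event({"host": "web1", "severity": "critical"}))
    router.publish(Event({"host": "web2", "severity": "info"}))
    print(len(alarms), len(archive))
```

An action here is any callable, so invoking a script or forwarding the event to another host fits the same shape.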

Health Monitoring - Detect Breakage

Plan for test and validation whenever building a new system.

Error reporting:  Make sure that all errors / warnings / etc are logged, and those logs are processed.

Basic "alive" testing: connect to the machine, service, or whatever, and verify that you get a basic response

Heavy-weight end-to-end functional testing: You should be able to run a test which validates that the service / machine is actually able to fully perform whatever function you ask of it.

Passive monitoring: services should be built to log the work they are performing. Make sure you can see that the service is actually getting work done. For example, check that the web server logs are growing at an expected rate, etc.

Make sure you aren't dependent on something you are monitoring
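The basic "alive" test and the passive log-growth check described above might look something like this. It is a sketch under stated assumptions: the function names is_alive and log_growth are invented for the example, a successful TCP connect stands in for "a basic response", and log rotation (which shrinks the file) is not handled.

```python
import os
import socket


def is_alive(host, port, timeout_s=2.0):
    """Basic alive test: can we open a TCP connection to the service?"""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Refused, unreachable, timed out, or name lookup failed.
        return False


def log_growth(path, previous_size):
    """Passive check: has the log grown since the last sample?

    Returns (current_size, grew). Keep current_size and pass it in as
    previous_size on the next sample.
    """
    current_size = os.path.getsize(path)
    return current_size, current_size > previous_size
```

Neither check proves the service can do real work; that is what the heavy-weight end-to-end test is for.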

Trending System - Prevent problems by watching for trends

You should characterize services in terms of their use of CPU, disk IO, network IO, and memory per transaction or some other unit of work. You should be able to profile your service each time you release a new version to ensure that you understand the resources consumed per unit of work. Once you have this data you can have a trending system which warns you if you are consuming more resources than you should for the work being performed.
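As a rough illustration of the check described above, the sketch below compares observed resource use per unit of work against a baseline taken when the release was profiled. The function name exceeds_baseline and the 20% tolerance are assumptions for the example, not values from any particular tool.

```python
def exceeds_baseline(baseline_per_unit, observed_total, units_of_work,
                     tolerance=0.20):
    """Return True if resource use per unit of work is above the baseline.

    baseline_per_unit: e.g. CPU milliseconds per transaction from profiling.
    observed_total:    the same resource summed over the sample window.
    units_of_work:     transactions (or other units) in the sample window.
    tolerance:         allowed fractional overshoot before warning.
    """
    if units_of_work == 0:
        return False  # no work done, so nothing to compare
    observed_per_unit = observed_total / units_of_work
    return observed_per_unit > baseline_per_unit * (1 + tolerance)


if __name__ == "__main__":
    # Profiled at 12 ms CPU per transaction; the window saw 1000 transactions.
    print(exceeds_baseline(12.0, 15000.0, 1000))  # 15 ms per transaction
    print(exceeds_baseline(12.0, 13000.0, 1000))  # 13 ms per transaction
```

The same comparison applies to disk IO, network IO, and memory, each with its own baseline.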

If you are using load sharing with failover, make sure you can run the service with a machine missing. So if you have a pair of machines, you should be concerned if you see any resource consumed beyond 40-50%, depending on how linearly resource consumption grows with the amount of work and whether there is a knee in the performance curve.
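The arithmetic behind the 40-50% figure can be worked through as below. This assumes load is spread evenly, that resource use grows linearly with work, and that performance degrades past some "knee" utilization; the function name and the 0.8 knee are illustrative assumptions only.

```python
def max_safe_utilization(machines, knee=1.0, failures_tolerated=1):
    """Highest per-machine utilization that still survives failover.

    If failures_tolerated machines are lost, the survivors must absorb
    their load without going past the knee of the performance curve.
    """
    survivors = machines - failures_tolerated
    if survivors <= 0:
        raise ValueError("no machines left to carry the load")
    return knee * survivors / machines


if __name__ == "__main__":
    print(max_safe_utilization(2))            # pair, no knee below 100%
    print(max_safe_utilization(2, knee=0.8))  # pair, knee at 80%
    print(max_safe_utilization(4, knee=0.8))  # four machines, knee at 80%
```

For a pair this gives 50% with no knee and 40% with a knee at 80%, which is where the range in the text comes from. Larger pools can run each machine hotter.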

Example of Cricket for data collection

Alert Management System - Manage Problems

Visualization is important

Workflow & Trouble Ticketing System - Remember what you're doing

Automatically generate tickets

Automatically resolve issues, and close tickets
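One way to picture automatic ticket generation and closure is a small tracker fed by health-check results, as sketched below. The class name TicketTracker and its report method are hypothetical; a real deployment would call into whatever trouble ticketing system is in use.

```python
class TicketTracker:
    """Opens one ticket per failing check and closes it on recovery."""

    def __init__(self):
        self.open_tickets = {}  # check name -> ticket id
        self.closed = []        # ids of tickets that were auto-resolved
        self._next_id = 1

    def report(self, check_name, healthy):
        ticket = self.open_tickets.get(check_name)
        if not healthy and ticket is None:
            # First failure: open a ticket (repeat failures do not duplicate it).
            ticket = self._next_id
            self._next_id += 1
            self.open_tickets[check_name] = ticket
            return ("opened", ticket)
        if healthy and ticket is not None:
            # The check recovered: resolve and close the ticket.
            del self.open_tickets[check_name]
            self.closed.append(ticket)
            return ("closed", ticket)
        return ("unchanged", ticket)


if __name__ == "__main__":
    tracker = TicketTracker()
    for healthy in (True, False, False, True):
        print(tracker.report("web1-http", healthy))
```

Keeping one open ticket per check means a flapping service produces a history of tickets instead of a flood of duplicates.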

Configuration Management

Change one place... result everywhere

side by side versioning

encapsulate

provide a way back
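Side by side versioning with a way back is often done by keeping each release in its own directory and switching a "current" symlink. The sketch below assumes a POSIX filesystem and a hypothetical layout (a releases directory plus a current link); the function name activate is made up for the example.

```python
import os


def activate(releases_dir, version, current_link):
    """Point current_link at releases_dir/version.

    Returns the previous target (or None) so the caller can roll back.
    The new link is created under a temporary name and then renamed over
    the old one, so readers never see a missing or half-made link.
    """
    target = os.path.join(releases_dir, version)
    if not os.path.isdir(target):
        raise FileNotFoundError(target)
    previous = os.readlink(current_link) if os.path.islink(current_link) else None
    tmp_link = current_link + ".new"
    if os.path.lexists(tmp_link):
        os.unlink(tmp_link)  # clean up after an interrupted earlier attempt
    os.symlink(target, tmp_link)
    os.replace(tmp_link, current_link)  # rename is atomic on POSIX
    return previous
```

Rolling back is then just another call to activate with the earlier version, which is why old releases should stay on disk for a while.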

Base level software distribution could be part of configuration management or separately addressed.

Aggregate command and control

Scriptable or command line

Every large system I have dealt with has needed batch jobs; don't forget to cover this.