Part of Hints for Operating High Quality Services by
Very Very Early Draft 0.1 -- July 10, 2006
Everything will fail... give up on trying to make everything work correctly all the time.
Since everything will fail, you need to make a system survive in the face of losing components.
Since you have to survive losing service components, you should think about making this robust enough that it becomes a tool that enables no-downtime updates, rollbacks, service testing, etc.
So making an HA service means that you need to be able to control where work is done, and you need more than one place where that work can be performed.
The key is that work is broken up into small, atomic units. It's also important for the overall service architecture not to assume that the same machine, process, or even data center will handle successive requests.
Even if the requests are discrete and atomic, the overall system still has to deal with failed requests, because it should be assumed that at some point a machine, process, etc. will fail in the middle of processing a request.
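As a rough illustration of the point about atomic units and mid-request failures, here is a minimal sketch of a dispatcher that hands each unit to whichever worker is next and re-queues the unit if that worker dies. Everything named here (`run_units`, `WorkerDied`, `max_attempts`) is hypothetical, and the sketch assumes each unit is safe to run more than once.

```python
from collections import deque


class WorkerDied(Exception):
    """Raised when a (simulated) worker fails partway through a unit."""


def run_units(units, workers, max_attempts=3):
    """Run each atomic unit on some worker, re-queueing it if the worker dies.

    `workers` is a list of callables; any of them may raise WorkerDied.
    Assumption of this sketch: units are safe to run more than once.
    """
    pending = deque((unit, 0) for unit in units)
    results = {}
    failed = []
    turn = 0
    while pending:
        unit, attempts = pending.popleft()
        worker = workers[turn % len(workers)]  # no unit is tied to one worker
        turn += 1
        try:
            results[unit] = worker(unit)
        except WorkerDied:
            if attempts + 1 < max_attempts:
                pending.append((unit, attempts + 1))  # try again elsewhere
            else:
                failed.append(unit)  # give up and report the failure
    return results, failed
```

Because a unit is never pinned to a worker, a retry naturally lands somewhere else. Units that exhaust their attempts are reported rather than silently dropped.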
Load balancing is not the same as making something HA.
Places to do service routing:
Push state back to the client so that the client can retry the request. The client can be given a list of rendezvous points to use, or a single rendezvous point which is virtualized. This tends to be one of the most reliable solutions, with the downside that retries have to go all the way back to the client. It also limits the service provider's control over where requests land, and it requires a smarter client, which could be compromised.
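The client-side approach might look roughly like the sketch below. The `send(endpoint, request)` transport function and the endpoint names are assumptions made for illustration. The only point is that the client holds the request state, so it can replay the request against the next rendezvous point.

```python
import random


def call_with_rendezvous(request, rendezvous, send, rng=random):
    """Try each rendezvous point until one accepts the request.

    `rendezvous` is the list the service handed to the client, and `send`
    is a hypothetical transport function send(endpoint, request) that
    raises ConnectionError on failure. The client keeps the request
    state, so a retry needs nothing from the failed endpoint.
    """
    candidates = list(rendezvous)
    rng.shuffle(candidates)  # spread clients across the endpoints
    errors = []
    for endpoint in candidates:
        try:
            return send(endpoint, request)
        except ConnectionError as exc:
            errors.append((endpoint, exc))
    raise ConnectionError(f"all rendezvous points failed: {errors}")
```

Note that the order in which endpoints are tried is decided entirely by the client, which is exactly the loss of control over request placement described above.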
Name to IP address binding
Round robin with health checks, CDN conditional responses, etc.
Lots of DNS resolvers don't play nice: even if you say don't cache, some will. This can give you nearly free, very coarse high availability, but even while your service is running, some users might see nothing but failures because they are talking to an old, cached address which is down.
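To make the caching problem concrete, here is a toy model, not a real DNS server. `RoundRobinNames` stands in for a name service that rotates through addresses that passed a health check, and `StubbornCache` stands in for a resolver that ignores the TTL. Both classes and the addresses (taken from the documentation range 192.0.2.0/24) are made up for this sketch.

```python
class RoundRobinNames:
    """Toy answer source: rotates through healthy addresses."""

    def __init__(self, addresses, ttl_seconds=30):
        self.addresses = list(addresses)
        self.healthy = set(addresses)
        self.ttl_seconds = ttl_seconds
        self._next = 0

    def mark(self, address, is_healthy):
        """Record the result of a health check for one address."""
        if is_healthy:
            self.healthy.add(address)
        else:
            self.healthy.discard(address)

    def resolve(self):
        """Return (address, ttl) for the next healthy address, or None."""
        for _ in range(len(self.addresses)):
            candidate = self.addresses[self._next % len(self.addresses)]
            self._next += 1
            if candidate in self.healthy:
                return candidate, self.ttl_seconds
        return None


class StubbornCache:
    """Resolver that ignores the TTL and keeps its first answer forever."""

    def __init__(self, source):
        self.source = source
        self.cached = None

    def resolve(self):
        if self.cached is None:
            self.cached = self.source.resolve()
        return self.cached
```

Once an address is marked unhealthy, fresh lookups skip it, but anyone behind the stubborn cache keeps receiving the dead address no matter how short the TTL was.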
Name to multiple IPs that are passed around:
Indirection at layer 2
Making HA IP addresses:
A pair of machines running heartbeat. Cheap and easy, but will run into scaling issues.
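The general idea behind a two-machine pair can be sketched as follows. This is a generic illustration of timeout-based takeover, not how any particular heartbeat package is implemented. `StandbyNode` and its methods are hypothetical names, and the sketch ignores failback and split-brain handling.

```python
class StandbyNode:
    """Standby half of a two-machine pair sharing one service address."""

    def __init__(self, service_ip, timeout_seconds=3.0, now=0.0):
        self.service_ip = service_ip
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat = now
        self.owns_ip = False

    def heartbeat_received(self, now):
        """Called whenever a heartbeat from the primary arrives."""
        self.last_heartbeat = now

    def tick(self, now):
        """Take over the address if the primary has been silent too long."""
        silent_for = now - self.last_heartbeat
        if not self.owns_ip and silent_for > self.timeout_seconds:
            self.owns_ip = True  # a real setup would claim the IP here
        return self.owns_ip
```

The timeout is the main tuning knob in a scheme like this. A short timeout fails over quickly but risks taking over from a primary that is merely slow.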
Load-balancing appliances typically have some primitive, poll-based heartbeat to determine which services are available, and spread traffic between those services. These appliances typically don't support retrying a failed request. Most don't mark a service as unavailable until their polling health check runs, even if connections to the service are already failing. Most of these devices are silly expensive when you think about the hardware costs. The health checking and load balancing are fairly primitive, which means they will typically not be an ideal match for any given application. The flip side is that these are typically quick to deploy, which can cut the time to market of a reasonable HA solution.
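The gap between a service dying and the next poll noticing can be shown with a small model. `PollingBalancer` and the `is_up` probe are hypothetical and do not describe any specific product. The sketch only reproduces the behavior described above: availability changes only when `poll()` runs, and failed requests are not retried.

```python
class PollingBalancer:
    """Toy balancer: backends are marked down only when poll() runs."""

    def __init__(self, backends, is_up):
        self.backends = list(backends)
        self.is_up = is_up  # hypothetical probe: is_up(backend) -> bool
        self.available = list(backends)
        self._next = 0

    def poll(self):
        """Periodic health check; the only place availability changes."""
        self.available = [b for b in self.backends if self.is_up(b)]

    def route(self):
        """Pick the next backend believed available; no retry on failure."""
        if not self.available:
            raise RuntimeError("no backends available")
        backend = self.available[self._next % len(self.available)]
        self._next += 1
        # Return the choice and whether the request would have succeeded.
        return backend, self.is_up(backend)
```

With two backends and one of them down, half of the requests fail until the next `poll()` removes the dead backend from rotation.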
Other useful references