Part of Hints for Operating High Quality Services by
Very Very Early Draft 0.13 -- July 10, 2006
The management of persistent state is arguably the thorniest issue in building and operating a high quality service. Whenever possible, design your system to be stateless, so that each transaction / message / whatever fully contains all the information required for the requested action to be performed, removing the need for the service to associate your request with some pile of data.
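As an illustrative sketch of the stateless idea (the token format, SECRET key, and field names here are hypothetical, not from the text above): the service packs everything it needs into a signed token that the client carries, so no server-side session table is required.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"server-signing-key"  # hypothetical shared signing key

def make_token(payload: dict) -> str:
    # Serialize the entire request context into the token itself, so the
    # server needs no session store to process it later.
    body = base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()
    sig = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    return body + "." + sig

def handle(token: str) -> dict:
    # Stateless handling: everything needed is inside the token; the
    # server only verifies it wasn't tampered with.
    body, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("bad signature")
    return json.loads(base64.urlsafe_b64decode(body))

token = make_token({"user": "alice", "cart": ["book"]})
request = handle(token)  # no lookup against any "pile of data"
```

Any machine holding the signing key can handle the request, which is exactly what makes stateless services easy to scale and fail over.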
When you can't avoid state, try to make it someone else's problem:
There are a variety of factors which will affect what type of state store(s) are appropriate for a given system. Factors which need to be considered include:
If you can't punt state storage... some rules of thumb:
Most people are familiar with complex, relational data which requires ACID guarantees. If you are dealing with this sort of data, then you should go with a real database system. There are a variety of commercial database systems such as Microsoft SQL Server and Oracle. There are also a number of open source databases such as Postgres, SQLite, and MySQL which provide many features found in traditional databases. There have also been discussions about problems with ACID and how to fix them.
In most large, real-world production systems, some of the ACID guarantees have been weakened to improve system performance. There is typically some sort of audit log which enables detecting and repairing data corruption.
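One sketch of the audit-log idea (the hash-chain scheme and record format are my own illustration, not a specific production design): each entry is chained to its predecessor, so replaying the log detects corruption or tampering anywhere in it.

```python
import hashlib
import json

GENESIS = "0" * 64  # hash value for "before the first record"

def append_record(log: list, record: dict) -> None:
    # Chain each entry to the previous one's hash; altering any earlier
    # record invalidates every hash after it.
    prev = log[-1]["hash"] if log else GENESIS
    body = json.dumps(record, sort_keys=True)
    digest = hashlib.sha256((prev + body).encode()).hexdigest()
    log.append({"record": record, "hash": digest})

def verify(log: list) -> bool:
    # Replay the log, recomputing every hash; any mismatch means the
    # stored data has diverged from what was originally written.
    prev = GENESIS
    for entry in log:
        body = json.dumps(entry["record"], sort_keys=True)
        if hashlib.sha256((prev + body).encode()).hexdigest() != entry["hash"]:
            return False
        prev = entry["hash"]
    return True

log = []
append_record(log, {"op": "credit", "amount": 10})
append_record(log, {"op": "debit", "amount": 3})
assert verify(log)
```

A real system would pair a log like this with a repair path that rebuilds the damaged state from the last good entry forward.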
A very common approach to scaling a large system is to split data up into slowly changing (small) data which must always be available, and larger / more rapidly changing data which can tolerate occasional outages. This is often implemented by having a read-mostly table, replicated on all databases, for the data which must always be available (and which also contains the location of the rapidly changing data), and then tables stored on individual machines (or pairs of machines) which hold the more rapidly changing / larger data. It is expected that some of these clusters will be unavailable from time to time, but in a properly designed system those outages will only affect a small percentage of the overall user population.
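A minimal sketch of that split (the DIRECTORY table, cluster names, and health set below are all hypothetical): the small, replicated read-mostly table is always available and points at the cluster holding a user's bulkier data, so a cluster outage degrades only that cluster's users.

```python
# Read-mostly directory table, replicated on every database: maps each
# user to the cluster holding their larger / rapidly changing data.
DIRECTORY = {
    "alice": "cluster-3",
    "bob": "cluster-7",
}

# Hypothetical view of which data clusters are currently reachable;
# here cluster-7 is suffering an outage.
CLUSTERS_UP = {"cluster-3"}

def fetch_user_data(user: str) -> str:
    cluster = DIRECTORY[user]  # the always-available lookup
    if cluster not in CLUSTERS_UP:
        # Only users homed on the down cluster are affected; everyone
        # else is served normally.
        raise RuntimeError(f"{cluster} unavailable; {user} degraded")
    return f"data for {user} from {cluster}"
```

With many clusters, one being down translates to a small, bounded slice of users seeing errors rather than a site-wide outage.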
Examples of this sort of approach include:
Most of us manage "state" in our day to day lives through simple interactions with a file system. Many services which require state to be stored often start out as files in a basic file system. So long as a file system on a single machine is sufficient, things are easy and straightforward. On the same machine there are a number of options to enable sharing, synchronization, and locking between processes or threads. As soon as a single machine isn't sufficient, things get a lot more complex. The number of system calls which work across machines is often limited. Furthermore, there is the question of how to make the file system reliable in the face of machine failures.
When a service outstrips a single machine's CPU resources, but is fairly light in terms of file system I/O (common in web services), people will often move the file based state store to a dedicated file server, with multiple machines mounting it and performing their services against the common file system. A two-edged sword with this approach is that you can use the same interconnect (IP over your existing Ethernet connection) to offer your services and to access the state store. The possible downside is bandwidth contention. [This shouldn't be a problem in most cases.] There are three possible issues that you can run into when doing this:
Coordination: In theory, fcntl and lockf will work over NFS, but in practice this is often broken, and most lockd implementations don't scale well. Much more common is to use file system operations which are atomic across all machines (like link) together with opportunistic locking, with some way to recover when a lock was grabbed by a machine which later died before the lock was released.
Reliability: Using a commodity server as the NFS server creates a huge single point of failure. One of the most common approaches is to use a hardened file server (such as those made by Network Appliance) which has a high MTBF and fast recovery. People wanting a more reliable solution will go with an expensive clustered solution, typically a pair of machines in a highly available configuration.
Scalability: NFS client libraries send their RPC requests to a specific address for a given file system. This means that an NFS server has to either be a single machine, or a set of machines sharing (and coordinating the service on) a shared IP address. In most cases this is not a problem, but in exceptional, extremely heavy transaction situations, it can be.
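The link-based opportunistic locking described under Coordination might be sketched like this (the lock-file naming and the 30-second staleness threshold are illustrative choices, not a standard): link() is atomic even over NFS, so exactly one client wins, and a lock left behind by a dead machine is eventually stolen.

```python
import os
import tempfile
import time

LOCK_TTL = 30  # assume a lock older than this belongs to a dead holder

def acquire(lockdir: str, name: str) -> bool:
    # os.link() is atomic across NFS clients, unlike fcntl/lockf which
    # are often broken there: exactly one link() to the shared lock
    # name will succeed.
    src = os.path.join(lockdir, f"{name}.{os.getpid()}")
    lock = os.path.join(lockdir, name + ".lock")
    open(src, "w").close()
    try:
        os.link(src, lock)
        return True
    except FileExistsError:
        # Opportunistic recovery: steal locks whose holder apparently
        # died before releasing (mtime older than the TTL).
        if time.time() - os.stat(lock).st_mtime > LOCK_TTL:
            os.unlink(lock)
            try:
                os.link(src, lock)
                return True
            except FileExistsError:
                return False
        return False
    finally:
        os.unlink(src)  # the scratch file is no longer needed either way

def release(lockdir: str, name: str) -> None:
    os.unlink(os.path.join(lockdir, name + ".lock"))

lockdir = tempfile.mkdtemp()
assert acquire(lockdir, "rebuild-index")        # we hold the lock
assert not acquire(lockdir, "rebuild-index")    # a second grab fails
release(lockdir, "rebuild-index")
```

The TTL-based stealing is the weak point of the scheme: pick it too short and you steal live locks, too long and recovery stalls; real deployments often write the holder's identity into the lock file to make recovery safer.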
For more information:
SAN = storage area network. SANs are very profitable for the companies that have succeeded in this marketplace, such as EMC. I am sort of down on SANs. You have to deploy a separate infrastructure for interconnections (I don't like betting against Ethernet as an interconnect), and they tend to be operationally complex, and rather expensive. A SAN is typically made up of interconnected fibers which connect dedicated storage bricks (embedded controllers with disks) to hosts running special drivers. Halfway between NFS based NAS and SAN is iSCSI.
For more information:
There are a number of file systems which have been designed to run on a cluster of machines. These attempt to provide effective coordination, and address reliability through automatic replication. They typically address scalability horizontally (many boxes rather than one big/fast box). Sometimes clustered file systems are not fully integrated into the operating system. For example, GoogleFS and MogileFS require applications to be linked against the "file system" library. Brent Welch wrote up a brief taxonomy of clustered file systems entitled What is a Cluster File System.
For more information:
Oftentimes, it turns out that state management can be reduced to simple atomic actions on simple key/value pairs, hashes, or other basic data structures like B-trees. Historically many services built simple stores on top of file systems because that was the quickest / easiest way to build a blob store. Richard Jones of last.fm wrote up a decent survey of several of the more interesting open source distributed key/value stores. Some projects that have caught my interest over the years:
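A toy sketch of the kind of interface such stores expose (this in-process KVStore class is purely illustrative, not modeled on any particular project above): simple atomic get/set plus compare-and-swap on key/value pairs.

```python
import threading

class KVStore:
    # Minimal in-process key/value store exposing the sort of simple
    # atomic actions (get / set / compare-and-swap) that distributed
    # key/value stores provide over the network.
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get(self, key, default=None):
        with self._lock:
            return self._data.get(key, default)

    def set(self, key, value):
        with self._lock:
            self._data[key] = value

    def cas(self, key, expected, new):
        # Atomic compare-and-swap: succeeds only if the current value
        # still matches what the caller last saw, so concurrent writers
        # cannot silently clobber each other's updates.
        with self._lock:
            if self._data.get(key) != expected:
                return False
            self._data[key] = new
            return True
```

Compare-and-swap is the workhorse primitive here: it lets many clients coordinate updates without holding long-lived locks, which is why so many of these stores offer it.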