Estimating IT Staffing / Factors that Affect Scaling IT Staff
Appeared in Unix Review, April 1991
Minor update December 1, 2008... this really needs a full rewrite
I posted a note to Usenet responding to a question about staffing ratios. Rob Kolstad asked me to expand that short note into an article for Unix Review. The original article in troff -ms form is still around. Over the years I have made some mostly minor updates to the original article. One of these days I will rewrite it completely. While this article was written a long time ago, I find that the ratios are still pretty accurate. If you think I am wrong, send me mail with your experiences.
I will note that this article was primarily written with classic enterprise computing in mind... what happens inside a company or a university where IS/IT solutions are delivered. There was a nice graph in the slide deck Impliance: an Information Management Appliance by folks from IBM Research which captured how staffing costs have gone up in comparison to the cost of hardware for enterprise computing. While related, running a production service which delivers software as a service is quite a bit different from enterprise computing. There was a brief article in CIO which demonstrates what happens when you benchmark enterprises against service providers. In a production environment, there is typically significantly more investment made in infrastructure and tools, work is often shared between an engineering group and an operations group, and there are often economies of scale that I only hint at in this paper. I discuss some of these issues in my Hints for Operating High Quality Services. James Hamilton has noted that in large-scale operations, human staff accounts for less than 10% of overall costs.
"How many system administrators does a site need?" is a commonly asked and difficult to answer question. There is no magic ratio. There is no ideal staffing level. The appropriate number of administrators depends on what each system administrator is responsible for and on the level of service expected in each area of responsibility.
The best way to estimate the number of administrators needed is to figure out what level of service is required and how various factors (for instance, networking infrastructure and heterogeneity of the machines being supported) will affect the fulfillment of those responsibilities. Rarely are system administrators doing only "administrator" tasks. The first part of this article will detail the tasks that I find myself performing in addition to the normal "administrator" tasks, such as backups, adding new users, operating-system maintenance, and so forth. Additional tasks are presented (for the most part) in the form of questions. The second part details some of the various factors that will affect staff levels. The third part details some simple perspectives that system administrators can adopt to make their environment more easily administrable. Finally, I will end by quickly examining some ratios which might help you to approximate your staffing needs.
Does the site want you to conduct workshops or prepare extensive local documentation? To what extent are you expected to consult on technical issues? Do you concern yourself with just UNIX or other realms? For example, let's say your site has heavy users of TeX, Mathematica, Common LISP, C++, X11, PostScript, and Sybase. Are you supposed to be able to answer detailed questions on all those topics? Few people are experts at all these things. Something that many people don't appreciate is that development of expertise in any given topic area requires time to play, experiment, and mature in that area.
Another problem is that the site with a single system administrator has a single point of failure: when the administrator is on vacation (or gets run over by a bus), the site is vulnerable. Carrying a pager on vacation isn't my idea of fun; however, no one can predict when a crisis might occur. Of course, it's hard to interest a high-level person in a job that also involves changing the backup tapes and crawling through the ceilings.
Secondly, larger sites can leverage off previous work. The first installation of a machine or piece of software is always the most difficult. The second is easier. By the time you have done 50 or 100 installations, you have developed automatic scripts and can do installations in your sleep. I have seen large sites at a 1:100 administrator-to-machine ratio where things ran pretty well. I must caution the reader though: this sort of ratio is only feasible with top-notch people working in a carefully engineered environment with many hundreds of users. Most sites can't get productive work done with this sort of ratio. This sort of ratio also limits the professional growth of members of the system staff because they will spend most of their time with the day-to-day issues and fire-fighting. This is a shame since an organization's most valuable resource is its people.
Sites which need to be highly available (e.g. greater than 99.9% service delivery) will require a higher level of staffing. The reason is that you need people who can respond almost immediately to any service issue (e.g. 24x7 coverage, ideally at least two people deep who can do first- and second-level resolution and escalate to subject-area experts). You also need multiple people in each subject area who are able to diagnose and resolve complex issues quickly.
Note: The above points are applicable to small organizations. When facing larger-scale issues, there are numerous other factors that come into play. I have attempted to capture some of the lessons I have learned in Hints for Operating High Quality Services. I would recommend looking at the papers noted at the end of Design Principles to Remember.
A colleague suggested to me that it is critical to keep in mind what factors affect the scaling of a team. He provided a nice summary table.
| Increased SA Efficiency | Decreased SA Efficiency |
| --- | --- |
| Common SA tools | Diverse hardware baseline |
| Robust IS security | Diverse software baseline |
| Tight control over what gets loaded on the HW/SW baseline | Lax IS security |
| Redundancy of critical services | Little or no training |
| Separating services (single-service machines) | A staff that is reactive, not proactive |
| Good training program | Ad-hoc backups or no backups |
| Detailed disaster recovery plans, by system | |
| Systems which don't require backups | |
| Good backup/restore program, centrally managed | |
The following is a very rough set of rules I use to estimate staffing requirements. Your mileage will vary. I should note that these numbers assume a reasonably stable environment. Rapid turnover of the user base, machine churn, abnormally frequent software changes, growth of the environment, etc., result in more work and affect the ratios.
| Type of Work | Units of labor to deliver best-practice performance, and scaling factors |
| --- | --- |
| IS/IT User Service | 1 unit for every 10 computer-phobic users who need to do "complex things" (the hand-holding factor); 1 unit for every 30 users who get good service; 1 unit for every 120 users who get basic service (e.g. students in an educational factory who mostly self-serve :-) ), assuming 8x5 support. The ratio has to go up if you want the help desk to run extended hours. Note: companies that are running well-specified software as a service typically have a completely separate organization that handles end-user requests. In these cases, the number of people being provided "user service" by the operations group is the number of customer service representatives, not the number of end-user customers. |
| 24x7 Support (partners, clients, etc.) | A 24x7 NOC which requires proactive notification and rapid problem resolution scales against the complexity of the service being managed and the number of high-touch clients. Places that really care about this have a step in cost of 14 people... a manager, an assistant manager, and three shifts, with each shift having two people, one team running Sunday-Wednesday and the other running Wednesday-Saturday so there is overlap between teams, clean handoffs, and time to do group training. Less than this can easily result in shifts not being covered. For example, having a single person per shift can fail if the night-shift person falls asleep, or if someone working one of the weekend shifts gets sick. This doesn't count the people escalated to. Some places have their ops/engineers carry pagers and back up the front-line NOC. Other places have dedicated folks to handle escalations. The number of people needed per shift is related to how much normal work there is, and how many simultaneous disasters the team is expected to be able to handle. |
| Operating System Management | 2 units for each make of OS requiring basic background. If you are pushing the OS beyond its mainstream / tested scale, add an additional 4 units. If you are doing very complex things requiring hacked kernels, non-standard device drivers, etc., add 4 units. If you really care about security, add an additional 4 units. Need functionality which isn't in the kernel at this time, and/or something more than basic JumpStart or Kickstart for installation and management? Manage this like a software development project and get good engineers working on it. |
| Hardware Management / Host Imaging (OS Deployment) | 1 unit for every 20 boxes if you can't protect the OS and system configurations from the users (Windows in many environments). 1 unit for every 40 boxes if you can protect the OS from the users without hindering them, but can't automatically build / rebuild / update the OS and software without sysadmin oversight. 1 unit for every 120 boxes without users which have network-based software installs (compute clusters). Extremely large-scale operations (thousands of machines running completely cookie-cutter) scale more like 500 boxes per unit, and might scale as high as 2500 boxes per unit at a Google scale where you don't have to worry about the health of individual machines. |
| Platform Interoperability | 2 units per OS if tightly coupled (shared file service, etc.). |
| Simple Network Services | 1 unit for every two basic services that are set up network-wide instead of machine-wide, e.g. news spool, httpd, DNS, mail, printing, Samba. Add 2 units if you want to make them highly available (better than 99.8%). Add 2 units if you care about security. Add 2 units if you are scaling larger than average. Add 4 units if you are scaling to mega size and are beyond what the software was designed for. If you are completely beyond scale, treat it as a development project and staff accordingly with real engineers. |
| Complex Network Services | Highly variable. For example, a multi-terabyte database used for data mining could easily consume multiple DBAs plus multiple senior system administrators who specialize in performance tuning and large-scale storage systems. |
| Network Connectivity | Scales against the number of network devices, number of networks, security issues, complexity of routing, and HA requirements. I don't have good numbers at this time. |
| Coordination and Management | The larger and more complex an organization, the greater the need for coordination roles: people who focus on human management, systems architecture, program management, and project management. This is quite complex. It would be presumptuous to suggest a ratio. |
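The 24x7 NOC row above contains some implicit arithmetic worth making explicit. A minimal sketch, assuming the head counts given in the table (two people per shift, three shifts per day, two overlapping weekly teams, plus a manager and assistant manager); the function and parameter names are hypothetical:

```python
# Sketch of the 24x7 NOC sizing arithmetic described above.
# Head counts come from the article; everything else is illustrative.

def noc_headcount(shifts_per_day=3, people_per_shift=2,
                  weekly_teams=2, managers=2):
    """Minimum staff for a 24x7 NOC with overlapping team coverage."""
    # Each of the two teams covers half the week (Sun-Wed / Wed-Sat),
    # so every shift slot must be filled on both teams.
    return shifts_per_day * people_per_shift * weekly_teams + managers

print(noc_headcount())  # 3 shifts * 2 people * 2 teams + 2 managers = 14
```

This makes visible why the cost comes in a step of 14: dropping any term (say, going to one person per shift) removes the overlap that protects against sickness, sleep, and handoff gaps.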
A solid SAGE II system administrator can handle 4 units of work. A strong SAGE III system administrator can handle 8 units of work. A superior SAGE IV system administrator can handle 12 units of work. This counting system is loosely based on an equation proposed by Sherwood Botsford and found in the comp.unix.admin FAQ. At some point I will update the counting to use my Operations Skill Matrix (excel).
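To show how the unit counts and per-level capacities combine into a head count, here is a minimal sketch. The unit values and capacities (SAGE II = 4, III = 8, IV = 12 units) are from this article; the example workload itself is hypothetical:

```python
import math

# Rough staffing estimate from units of work, per the counting scheme above.
# Capacities are from the article; the workload below is a made-up example.

UNITS_PER_ADMIN = {"SAGE II": 4, "SAGE III": 8, "SAGE IV": 12}

workload_units = {
    "user service (30 well-served users)": 1,
    "operating system (one OS, basic)": 2,
    "host imaging (40 protected boxes)": 1,
    "simple network services (DNS + mail)": 1,
    "high availability for those services": 2,
}

total = sum(workload_units.values())
print(f"total units of work: {total}")
for level, capacity in UNITS_PER_ADMIN.items():
    # Round up: you can't hire a fraction of an administrator.
    print(f"{level}: {math.ceil(total / capacity)} admin(s)")
```

For this seven-unit workload the arithmetic suggests two SAGE II administrators, or a single SAGE III or IV; in practice you would also weigh the single-point-of-failure concern raised earlier before settling on a staff of one.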
In the last few years a lot of people have talked about the ratios they think are reasonable. It is common to hear people talking about staff-to-user ratios of 1:60 where there is some variation in the population and a lot of custom work, and staff-to-user ratios of 1:150 (or higher) in locations that can use "cookie cutter" solutions, e.g. universities with hordes of undergraduates or enterprises where people are using computing as a tool rather than looking to innovate on the machines being administered. A more realistic set of ratios (based on best practices in the field rather than vendor white papers on TCO) was the Mega Group's Improve staffing ratios article. A number of other studies have found that in the real world most organizations have not been able to support ratios greater than 30:1. A Mitre study from 2000 suggested that the ratio is 47:1 +/- 17%. In a video about User to Technician Ratios, Justin Nguyen suggested a base ratio of 60:1, along with a number of factors which impact this ratio.
An example of over-inflated numbers can be found in Staffing for Technology Support, a white paper for educational institutions. Unfortunately, these folks are trying to apply staffing ratios from MIT's Project Athena to the rest of the world. This is flawed for three reasons. First, most sites don't have the sophisticated tools that Athena had. Second, Athena had people who made Athena run who were not captured in their ratios: student volunteers who did a lot of the work, and hard-core system programmers who developed tools that met MIT's requirements. Finally, MIT's user population is not an average user population.
David Cappuccio of the Gartner Group suggested in his article Know The Types: Sizing up Support Staffs that there are two ratios you need to consider. The first is staff to users, an attempt to capture the human part of the equation; this ratio looks at how many people you need to do what is often called Tier I, help desk, or user support. The second is the number of machines and subsystems per staff member, capturing how many people are needed to take care of the technical infrastructure. While I like David's framework, I think that his ratios are too high for user support, and that he has failed to capture the diverse set of technologies most organizations deploy: there is much more than print, file, web, and database servers. There are directory, security, messaging, and collaborative services. To complicate matters, many sites are heterogeneous, requiring extra effort to make one service work for all clients, or worse, resulting in the need for services which depend on the client platform. A final complicating factor is that these services often have complex interactions and dependencies which make them more difficult to deploy and maintain. The result is that David's ratios will produce staffing which will be able to deliver only the most basic services at an adequate level.
The itbenchmark blog has a number of postings on the topic of staff sizing.