How Many Administrators are Enough?

Estimating IT Staffing / Factors that Affect Scaling IT Staff

MARK VERBER
Appeared in Unix Review, April 1991
Minor update December 1, 2008... this really needs a full rewrite

I posted a note to Usenet responding to a question about staffing ratios.  Rob Kostad asked me to expand that short note into an article for Unix Review. The original article in troff -ms form is still around. Over the years I have made some, mostly minor updates to the original article. One of these days I will rewrite it completely. While this article was written a long time ago, I find that the ratios are still pretty accurate. If you think I am wrong, send me mail with your experiences.

I will note that this article is primarily written thinking about classic enterprise computing: what happens inside a company or a university where IS/IT solutions are delivered. There was a nice graph in the slide deck Impliance: an Information Management Appliance by folks from IBM Research which captured how staffing costs have gone up in comparison to the cost of hardware for enterprise computing. While related, running a production service which delivers software as a service is quite a bit different from enterprise computing. There was a brief article in CIO which demonstrates what happens when you benchmark enterprises against service providers. In a production environment, there is typically significantly more investment made in infrastructure and tools, work is often shared between an engineering group and an operations group, and there are often economies of scale that I only hint at in this paper. I discuss some of these issues in my Hints for Operating High Quality Services. James Hamilton has noted that in large scale operations, human staff accounts for less than 10% of overall costs.

The Enigmatic Question

"How many system administrators does a site need?" is a commonly asked and difficult to answer question. There is no magic ratio. There is no ideal staffing level. The appropriate number of administrators depends on what each system administrator is responsible for and on the level of service expected in each area of responsibility.

The best way to estimate the number of administrators needed is to figure out what level of service is required and how various factors (for instance networking infrastructure and heterogeneity of the machines being supported) will affect the fulfillment of those responsibilities. Rarely are system administrators doing only "administrator" tasks. The first part of this article will detail the tasks that I find myself performing in addition to the normal "administrator" tasks, such as backups, installing new users, operating-system maintenance, and so forth. Additional tasks are presented (for the most part) in the form of questions. The second part details some of the various factors that will affect staff levels. The third part details some simple perspectives that system administrators can adopt to make their environment more easily administrable. Finally, I will end by quickly examining some ratios which might help you to approximate your staffing needs.

What System Administrators Do

User Services (e.g. Helpdesk)

How much hand-holding is expected? Some sites have users who are pretty self-sufficient; other sites have users who need assistance for everything they do. Can your users take care of themselves or do they need and want the administrator to perform even the simplest tasks for them? For example, I have a friend whose users demand that he perform the most basic tasks for them (such as moving their files from one directory to another). Anything that isn't simply invoking the text editor or reading mail is "UNIX" and hence a job for the administrator. This sort of support requires a ratio something like one administrator for every four users.

Does the site want you to conduct workshops or prepare extensive local documentation? To what extent are you expected to consult on technical issues? Do you concern yourself with just UNIX or other realms? For example, let's say your site has heavy users of TeX, Mathematica, Common LISP, C++, X11, PostScript, and Sybase. Are you supposed to be able to answer detailed questions on all those topics? Few people are experts at all these things. Something that many people don't appreciate is that development of expertise in any given topic area requires time to play, experiment, and mature in that area.

Software Support

How much public domain software or freeware do people want installed? What level of support are they expecting? Just compiling and installing software doesn't take much time. Often though, software doesn't just compile and install properly. There are often assumptions in the software which need to be changed before the software can be used at a given site. In addition, administrators are often expected (and rightly so) to continue maintenance of the software (bug fixes and what not) and to become an expert in the use of the software. Compiling and installing (coupled with frequent patches) on many hardware/software platforms can make this incredibly time consuming for even just a few software packages. The time this takes varies with the quality and complexity of the software. Keeping a current version of kermit or perl isn't hard (I wish everyone did as nice a job as Larry Wall has with perl); keeping up with g++ is much more time-consuming.

Custom Software

Most places not only expect the system administrators to keep their world running, but also to create -- on demand -- tools for the user population. This is understandable, especially in small sites where the administrator might be the only professional programmer. If there is this expectation, time must be allocated for this development process.

Site Planning/Administration Overhead

How much site planning is the administrator expected to handle? Must the administrator know that the average person generates 115 watts, and how to factor that and the heat loads from machines into sizing appropriate AC/heating loads and power? How much paperwork is there?

Hardware/Network Maintenance

Who crawls through the ceiling to pull wires? Who finds the flaky transceiver when the Ethernet starts to go crazy? When a terminal or workstation dies, does a secretary just call your vendor and wait, or are more creative solutions required? Does your site buy all its peripherals ready-to-install or do you save money by purchasing components and do the integration yourself? Having a system administrator do any of these things takes time.

Anticipate Technology

Is the administrator supposed to anticipate new technology and advise the company about new approaches? Most places I have worked expect administrators to have a good feel for the state of the art and new technology that looks promising (not just products, but research, too). Anticipation is often necessary given that many sites have a two-to-five year planning or depreciation schedule. Keeping up with our field isn't easy. There are a variety of sources one must draw upon to stay current. Trade rags can give you a picture of what is being sold; Usenet (and other electronic media) is great for questions regarding current issues and problems. Professional journals from ACM, IEEE, etc. are useful to see what is happening on the almost-done research front. There is no substitute, however, for a good network of professional contacts. This network can be maintained with phone calls, electronic mail, and attending conferences.

Other Issues

Sites with one administrator are not very desirable.

They are a fact of life since many small sites can neither afford nor justify more than one system administrator. It is difficult for one person to have the breadth of knowledge and experience to run a really first-class site, no matter how few machines it has. There will always be some area that is not the strength of a sole administrator.

Another problem is that the site with a single system administrator has a single point of failure: when the administrator is on vacation (or gets run over by a bus), the site is vulnerable. Carrying a pager on vacation isn't my idea of fun; however, no one can predict when a crisis might occur. Of course, it's hard to interest a high-level person in a job that also involves changing the backup tapes and crawling through the ceilings.

The more homogeneous a site is, the easier it is to support.

The number of different platforms supported (different machine architectures or different operating systems) increases the complexity of the support task. Upgrading the operating system will have to be done at least once by hand for each platform. Each operating system has its own idiosyncrasies that must be learned and mastered. Most sites want all the platforms to appear identical so that their users can sit down on any of the workstations and get work done. This requires that each platform have identical tools, window systems, etc. This can greatly increase the amount of work the administrator must do. In the best of circumstances this means recompiling programs for each platform. In the worst circumstances, it involves porting software and fighting with vendor-supplied software. My personal nightmare is trying to support all of X11R4 (from MIT), DECwindows, OSF/Motif, and Sun's OpenWindows on three different platforms.

Larger sites can exploit economies of scale.

Large sites can expand their administration staffs less rapidly than the number of users (or workstations) grows. The reason for this is that as your staff gets larger it is possible for people to specialize. This specialization permits individual staff members to develop a depth of expertise that enables them to understand all the issues on a given topic and solve more quickly whatever problems crop up.

Secondly, larger sites can leverage off previous work. The first installation of a machine or piece of software is always the most difficult. The second is easier. By the time you have done 50 or 100 installations, you have developed automatic scripts and can do installations in your sleep. I have seen large sites at a 1:100 administrator-to-machine ratio where things ran pretty well. I must caution the reader though: this sort of ratio is only feasible with top-notch people working in a carefully engineered environment with many hundreds of users. Most sites can't get productive work done with this sort of ratio. This sort of ratio also limits the professional growth of members of the system staff because they will spend most of their time with the day-to-day issues and fire-fighting. This is a shame since an organization's most valuable resource is its people.

High availability sites require higher staffing.

Sites which need to be highly available (e.g. greater than 99.9% service delivery) will require a higher level of staffing. The reason for this is that you need people who can respond almost immediately to any service issue (e.g. 24x7 coverage, ideally at least 2 people deep, who can do first and second level resolution and are able to escalate to subject area experts). You also need to have multiple people for each subject area who are able to diagnose and resolve complex issues quickly.
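
To put an availability target such as 99.9% in perspective, it helps to convert it into the downtime it permits per year. The following is a minimal sketch of that arithmetic; the function name is made up for illustration, and it assumes a 365-day year and that "service delivery" is measured as simple uptime.

```python
# Allowed downtime implied by an availability target (illustrative arithmetic only).
HOURS_PER_YEAR = 365 * 24  # 8760 hours; leap years are ignored


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the hours per year a service may be down and still meet the target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Targets taken from figures that appear in this article.
for target in (99.8, 99.9, 99.994):
    hours = allowed_downtime_hours(target)
    print(f"{target}% -> {hours:.2f} hours/year ({hours * 60:.0f} minutes)")
```

The tighter the target, the less time there is to notice, diagnose, and fix a problem, which is why coverage has to be immediate and more than one person deep.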

Hints for Making Administration Easier

As is sadly too often the case with support staff, system administrators are not highly regarded (even though everyone at the site depends on them). My experience is that there are never enough system administrators. Because staffing levels are not what they should be, a system administrator needs to take all possible (productive) short cuts and have a proactive rather than reactive approach to system-administration tasks. If administrators do not employ a proactive approach, they will find themselves constantly in a "fire-fighting mode," which is counter-productive. System administrators need to leverage their time as much as possible. Here are some of the things that help me survive at my site.

Note: The above list is applicable to small organizations. When facing larger scale issues, there are numerous other factors that come into play. I have attempted to capture some of the lessons I have learned in Hints for Operating High Quality Services. I would recommend looking at the papers noted at the end of Design Principles to Remember.

What About Other Platforms?

The platform which is being supported makes a great deal of difference. My experience is that support of Macintosh and UNIX communities takes approximately the same staffing levels. Historically, support of PCs running any Microsoft OS has seemed to require at least double the staffing while delivering a lower level of service. Since Windows XP the ratio doesn't need to be as high, but I still find administration scales better on UNIX than on Windows.

Know What Drives Scaling

A colleague suggested to me that it is critical to keep in mind what factors affect the scaling of a team. He provided a nice summary table.

Increased SA Efficiency:
Enterprise monitoring
Common SA tools
Standards
Robust IS security
Tight control over what gets loaded on HW/SW baseline
Redundancy of critical services
Separating services (single service machines)
Good training program
Detailed disaster recovery plans, by system
Systems which don't require backups
Good backup/restore program, centrally managed

Decreased SA Efficiency:
Diverse hardware baseline
Diverse software baseline
Lax IS security
Little or no training
A staff that is reactive, not proactive
Ad-hoc backups or no backups

My Ratios

The following is a very rough set of rules I use to estimate staffing requirements. Your mileage will vary. I should note that these numbers assume maintaining a reasonably stable environment. Rapid turnover of the user base or machines, abnormally frequent software changes, growth of the environment, etc. result in more work and affect the ratios.

Type of Work: Units of labor to deliver best practice performance, and scaling factors

IS/IT User Service: 1 unit for every 10 computer-phobic users who need to do "complex things" (hand-holding factor), 1 unit for every 30 users who get good service, and 1 unit for every 120 users who get basic service (e.g. students in an educational factory who mostly self-serve :-), assuming 8x5 support. Staffing has to go up if you want the help desk to run extended hours. Note: companies that are running well-specified software as a service typically have a completely separate organization that handles end-user requests. In these cases, the number of people being provided "user service" by the operations group is the number of customer service representatives, not the number of end-user customers.

24 x 7 Support (Partners, clients, etc.): Doing a 24x7 NOC which requires proactive notification and rapid problem resolution scales against the complexity of the service that is being managed and the number of high-touch clients. Places that really care about this have a step in cost of 14 people: a manager, an assistant manager, and three shifts of two people each, with one team running Sunday-Wednesday and the other running Wednesday-Saturday so there is overlap between teams, clean handoffs, and time to do group training. Less than this can easily result in shifts not being covered. For example, having a single person per shift can fail if the night shift person falls asleep, or if someone working one of the weekend shifts gets sick. This doesn't count folks to escalate to. Some places have their ops / engineers carry pagers and back up the front line NOC. Other places have dedicated folks to handle escalations. The number of people needed per shift is related to how much normal work there is, and how many simultaneous disasters the team is expected to be able to handle.

Operating System Management: 2 units for each make of OS requiring basic background. If you are pushing the OS beyond mainstream / tested scale, add an additional 4 units. If you are doing very complex things requiring hacked kernels, non-standard device drivers, etc., add 4 units. If you really care about security, add an additional 4 units. Need functionality which isn't in the kernel at this time and/or something more than basic jumpstart or kickstart for installation and management? Manage this like a software development project and get good engineers working on it.

Hardware Management / Host Imaging (OS Deployment): 1 unit for every 20 boxes if you can't protect the OS and system configurations from the users (Windows in many environments). 1 unit for every 40 boxes if you can protect the OS from the users without hindering the user, but can't automatically build / rebuild / update the OS and software without sysadmin oversight. 1 unit for every 120 boxes without users which have network-based software installs (compute clusters). Extremely large scale operations (1000s of machines running completely cookie cutter) scale more like 500 boxes / unit and might scale as high as 2500 boxes / unit at a Google scale where you don't have to worry about the health of individual machines.

Platform Interoperability: 2 units * # of OS if there is tight coupling (shared filing, etc.).

Simple Network Services: 1 unit for every two basic services that are set up network wide instead of machine wide, e.g. newsspool, httpd, DNS, mail, printing, SAMBA. Add 2 units if you want to make them highly available (better than 99.8%). Add 2 units if you care about security. Add 2 units if you are scaling larger than the average. Add 4 if you are scaling to mega size and are beyond what the software was designed for. If you are completely beyond scale, treat it as a development project and staff accordingly with real engineers.

Complex Network Services: Highly variable. For example, a multi-terabyte database used for data mining could easily consume multiple DBAs plus multiple senior system administrators who specialize in performance tuning and large scale storage systems.

Network Connectivity: Scales against number of network devices, number of networks, security issues, complexity of routing, and HA requirements. I don't have good numbers at this time.

Coordination and Management: The larger and more complex an organization, the more there is a need for coordination roles: people who focus on human management, systems architecture, program management, and project management. This is quite complex. It would be presumptuous to suggest a ratio.
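
One way to read the 14-person step for 24x7 support is as simple arithmetic: two managers plus two people on each of three daily shifts, for each of the two teams that split the week. The sketch below encodes that reading; the function and parameter names are invented for illustration, and the breakdown is an interpretation of the figure rather than a formula given in the original.

```python
def noc_headcount(shifts_per_day: int = 3,
                  people_per_shift: int = 2,
                  weekly_teams: int = 2,
                  managers: int = 2) -> int:
    """Headcount for a 24x7 NOC; the defaults mirror the 14-person step described above."""
    return managers + shifts_per_day * people_per_shift * weekly_teams


print(noc_headcount())                    # baseline: manager, assistant manager, 12 shift staff
print(noc_headcount(people_per_shift=3))  # hypothetical busier NOC with three people per shift
```

Because every extra person per shift is multiplied by the number of shifts and teams, coverage depth is expensive, which is why the text treats 24x7 support as a step cost rather than something that grows smoothly.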

A solid SAGE II system administrator can handle 4 units of work. A strong SAGE III system administrator can handle 8 units of work. A superior SAGE IV system administrator can handle 12 units of work. This counting system is loosely based on an equation proposed by Sherwood Botsford and found in the comp.unix.admin FAQ. At some point I will update the counting to use my Operations Skill Matrix (excel).
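
As a rough illustration of how the unit counts and the per-administrator capacities combine, the sketch below totals units for a hypothetical site and divides by capacity. The site figures and function names are invented for the example, only four rows of the table are used, and none of the optional add-ons (security, high availability, scale) are included.

```python
import math

# Units of work one administrator can handle, by SAGE level (from the text above).
UNITS_PER_ADMIN = {"SAGE II": 4, "SAGE III": 8, "SAGE IV": 12}


def total_units(users_good_service: int, os_count: int,
                protected_boxes: int, network_services: int) -> float:
    """Sum units for a hypothetical site using a few rows of the table."""
    units = users_good_service / 30   # user service: 1 unit per 30 users (good service)
    units += 2 * os_count             # OS management: 2 units per OS, no add-ons
    units += protected_boxes / 40     # hardware: 1 unit per 40 protected boxes
    units += network_services / 2     # simple network services: 1 unit per 2 services
    return units


def admins_needed(units: float, level: str) -> int:
    """Round up to whole administrators at the given SAGE level."""
    return math.ceil(units / UNITS_PER_ADMIN[level])


# Hypothetical site: 300 users, 2 operating systems, 200 boxes, 6 network services.
units = total_units(300, 2, 200, 6)
for level in UNITS_PER_ADMIN:
    print(f"{units:.0f} units -> {admins_needed(units, level)} x {level}")
```

A real staff will be a mix of levels, so the per-level numbers are best read as bounds rather than a hiring plan.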

Other People's Ratios

In the last few years there have been a lot of people who have talked about the ratios they think are reasonable. It is common to hear people talking about staff/user ratios of 1:60 where there is some variation in the population and a lot of custom work, and staff/user ratios of 1:150 (or higher) in locations that can use "cookie cutter" solutions, e.g. universities with hordes of undergraduates or enterprises where people are using computing as a tool rather than looking to innovate on the machines that are being administered. A more realistic set of ratios (based on best practices in the field rather than vendor white papers on TCO) was the Mega Group's Improve staffing ratios article. There are a number of other studies that have found that in the real world most organizations have not been able to support user-to-staff ratios greater than 30:1. A Mitre study from 2000 suggested that the ratio is 47:1 +/- 17%. In a video about User to Technician Ratios by Justin Nguyen, a base ratio of 60:1 was suggested, with a number of factors which impact this ratio.
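
To see what a figure like 47:1 +/- 17% means in headcount, the sketch below turns a users-per-staff ratio and its spread into a range of support staff. It assumes the 17% is a relative spread around the ratio itself, which the study summary above does not spell out; the function name and the user counts are invented for illustration.

```python
import math


def staff_range(users: int, ratio: float = 47.0, spread: float = 0.17) -> tuple:
    """Return (fewest, most) support staff implied by a users-per-staff ratio and its spread."""
    high_ratio = ratio * (1 + spread)  # more users per staff member, so fewer staff
    low_ratio = ratio * (1 - spread)   # fewer users per staff member, so more staff
    return math.ceil(users / high_ratio), math.ceil(users / low_ratio)


for users in (100, 500, 1000):
    fewest, most = staff_range(users)
    print(f"{users} users: {fewest} to {most} support staff")
```

Passing a different ratio, for example staff_range(1000, ratio=30, spread=0), gives the headcount implied by the other figures quoted above.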

An example of overinflated numbers can be found in Staffing for Technology Support, a white paper for educational institutions. Unfortunately, these folks are trying to apply staffing ratios from MIT's Project Athena to the rest of the world. This is flawed for three reasons. First, most sites don't have the sophisticated tools that Athena had. Second, Athena had people who made Athena run who were not captured in their ratios: student volunteers who did a lot of work and hard-core system programmers who developed tools which met MIT's requirements. Finally, MIT's user population is not an average user population.

David Cappuccio of the Gartner Group suggested in his article Know The Types: Sizing up Support Staffs that there are two ratios that you need to consider. The first ratio is staff to users, an attempt to capture the human part of the equation. This ratio looks at how many people you need to do what is often called Tier I, help desk, or user support. The second ratio is the number of machines and subsystems per staff member, which captures how many people are needed to take care of the technical infrastructure. While I like David's framework, I think that his ratios are too high for user support, and that he has failed to capture the diverse set of technologies most organizations deploy: there is much more than print, file, web, and database servers. There are directory, security, messaging, and collaborative services. To complicate matters, many sites are heterogeneous, requiring extra effort to make one service work for all clients or, worse, resulting in the need for separate services based on the client platform. A final complicating factor is that these services often have complex interactions and dependencies which make them more difficult to deploy and maintain. The result is that David's ratios will result in staffing which will be able to deliver only the most basic services at an adequate level.

The itbenchmark blog has a number of postings on the topic of staff sizing.

Conclusion

The number of administrators required varies greatly from site to site. The one constant is that there are rarely enough system administrators for the responsibilities that they have. My personal experience is that it is possible for a single person to maintain up to 120 machines (with three different platforms) and give adequate user services to a fairly sophisticated user population. My time is divided between user services (30 percent), general system administration tasks (20 percent), installing new machines and hardware/network support (10 percent), software installation and maintenance (40 percent), custom software development and tracking of trends (25 percent), and site planning (10 percent). You will note that this adds up to 135 percent.

About the Author

Mark Verber wrote the first edition of this article while he was working for the Physics Department at The Ohio State University, based on his eleven years of experience providing computing services to large numbers of undergraduate students and supporting a moderate number of highly technical graduate students, research staff, and faculty members. This article has been revised based on an additional sixteen years of experience in industry: at Xerox's Palo Alto Research Center, supporting researchers and running core infrastructure for Xerox's worldwide corporate network; at WebTV/Microsoft, supporting approximately 1M subscribers while working on security issues; at Tellme Networks, where he built the team and co-architected a service which delivered 99.994% serviceability to Fortune 100 companies; and now at Metaweb Technologies, where he is the director of operations.