How Many Administrators are Enough?
Estimating IT Staffing / Factors that Affect Scaling IT Staff
MARK VERBER
Appeared in Unix Review, April 1991
Minor update March 20, 2008... this really needs a full rewrite
I posted a note to Usenet responding to a question about staffing ratios.
Rob Kolstad asked me to expand that short note into an article for Unix Review.
The original article in troff -ms form is still around.
The Enigmatic Question
"How many system administrators does a site need?" is a commonly asked and
difficult to answer question. There is no magic ratio. There is no ideal
staffing level. The appropriate
number of administrators depends on what each system administrator is
responsible for and on the level of service expected in each area of
responsibility.
The best way to estimate the number of administrators needed is to
figure out what level of service is required and how various
factors (for instance networking infrastructure and heterogeneity of
the machines being supported) will affect the fulfillment of those
responsibilities. Rarely are system administrators doing only
"administrator" tasks. The first part of this article will detail
the tasks that I find myself performing in addition to the normal
"administrator" tasks, such as backups, installing new users,
operating-system maintenance, and so forth. Additional tasks are
presented (for the most part) in the form of questions. The second
part details some of the various factors that will affect staff
levels. The third part details some simple perspectives that system
administrators can adopt to make their environment more easily
administrable. Finally, I will end by quickly
examining some ratios which might help you to approximate your staffing needs.
What System Administrators Do
User Services (e.g. Helpdesk)
How much hand-holding is expected? Some sites have users who are
pretty self-sufficient; other sites have users who need assistance
for everything they do. Can your users take care of themselves or do
they need and want the administrator to perform even the simplest tasks for
them? For example, I have a friend whose users demand that he perform
the most basic tasks for them (such as moving their files from one
directory to another). Anything that isn't simply invoking the text editor
or reading mail is "UNIX" and hence a job for the administrator. This
sort of support requires a ratio something like one administrator
for every four users.
Does the site want you to conduct workshops or prepare
extensive local documentation? To what extent are you
expected to consult on technical issues? Do you
concern yourself with just UNIX or other realms? For
example, let's say your site has heavy users of TeX,
Mathematica, Common LISP, C++, X11, PostScript, and Sybase. Are you
supposed to be able to answer detailed questions
on all those topics? Few people are experts at all these things.
Something that many people don't appreciate is that development of
expertise in any given topic area requires time to play, experiment,
and mature in that area.
Software Support
How much public domain software or freeware do people want installed?
What level of support are they expecting? Just compiling and
installing software doesn't take much time. Often though, software
doesn't just compile and install properly. There are often
assumptions in the software which need to be changed before the
software can be used at a given site. In addition, administrators are
often expected (and rightly so) to continue maintenance of the
software (bug fixes and what not) and to become an expert in the use
of the software. Compiling and installing (coupled with frequent
patches) on many hardware/software platforms can make this incredibly
time consuming for even just a few software packages. The time this
takes varies with the quality and complexity of the software. Keeping a
current version of kermit or perl isn't hard (I wish
everyone did as nice a job as Larry Wall has with perl); keeping
up with g++ is much more time-consuming.
Custom Software
Most places not only expect the system administrators to keep their
world running, but also to create -- on demand -- tools for the user
population. This is understandable, especially in small sites where
the administrator might be the only professional programmer. If there
is this expectation, time must be allocated for this development
process.
Site Planning/Administration Overhead
How much site planning is the administrator expected to handle? Must
the administrator know that the average person generates 115
watts of heat, and how to combine that with the heat loads from machines to size
air conditioning and power appropriately? How much
paperwork is there?
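The heat-load arithmetic mentioned above can be sketched roughly as follows. The 115 watts per person comes from the text; the machine counts and wattages are made-up figures for illustration, and the function names are hypothetical. The conversions used are the standard ones (1 watt is about 3.412 BTU/hr, and one ton of cooling is 12,000 BTU/hr). Treat this as a back-of-the-envelope sketch, not a substitute for a proper HVAC calculation.

```python
# Rough machine-room heat load estimate (illustrative sketch only).
WATTS_PER_PERSON = 115        # figure quoted in the article
BTU_PER_HR_PER_WATT = 3.412   # standard conversion
BTU_PER_HR_PER_TON = 12_000   # one ton of cooling


def heat_load_watts(people, machine_watts):
    """Total heat in watts from people plus a list of machine power draws."""
    return people * WATTS_PER_PERSON + sum(machine_watts)


def cooling_tons(watts):
    """Convert a heat load in watts to tons of air conditioning."""
    return watts * BTU_PER_HR_PER_WATT / BTU_PER_HR_PER_TON


if __name__ == "__main__":
    # Hypothetical room: 4 people, 10 workstations at 300 W, 2 servers at 1500 W.
    machines = [300] * 10 + [1500] * 2
    load = heat_load_watts(4, machines)
    print(f"{load} W -> {cooling_tons(load):.2f} tons of cooling")
```

The point of writing it down is less the numbers than the habit: if the administrator is expected to do this sizing, it is time that has to be counted.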
Hardware/Network Maintenance
Who crawls through the ceiling to pull wires? Who finds the flaky
transceiver when the Ethernet starts to go crazy? When a terminal or
workstation dies, does a secretary just call your vendor and wait, or are more
creative solutions required? Does your site buy all its peripherals
ready-to-install or do you save money by purchasing components and do the
integration yourself? Having a system administrator do any of these
things takes time.
Anticipate Technology
Is the administrator supposed to anticipate new technology and advise
the company about new approaches? Most places I have worked expect
administrators to have a good feel for the state of the art and new
technology that looks promising (not just products, but research,
too). Anticipation is often necessary given many sites have a
two-to-five year planning or depreciation schedule. Keeping up with
our field isn't easy. There are a variety of sources one must draw upon
to stay current. Trade rags can give you a picture of what is being sold.
Usenet (and other electronic media) is great for questions regarding
current issues and problems. Professional journals from ACM, IEEE, etc.
are useful to see what is happening on the almost-done research
front. There is no substitute, however, for a good network of professional
contacts. This network can be maintained with phone calls, electronic mail,
and conference attendance.
Other Issues
Sites with one administrator are not very desirable.
They are a fact of life since many small sites can neither afford nor
justify more than one system administrator. It is difficult for
one person to have the breadth of knowledge and experience to run a
really first-class site, no matter how few machines it has. There
will always be some area that is not the strength of a sole
administrator.
Another problem is that the site with a single system
administrator has a single point of failure: when the administrator is
on vacation (or gets run over by a bus), the site is vulnerable.
Carrying a pager on vacation isn't my idea of fun; however, no one can
predict when a crisis might occur. Of course, it's hard to
interest a high-level person in a job that also involves changing the
backup tapes and crawling through the ceilings.
The more homogeneous a site is, the easier it is to support.
The number of different platforms supported (different machine
architectures or different operating systems) increases the complexity
of the support task. Upgrading the operating system will have to be
done at least once by hand for each platform. Each operating system
has its own idiosyncrasies that must be learned and mastered. Most
sites want all the platforms to appear identical so that their users
can sit down on any of the workstations and get work done. This
requires that each platform have identical tools, window systems, etc.
This can greatly increase the amount of work the administrator must do. In the
best of circumstances this means recompiling programs for each
platform. In the worst circumstances, it involves porting software,
and fighting with vendor-supplied software. My personal nightmare is
trying to support all of X11R4 (from MIT), DECwindows, OSF/Motif, and Sun's
OpenWindows on three different platforms.
Larger sites can exploit economies of scale.
Large sites can expand their administration staffs less rapidly than
the number of users (or workstations) grows.
The reason for this is
that as your staff gets larger it is possible for people to
specialize. This specialization permits individual staff members
to develop a depth of expertise that enables them to understand all
the issues on a given topic and solve more quickly
whatever problems crop up.
Secondly, larger sites can leverage off previous
work. The first installation of a machine or piece of software is
always the most difficult. The second is easier. By the time
you have done 50 or 100 installations, you have developed automatic
scripts and can do installations in your sleep. I have seen large
sites at a 1:100 administrator-to-machine ratio where things ran pretty
well. I must caution the reader though: this sort of ratio is only
feasible with top-notch people working in a carefully engineered
environment with many hundreds of users. Most sites can't get
productive work done with this sort of ratio. This sort of ratio also
limits the professional growth of members of the system staff because
they will spend most of their time with the day-to-day issues and
fire-fighting. This is a shame since an organization's most valuable
resource is its people.
High availability sites require higher staffing.
Sites which need to be highly available (e.g. greater than 99.9% service
delivery) will require a higher level of staffing. The reason for this is that
you need people who can respond almost immediately to any service issues (e.g.
24x7 coverage, ideally at least 2 people deep who can do first and second level
resolution, and be able to escalate to subject area experts). You also need to have
multiple people for each subject area who are able to diagnose and resolve complex
issues quickly.
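To make an availability target such as the 99.9% figure above concrete, the sketch below converts a percentage into the downtime it allows per year. It assumes a 365-day year and the function name is hypothetical. At 99.9% the budget is about 8.76 hours per year, which leaves little room for slow response.

```python
# Allowed downtime for a given availability target (illustrative sketch).
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent):
    """Hours per year a service may be down and still meet the target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        hours = downtime_hours_per_year(target)
        print(f"{target}% -> {hours:.2f} hours/year ({hours * 60:.0f} minutes)")
```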
Hints for Making Administration Easier
As is sadly too often the case with support staff, system
administrators are not highly regarded (even though everyone at the
site depends on them). My experience is that there are never enough
system administrators. Because staffing levels are not what they
should be, a system administrator needs to take all possible
(productive) short cuts and have a proactive rather than reactive
approach to system-administration tasks. If administrators do not
employ a proactive approach, they will find themselves constantly in a
"fire-fighting mode," which is counter-productive. System
administrators need to leverage their time as much as possible.
Here are some of the things that help me survive at my site.
- Build Tools!
Always do your work with scripts or
tools. If you have to install a program or modify a set of
configuration files, you will most likely have to do it again. Build
small tools to do the work for you. Never do things by hand (or at least
never do things by hand more than once).
- Automate Everything!
Use tools that take care of things automatically. Clean your logs
with a shell script that runs from cron. Use programs that will
automatically update workstations from a "master" so you only have
to install software on one machine and the software is automatically
"distributed" to all the other machines. Berkeley's rdist
program does this by "pushing" new copies of software. CMU's
sup does this by having the workstations "pull" new copies of
a program to the workstation.
- Carefully Encapsulate Localizations
Minimize the number of nonstandard pieces you have to add when you perform new
installations on the operating system. Concentrate local changes in
/usr/local (or use some similar scheme) as much as possible.
Try to refrain from hacking on and reinstalling
local versions of things in /bin, /usr/ucb, and so on. Throw out
vendor supplied /etc/rc* files and create your own. Your /etc/rc*
should provide all of the parameters in a single file
that you need to change to localize your machines.
- Standardize Environments and Configurations
If each of your machines is configured differently (such as different swap
sizes; some diskless, some dataless, some diskful; different software
installed), you are creating headaches for yourself. If you can have a
single "prototypical" machine from which you can clone distributions and
upgrades, software dissemination can be performed automatically.
For example, let's say you run a dataless configuration with /,
/var, and parts of /usr on a local disk (all other
files are accessed via the automounter). You could configure a diskless
partition on a server that would boot up, install your localized
operating system on the small local disk, and reboot as a newly
configured workstation ready for action just by editing two or three
files on your server that specify how to boot your diskless client.
If you have to configure and install each machine by hand, you will
waste time whenever you install a new machine or
have to do an OS upgrade. Leverage uniformity!
- Document Your Environment
If you regularly get the same
questions from your users, you have failed to document effectively.
Spend the up-front time to document; you will save that time (and more) on the
back end through fewer and shorter support calls on those topics.
- Share the Work
Most sites have a number of highly motivated
and clueful users. Harness their energy. Find ways for them to
help out and find ways to encourage people to serve themselves.
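As one small illustration of the "Build Tools" and "Automate Everything" hints above, here is a minimal sketch of a log-cleaning job meant to run from cron. The article suggests a shell script; Python is used here only for the sake of a self-contained example. The script name, the directory, the 14-day retention period, and the crontab entry in the comment are all hypothetical. It defaults to a dry run so that nothing is deleted unless --delete is given.

```python
#!/usr/bin/env python3
"""Remove old log files; intended to be run from cron (illustrative sketch).

Hypothetical crontab entry, running at 03:00 every day:
    0 3 * * * /usr/local/sbin/clean_logs.py /var/log/myapp 14 --delete
"""
import os
import sys
import time


def old_files(directory, max_age_days, now=None):
    """Return paths of regular files in directory older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    result = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            result.append(path)
    return result


def clean(directory, max_age_days, dry_run=True):
    """Delete old files, or only list them when dry_run is True."""
    victims = old_files(directory, max_age_days)
    for path in victims:
        if dry_run:
            print("would remove", path)
        else:
            os.remove(path)
    return victims


def main(argv):
    """Usage: clean_logs.py DIRECTORY MAX_AGE_DAYS [--delete]"""
    if len(argv) < 2 or not os.path.isdir(argv[0]) or not argv[1].isdigit():
        print("usage: clean_logs.py DIRECTORY MAX_AGE_DAYS [--delete]")
        return 2
    clean(argv[0], int(argv[1]), dry_run="--delete" not in argv[2:])
    return 0


if __name__ == "__main__":
    main(sys.argv[1:])
```

Keeping the script under /usr/local follows the "Carefully Encapsulate Localizations" hint: the local tool lives with the other local changes and survives an operating-system upgrade.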
What About Other Platforms?
The platform which is being supported makes a great deal of
difference. My experience is that support of Macintosh and UNIX
communities takes approximately the same staffing levels. Historically, support of
PCs running any Microsoft OS seemed to require at least double the
staffing and delivered a lower level of service. Since Windows XP the ratio
doesn't need to be as high... but I still find administration scales better on
UNIX than Windows.
Know What Drives Scaling
A colleague suggested to me that it is critical to keep in mind what factors
affect the scaling of a team. He provided a nice summary table.
| Increased SA Efficiency | Decreased SA Efficiency |
|---|---|
| Enterprise monitoring | Diverse hardware baseline |
| Common SA tools | Diverse software baseline |
| Standards | Lax IS security |
| Robust IS security | Little or no training |
| Tight control over what gets loaded on HW/SW baseline | A staff that is reactive, not proactive |
| Redundancy of critical services | Ad-hoc backups or no backups |
| Separating services (single service machines) | |
| Good training program | |
| Detailed disaster recovery plans, by system | |
| Systems which don't require backups | |
| Good backup/restore program, centrally managed | |
My Ratios
The following is a very rough set of rules I use to
estimate staffing requirements. Your mileage will vary. I
should note that these numbers assume maintaining a reasonably stable
environment. Rapid turnover of the user base or machines, abnormally frequent
software changes, growth of the environment, etc. result in more work and affect
the ratios.
| Type of Work | Units of labor to deliver best practice performance, and scaling factors |
|---|---|
| IS/IT User Service | 1 unit for every 30 users who get good service. 1 unit for every 120 who get minimal service (e.g. students in an educational factory who mostly self-serve :-), assuming 8x5 support. The ratio of staff to users has to go up if you want the help desk to run extended hours. |
| 24 x 7 Support (Partners, clients, etc.) | Doing a 24x7 NOC which requires proactive notification and rapid problem resolution scales against the number of clients and the complexity of the service that is being managed. Places that really care about this have a step in cost of 14 people: a manager, an assistant manager, and three shifts with two people per shift, in two teams (one running Sunday-Wednesday and the other Wednesday-Saturday) so there is overlap between teams, clean handoffs, and time to do group training. Less than this can easily result in shifts not being covered. For example, having a single person per shift can fail if the night shift person falls asleep, or if someone working one of the weekend shifts gets sick. This doesn't count folks to escalate to. Some places have their ops / engineers carry pagers and back up the front line NOC. Other places have dedicated folks to handle escalations. The number of people needed per shift is related to how much normal work there is, and how many simultaneous disasters the team is expected to be able to handle. |
| Operating System Management | 2 units for each make of OS requiring basic background. If you are pushing the OS beyond mainstream / tested scale, add an additional 4 units. If you are doing very complex things requiring hacked kernels, non-standard device drivers, etc., add 4 units. If you really care about security, add an additional 4 units. |
| Hardware Management | 1 unit for every 20 boxes if you can't protect the OS and system configurations from the users (Windows in many environments). 1 unit for every 40 boxes if you can protect the OS from the users without hindering the user, but can't automatically build / rebuild / update the OS and software without sysadmin oversight. 1 unit for every 120 boxes without users which have network based software installs (compute clusters). Extremely large scale operations (1000s of machines running completely cookie cutter) scale more like 500 boxes / unit. |
| Platform Interoperability | 2 units * # of OS if tight coupling (shared filing, etc.). |
| Network Connectivity | Scales against number of network devices, number of networks, security issues, complexity of routing, and HA requirements. Don't have good numbers at this time. |
| Simple Network Services | for each basic subsystem that is set up network wide instead of machine wide, e.g. newsspool, httpd, DNS, mail, printing, SAMBA. Add 2 units if you want to make them highly available (better than 99.9%). Add 2 units if you care about security. Add 2 units if you are scaling larger than the average. |
| Complex Network Services | Highly variable. For example, a multi-terabyte database used for data mining could easily consume multiple DBAs plus multiple senior system administrators who specialize in performance tuning and large scale storage systems. |
A solid SAGE II system administrator can handle 4 units of work. A strong SAGE
III system administrator can handle 8 units of work. A superior SAGE IV system
administrator can handle 12 units of work. This counting system is loosely based
on an equation proposed by Sherwood Botsford and found in the comp.unix.admin
FAQ. At some point I will update the counting to use my
Operations Skill Matrix (excel).
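The unit counting above can be turned into a back-of-the-envelope calculation. The sketch below covers only the rows of the table that have plain numeric rules (user service, hardware management, basic OS management, and platform interoperability) plus the 14-person NOC step; the add-on units and the network rows are left out. The function names, dictionary keys, and the example site are hypothetical, and the result should be read as a rough starting point in the same spirit as the table itself.

```python
"""Back-of-the-envelope staffing estimate from the unit table (sketch only)."""
import math

USERS_PER_UNIT = {"good": 30, "minimal": 120}
BOXES_PER_UNIT = {
    "unprotected": 20,     # users can alter the OS and configuration
    "protected": 40,       # OS protected, but builds need sysadmin oversight
    "cluster": 120,        # no users, network based installs
    "cookie_cutter": 500,  # thousands of identical machines
}
UNITS_PER_ADMIN = {"SAGE II": 4, "SAGE III": 8, "SAGE IV": 12}


def work_units(users, service, boxes, box_kind, os_count, tight_coupling=False):
    """Sum the units of work for one site using the simple rows of the table."""
    units = users / USERS_PER_UNIT[service]
    units += boxes / BOXES_PER_UNIT[box_kind]
    units += 2 * os_count          # basic OS management
    if tight_coupling:
        units += 2 * os_count      # platform interoperability
    return units


def admins_needed(units, level):
    """Administrators of a given SAGE level needed to cover the units."""
    return math.ceil(units / UNITS_PER_ADMIN[level])


def noc_headcount(shifts=3, people_per_shift=2, teams=2, managers=2):
    """Head count for the 24x7 NOC step described in the table."""
    return managers + shifts * people_per_shift * teams


if __name__ == "__main__":
    # Hypothetical site: 120 users with good service, 120 protected boxes,
    # three tightly coupled operating systems.
    units = work_units(120, "good", 120, "protected", 3, tight_coupling=True)
    print("units of work:", units)
    for level in UNITS_PER_ADMIN:
        print(level, "administrators needed:", admins_needed(units, level))
    print("24x7 NOC head count:", noc_headcount())
```

For the hypothetical site this comes to 19 units (4 for users, 3 for boxes, 6 for OS management, 6 for interoperability), which is three SAGE III administrators. The NOC function reproduces the 14-person step: 2 managers plus 3 shifts of 2 people in each of 2 teams.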
Other People's Ratios
In the last few years there have been a lot of people who have talked
about the ratios they think are reasonable. It is common to hear
people talking about staff/user ratios of 1:60 where there is some
variation in the population, and staff/user ratios of 1:150 (or higher) in
locations that can use "cookie cutter" solutions, e.g. universities with hordes of
undergraduates. A more realistic set of ratios (based
on best practices in the field rather than vendor white papers on TCO) was the Mega Group's
Improve staffing ratios article. There are a number of other studies that
have found that in the real world most organizations have not been able to
support ratios greater than 30:1. A Mitre study
from 2000 suggested that the ratio is 47:1 +/- 17%.
An example of over-inflated numbers can be found in
Staffing for Technology Support, a white paper for
educational institutions. Unfortunately, these folks are trying to apply
staffing ratios from MIT's Project Athena to the rest of the world. This
is flawed for three reasons. First, most sites don't have the
sophisticated tools that Athena had. Second, Athena had people who made
Athena run who were not captured in their ratios:
student volunteers who did a lot of work and hard-core system programmers who developed tools which met
MIT's requirements. Finally, MIT's user population is not an average user population.
David Cappuccio of the Gartner Group suggested in his article Know The
Types: Sizing up Support Staffs that there are two ratios that you
need to consider. The first ratio is staff to users, an attempt to
capture the human part of the equation. This ratio is looking at how
many people you need to do what is often called Tier I, help desk, or
user support. The second ratio is the number of machines and
subsystems per staff, that is capturing how many people are needed to
take care of the technical infrastructure. While I like David's
framework, I think that his ratios are too high for user support, and
that he has failed to capture the diverse set of technologies most
organizations deploy: there is much more than print, file, web, and
database servers. There are directory, security, messaging, and
collaborative services. To complicate matters, many sites are
heterogeneous, requiring extra effort to make one service work for
all clients, or worse, resulting in the need for services which are based
on the client platform. A final complicating factor is that these
services often have complex interactions and dependencies which make
them more difficult to deploy and maintain. The result is that David's
ratios will result in staffing which will be able to deliver only the
most basic services at an adequate level.
Conclusion
The number of administrators required varies greatly from site to
site. The one constant is that there are rarely enough system
administrators for the responsibilities that they have. My personal
experience is that it is possible for a single person to
maintain up to 120 machines (with three different platforms) and give
adequate user services to a fairly sophisticated user population. My
time is divided between user services (30 percent), general system
administration tasks (20 percent), installing new machines and
hardware/network support (10 percent), software installation and
maintenance (40 percent), custom software development and tracking of
trends (25 percent), and site planning (10 percent). You will note
that this adds up to 135 percent.
About the Author
Mark Verber wrote the first edition of
this article while he was working for the Physics Department at The Ohio State
University, based on his eleven years of experience providing computing services
to a large number of undergraduate students, and supporting a moderate number
of highly technical graduate students, research staff, and faculty members. This
article has been revised based on an additional sixteen years of experience in
industry, having worked at Xerox's Palo Alto Research Center supporting
researchers and running core infrastructure for Xerox's worldwide corporate
network, WebTV/Microsoft supporting approx 1M subscribers while working on
security issues, Tellme Networks where he built the team and co-architected a
service which delivered 99.994% serviceability to Fortune 100 companies, and now
Metaweb Technologies where he is the
director of operations.