Lessons Learned Running a Production Service

Mark Verber


Once upon a time I was working for a company facing tough economic times. We had to lay some people off and needed to run lean until revenue caught up with expenses. We didn't want to radically shift people's jobs or create a multi-class workplace (e.g. people who were doing the day-to-day work, and the "special people" who got to work on projects), so we tried to use people's skills for both internal IS/IT and running the production service that customers paid for. People typically had projects in addition to their day-to-day jobs.

What resulted was little work on IS/IT, since production was always more important. Most of the team had trouble finishing the long-term projects because they let the day-to-day consume them. This set up a situation where people didn't feel bad when they missed long-term goals, because they were busy keeping the existing world running.

We needed to split day-to-day work from longer-term projects, and hold the long-term projects accountable for delivering. We could have fought the multi-class workplace by periodically rotating people between the two groups, and by stressing to the long-term people that their work would be judged by how successful the day-to-day folks were. The day-to-day folks would be encouraged to speak up for themselves and to understand that they hold in their heads many things the project-oriented people need to know... but also to recognize that they can be so deep in the details that they miss how to make things better.


Too often operations groups commit to a workload that isn't possible. As a result, ops teams often feel that they are behind, even when they are working hard and well. It never lets up. This can quickly leave a team feeling under siege and defeated. Rather than accept commitments that can't be kept, it's better to communicate realistically up front what can and can't be done. That way the people who depend on the ops team to deliver have a realistic understanding of what may or may not get finished. They can change plans or work with the ops team to get additional resources. What you don't want is a project to run for several months only to have the ops group say at the 11th hour, "We don't have time to deploy this." The other thing that often happens is that the larger projects end up being "so important" (i.e. too big to fail) that everything is sacrificed to make them succeed. Often this means that small things, some of which would take only a minute or two and really unblock someone, are rejected because they weren't "in the plan". If a group is failing to make its commitments, needs should be assessed and renegotiated.


Several times I have worked places where we planned to have a tools team but didn't have the headcount to dedicate to tools, so we formed a combined tools / service ops group. This was a mistake. Day-to-day needs made it very difficult to get longer-term projects done. Also, the people who were becoming our "tools" people weren't the strongest engineers and were not getting enough input to turn them into good engineers. We handed our tools folks over to engineering, where we were promised they would get additional help, more active and higher-quality mentoring / management, and be combined with others so more people would be working on tools. This worked for a short time, but then two things happened. First, tools were written to solve the engineers' problems, not necessarily the operations team's problems. Second, as soon as the engineers' problems were solved, they moved on to other issues, leaving the netops team without good tools support.

There are two solutions to this. The first is for an ops team to have at least one person who is given time to do real tools development without day-to-day interrupts. The second is to distribute tools work throughout the team (people make small tools for themselves), and when larger things are needed, in essence contract out the development to some sort of dedicated resource (internal or external) with very well-developed specifications.


Often operations groups are willing to take things from engineering that aren't ready, and then throw people at the flaws. This is often referred to as "products being thrown over the wall". Promised fixes frequently never arrive because other issues took priority. Operations groups should establish a clear set of deployment requirements. Accepting incomplete work often sends the wrong message: people notice that the deployment was successful and fail to see the operational expense and the opportunity cost, so there will be no pressure on engineering to finish the job. If something isn't shipping, that puts a lot more pressure on the engineers to finish the job.

For this to work, the operations group really needs to work with engineering on good requirements. The operations group needs to actively engage with larger projects so they aren't surprised as the project changes. As deadlines loom, "less critical features" will get cut. Customer-facing features are typically championed by product managers; the ops team needs to make sure their "features" don't get cut. Interfaces between organizations need people who actively manage them.


Often we have been faced with more projects than we could possibly complete. The first thing we would do was ask whether, if we scaled back the work to 80% solutions (roughly good enough), we could get enough of the important projects finished. There were two problems with this. First, the cost of managing the remaining 20% by hand was almost always higher than we expected. Second, finishing the project (which we were unlikely to do) hung over people's heads.

Our team is addicted to 80% solutions. But if we don't fully finish things, then we need smart people to use the tools, follow the process, etc. There are only so many clueful people. Staffing will be hard, and it will be very hard for new people to get up to speed.

Furthermore, issues related to the last 20% will often pop up at the worst possible time, which makes scheduling even more difficult.


It's critically important to document the overall architecture and core design principles. Failure to do this has a number of consequences. First, new people (and even older people) have trouble coming up to speed. Second, when the architecture isn't clearly in most people's heads, people are likely to pull in different directions. Finally, not having clearly documented and agreed-upon architecture / design principles means that any time we wanted to do something new, we had to get lots of people in a room to hash through things, because you never knew what bits of information were in people's heads (each person might know a constraint that mattered), and each person might be pulling in a different direction. By writing these documents we would have driven to a clearly articulated conclusion, which would allow one person to operate consistently with the group direction without having to involve the whole team. It would also make it much easier for team members to come up to speed quickly, and facilitate cross-team communication.


In general, I have worked places where we were very successful at getting the right people on the bus (i.e. good hiring). But we typically didn't take effective action in the few cases when we brought in people who weren't getting the job done, or when we moved people into positions where they were not succeeding. It's easy to think that the wrong person is costing you half an FTE, i.e. that the right person would be twice as effective. This is not the case. My experience is that the wrong person can cost you 2-4 FTEs worth of labor. Why? People get frustrated and unmotivated. The wrong things get done, which forces everyone to stop what they are doing until the problems are fixed. People end up having to engineer around the problem person. As soon as you see a problem, fix it.


Often, especially when finances are tight, there is a tendency to defer hiring. This can work with a static workload. In a situation where a company is growing its customer base / workload, this will, in the long term, be a disaster. The longer staffing is put off, the more likely there will be a crisis. Once the crisis hits, there will be a tendency to hire in desperation, which increases the odds that the wrong people will get on the bus. Also, there are many problems that a few people with more time can solve better than lots of people thrown in near the end. See The Mythical Man-Month for more insight.


In large ops groups, most teams function as service bureaus or technology centers, with no group whose role is coordination (other than project management, which is more narrowly focused). This isn't good. When an ops group is running a complex infrastructure, there needs to be a set of people who "own" the overall service. These teams go by a variety of names: "Service Engineering", "Service Integration", "Systems Group", etc. What's important is that they have explicit responsibility for coordinating changes to the production service. In our case, this group's job was to take the various technologies that were being supplied to operations, make sure they really were ready to be deployed, and then work with the NOC to do the deployments. They provided an interface with the platform team (advocating for features needed to practically run the platform, and helping answer engineers' questions). They would sometimes write tools that pulled things together or made it easier for all of netops (before we had a NOC), or for the NOC, to run the service.