Part of Hints for Operating High Quality Services by
Very Very Early Draft 0.2 -- September 30, 2005
The most effective large-scale services are driven by cross-functional teams. The team tends to include people from marketing, engineering, QA, and operations working together to create and review specifications and set priorities for medium-term planning.
There needs to be a particularly tight connection between the people developing the system and the people who are going to be running that system day-to-day. I have found that developers are bad at predicting what operations folks are likely to do. They also don't know all the tradeoffs. Developers should work closely with operations folks to define what is best done in code and what is best done through operational practices. One approach is for the operations group to document their requirements and acceptance criteria: the developers build something that meets the documented requirements, and the operations group verifies that it passes the acceptance criteria. In practice this doesn't work very well, because getting specifications fully correct is very difficult. Much more effective is regular communication between the developers and operations staff. Before significant code is written there should be an architecture and design review that includes representatives from both the engineering and operations teams.
See Flickr's 10-deploys-per-day.
I have seen operations groups rely on engineering teams outside operations to provide the tools to operate a service. While I think the people building a service should provide appropriate operational interfaces to whatever they build, relying on external teams for core operational tools is almost always a disaster. It seems like a good idea. Who better to build solid tools than a well-trained engineer, rather than a sysadmin who will just hack something together? There is a large pool of engineers in the engineering department, and it is often easier to get engineering headcount. Plus, building solid tools is really aided by getting someone out of the day-to-day, so having them in the engineering department rather than in ops (which is interrupt driven) seems like a good separation. This can often work in the beginning, but it falls down later. First, the tools get written to solve the engineers' problems, not necessarily the operations team's problems. Second, as soon as the engineers' problems are solved they move on to other issues, leaving the netops team without good tools support. An ops team needs to have at least one person who is given time to do real tools development without day-to-day interrupts.
Day-to-day operations should be the task of a service center entrusted to run a service reliably. You don't want them also to be thinking about how to change the service for the future.
Day-to-day operations tend to be interrupt driven. You want to reward people for responding quickly to problems, but you can't count on how much time they will have for long-term projects. If you mix projects with day-to-day work, you will find that all but the most exceptional people have trouble finishing the long-term projects because they let the day-to-day consume them. This sets up a situation where people don't feel bad when they miss long-term goals because they were busy keeping the existing world running.
Split day-to-day from longer-term projects, and call for the people working on long-term projects to deliver well. Fight against the multi-class workplace by periodically rotating some of the people between the day-to-day and project groups, and by really stressing to the long-term people that their work will be judged by how successful the day-to-day folks are. The day-to-day folks need to be encouraged to speak up for themselves and to understand that they hold in their heads lots of things which the project-oriented people need to know... but also to realize they can be so down in the details that they might miss how to make things better.
If you mix production and IS/IT in the same person, IS/IT will get neglected because "production" will always have priority. Production produces revenue, after all. As a result, important IS/IT work will almost always get neglected. Decide how much investment will be made in IS/IT and wall that off.
Otherwise you need to have two people with the same technical information and the developer does not have adequate incentives to do the job right the first time.
Need for people who glue things together.
The best rule is to outsource only what is already working well, or something you haven't started and for which you know clearly what you want. Don't outsource something that is a problem and expect the outsourcer to make it better.
Workspace does make a difference.
Too often netops orgs commit to a workload or objectives that aren't possible. As a result people often feel like the operations team is behind, even when the ops team is working very well. The harsh truth of operations is that it never lets up. Often, operations teams are at the end of a long pipeline, so they feel like they can't push back on a big project because schedule buffers have been eaten up and there are demands for the "new thing". Since the big projects can't be pushed back on (and schedules are full), smaller things get pushed back on. Something which would take 5 minutes and unblock someone is given an instant "NO, I won't do it because it is not on the plan". Better to engage early and set people's expectations than to have things slip at the end.
How does radical collocation help a team succeed? – Teasley, Covi, Krishnan, Olson, CSCW’00, December 2-6, 2000, Philadelphia, PA.