Establishing a Good Production Operations Culture
Mark Verber
Personal Responsibility
- Keep promises... on time and within budget.
- Consider not only technology issues but also business issues.
- Distributed decision making... whoever is driving and
doing the work makes the decisions. But this means that you
are responsible for those decisions. Don't shift blame.
- If you break it, you fix it. If you built it, you maintain it.
Teamwork
An operations group has to work effectively within the
team and across teams in an organization. It is critical for the operations team
to have a good relationship with not only engineering, but also sales, product
marketing and finance. At the heart of effective teamwork is mutual respect and
trust. People need to trust that you will deliver what you promise, and
you need to be able to trust and depend on your teammates. Mutual respect
means that you let people do their jobs. If you aren't happy with their work, you
should work to help them improve. You should avoid at all costs taking on their
work. I have often seen operational groups go into the business of engineering.
Why? Because the engineering group wasn't delivering an effective product. While
this works for a time, it often results in the operations team losing focus (and
therefore doing a poorer job operating) while the engineering group doesn't get
better.
- Early engagement with business development and engineering to ensure
operations can support new
projects.
- Focus on adding unique value... don't pick up work which isn't our core
competency. Expect and demand that other teams do their work, and depend on them to do it.
- Consensus-oriented culture... get everyone on the same page. Have
a shared vision and perspective. Everyone should understand why
we are doing something. But realize that you don't have to wait for
complete consensus; sometimes a leader needs to drive a team even though
not everyone is on board.
- Don't depend on "heroes" or unique individuals. Always have multiple individuals who can do a job, and processes that ensure work gets done.
Lazy (the good kind)
- Don't throw people at it... automate. You shouldn't do something
more than twice by hand.
- Don't fix hard problems at the end; get the design right (work upstream)
High Quality
- Work for the long term (smart, lazy work)
- Fix things quickly even if you don't need them right now (NYC broken
windows)
- Document work for you and your coworkers.
- Develop appropriate processes at the beginning.
- Build tools to improve efficiency and reduce errors.
- Build things you are proud of, so that the next person who owns them
will be pleased to own them.
Systematic Improvement
- Don't try to fix everything at once: it is too hard, it will take
too long, and you don't yet know what you really want
- Always have metrics and data
- Make sure everything has instrumentation
- Collect real data
- Careful analysis
- Always have a process to improve the metric
- Improve incrementally, looking for the biggest payoff.
- Understand your goal. Each availability target allows only so much downtime:
- 99% - 5259 minutes/year (7 hours/month)
- 99.9% - 525 minutes/year (43 minutes/month)
- 99.99% - 52 minutes/year (4 minutes/month)
- 99.999% - 5 minutes/year (26 seconds/month)
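
The downtime figures above follow from simple arithmetic: multiply the minutes in a year by the fraction of time the service is allowed to be unavailable. The following is a minimal Python sketch of that calculation, not part of the original notes. The function name downtime_budget is hypothetical, and the 365.25-day year is an assumption chosen because it approximately reproduces the yearly figures listed.

```python
# Illustrative sketch: allowed downtime for a given availability target.
# Assumption: an average year of 365.25 days (525,960 minutes).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_budget(availability_percent):
    """Return allowed downtime as (minutes per year, minutes per month)."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    per_year = MINUTES_PER_YEAR * unavailable_fraction
    return per_year, per_year / 12


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        per_year, per_month = downtime_budget(target)
        print(f"{target}%: {per_year:.1f} min/year, {per_month:.2f} min/month")
```

Running the same function against a measured availability figure shows how much of the budget has already been spent, which makes the goal a concrete number to improve against.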
Understand Systems
- Expect failure and recover quickly
- Expect growth and be prepared to scale without large, nonlinear
jumps.
- Expect change and have systems which can be refactored without
throwing everything away and starting over.
- Other operational system principles
Technology Agnostic
- Use the technology which is appropriate for the job
- All vendors suck... so don't write off a vendor for who they are;
choose products and technologies because they will solve your
short- and long-term issues.
- Example: we would be willing to use Microsoft products, if and
when they have addressed issues related to running lights-out data
centers with a high degree of stability.
Other Discussions of Culture To Think About