In the wake of the recent MongoDB happy hour debacle, there have been a few mentions of DevOps and NoOps. The pieces were mostly about the fact that this incident proved that the IT business is not really in full DevOps mode, not to mention NoOps. I am not confident that NoOps will be the future for a vast majority of shops. Being from the Ops side of things, I am obviously biased toward anyone stating that NoOps is the future. Because that would mean no job left for me and my comrades in arms. But let me explain 🙂
I would like to be a bit more thorough than usual and explain what I see there, in terms of practices and trends.
Definitions
First let me set the stage and define what I mean by DevOps, and NoOps.
https://en.wikipedia.org/wiki/DevOps
http://www.realgenekim.me/devops-cookbook/
At its most simple definition, DevOps means that Dev teams and Ops team have to cooperate daily to ensure that they both get what they are responsible for : functionalities for Dev, and stability for Ops. A quick reminder though : business is the main driver, above all. This implies that both teams have to work together and define processes and tooling that enables fast and controlled deployment, accurate testing and monitoring.
We could go deeper into DevOps, but that is not the point here. Of course, Ops team should learn a thing or two from Scrum or any agile methodology. On the other hand, Dev teams should at least grasp the bare minimum of ITIL or ITSM.
What I could imagine in NoOps would be the next steps of DevOps, where the dev team is able to design, deploy and run the application, without the need of an Ops team. I do not feel that realistic for now, but I’ll come back to this point later.
How are DevOps, and the cloud, influencing our processes and organizations
I have worked in several managed services contexts and environments in my few years of experience, where sometimes Dev and Ops were very close, sometimes completely walled of. The main driver for DevOps, usually linked to cloud technologies adoption, on the Ops side, is automation. Nothing new here, you’ve read about it already. But there are several kinds of automation, and the main ones are automated deployment and automated incident recovery.
The second kind has a deep impact, in the long term, on how I’ve seen IT support organization and their processes evolve. Most of the time, when you ask your support desk to handle an incident, they have to follow a written procedure, step by step. The logical progress is to automate these steps, either by scripting them, or using any IT automation tool (Rundeck, Azure Automation, Powershell etc.). You may want to keep the decision to apply the procedure human-based, but it’s not always the case. Many incidents may be resolved automatically by applying directly a correctly written script.
If you associate that to the expanding use of PaaS services, which removes most of the monitoring and management tasks, you will get a new trend that has already be partly identified in a study :
https://azure.microsoft.com/en-us/resources/total-economic-impact-of-microsoft-azure-paas/
If you combine PaaS services which remove most of the usual L2 troubleshooting, and automated incident recovery you will get what I try to convince my customers to buy these days : 24*7 managed services relying only on L1.
Let me explain a little more what we would see there with an example based on a standard project on public Cloud. Most of the time it is an application or platform which the customer will develop and maintain. This customer does not have the resources to organize 24*7 monitoring and management of the solution. What we can build together is a solution like this :
• We identify all known issues where the support is able to intervene and apply a recovery procedure
• Obviously, my automation genes will kick in and we automate all the recovery procedures
• All other issues will usually be classified into three categories :
○ Cloud platform issues, escalated to the provider
○ Basic OS/VM issues, which should be automated or either be solved by Support team L1 (or removed by the use of PaaS services)
○ Customer software unknown issue/bug which will have to be escalated to the dev team
I am sure you now see my point : once you remove the automated recovery, and the incidents where the Support Team has to escalate to a third party… nothing remains!
And that is how you can remove 24*7 coverage for your L2/L3 teams, and provide better service to your customers and users. Remember that one of the benefits of automated recovery is that it’s guaranteed to be applied exactly the same each time.
A word about your teams
We have experienced firsthand this fading out of traditional Level 2 support teams. These teams are usually the most difficult to recruit and keep as they need to be savvy enough to handle themselves in very different situations and agree to work on shifts. As soon as they get experienced enough, they want to move to L3/consulting and to daytime jobs.
The good thing is that you will not have to worry anymore about staffing these teams and keep them.
The better thing is that they are usually profiles with lots of potential, so just train them a bit more, involve them in automating their work, and they will be happier and probably more to implementation or consulting roles.
How about you, how is your organization coping with these cloud trends and their impact?