Reintegrating the craft of Ops

Thijs van Leeuwen, Jeroen de Korte & Kim van Wilgen
Nov 14, 2023 · 10 min read · English
Annual report reintegrating the craft of ops

IT service delivery once entailed having physical and digital control of the whole tech environment, from the on-premises datacenter right up to the functional uptime of an application.

If an incident occurred or a system went down, the engineer could visit the local datacenter, pull plugs and push buttons, and carry out manual procedures to get operations up and running again. With the scope of resilience all under one roof, engineers could literally oversee the IT lifecycle and, being used to getting their hands on the hardware, they knew what to look for.

But when the public cloud became a fixture in the digital ecosystem, this changed. Offering unprecedented elasticity and scalability, the cloud encouraged widespread adoption of an OpEx model for IT spending, permitting smaller upfront investments and easy purchases via the credit card on file. With hosting and management handled by another party not only off premises, but usually in a land far away, the cloud introduced many layers of abstraction and shared service models. This reduced the operational upkeep work that had gone into running stacks locally and redirected the attention of engineers from ensuring the resilience of on-premises infrastructure to supporting the velocity of development.

It’s been about 15 years since the work of operations engineers (Ops) combined with the work of development engineers (Dev), coalescing into the DevOps model. While DevOps revolutionized how teams collaborate and the pace of product development and delivery, the need for speed still prevails across organizations. In an unpredictable world with fast-moving markets and exponentially complex IT landscapes, many organizations struggle with the processes that control their daily business. They feel they must choose between the resilience emphasized by Ops and the velocity driven by Dev. Yet, for Schuberg Philis and the enterprises we partner with, it’s not a tradeoff. Organizations can have both by adopting a holistic view of resilience and, within it, reintegrating the role of Ops today.

The craft of Ops
At its core, we see Ops as a craft and consider its engineers craftspeople. The concept of craftsmanship is fitting because it carries the connotation of expertise: spending hours learning a skill or a skill set and hand-tailoring its application to any given project. Ops engineers are masters at keeping an organization in working order, and there’s good reason to include them on a team from day one of any plan-build-run trajectory – rather than just calling them in when there’s a problem. These engineers have honed their expertise by cycling through real-life incident responses and learning on the job what an evolving organization needs to preserve its performance.

Having engineers positioned to provide advanced Ops support to an organization’s respective DevOps teams is key. This dedicated group of engineers can think from the ground up, taking what the enterprise needs as the starting point rather than making operational decisions based on what available assets, such as the cloud, can offer. Thanks to their lived experience, Ops engineers often have an instinctive ability to make these assessments. This enables them to draw clear boundaries to help protect an organization. For example, they can forecast how a change will impact an organization years later and, in response, prioritize the most critical components to take care of in the present day. They can specify the bare-minimum requirements to implement measures for disaster avoidance. These kinds of operational insights translate contemporary IT choices into long-term business consequences.

We sometimes compare Ops to playing simultaneous games of chess because it demands constant strategizing and anticipation of what might happen depending on which piece moves where and when. What’s more, this level of concentration must be applied over multiple layers of infrastructure. Yet, by giving sufficient space to the craft of Ops – and more opportunity for its practitioners to express when a decision in the present will prove unsuitable for the future – organizations can infuse resilience in every change, from the design phase right up to the run.

Healthy friction and crucial KPIs
Close collaboration between Dev and Ops ensures that the entire team shares responsibility for choices made throughout plan-build-run. While this arrangement serves the mutual goal of moving the business forward, a DevOps culture can also lead to vertical silos. In contrast to the days when specialized engineers sat together, traditionally as a centralized department, nowadays Ops engineers may rarely have the chance to talk with fellow Ops engineers. Similarly, within their multidisciplinary silos, they tend to lack an Ops-focused manager who can establish Ops-specific KPIs. This leaves serious gaps in attention to KPIs such as those concerning systems security, stability, and resilience.

But Ops thrives when Ops engineers can plan, build, and run the deepest level of their craft with fellow Ops engineers. Sharing best practices with one another empowers better-informed operational decision-making within their respective teams. In turn, the DevOps team as a unit can embrace the changes needed for business to progress without sacrificing their agile practices. Moreover, teams benefit from healthy friction, such as the type regularly produced through the role of an Ops manager serving as sparring partner to a Dev counterpart. By keeping Ops and Dev interests in continuous conversation – and negotiation – the team can collectively define crucial KPIs. That vision, supported by technical expertise, serves as the springboard for making business-minded decisions to meet those KPIs.

Above all, Ops engineers need to be able to fulfill the duty suggested by their title: operations. This ensures that while Dev engineers continue to create new features, Ops engineers don’t have to leave their resilience jobs behind. Within this homeostasis, manual interventions can be carried out effectively and followed up with structural configuration updates so that the next installation is sure to have the same parameters. This means, too, that permanent fixes and lasting improvements can be made in harmony with the plan-build-run trajectory rather than applying Band-Aid solutions due to a lack of time, budget, or story points. Ideally, this operational upkeep minimizes production incidents by incorporating greater standardization and automation.

More freedom for the business
Unlike modern-day software, resilience is not something that can be bought off the shelf. It is designed into a platform by incorporating operational predictability in every phase of plan-build-run. This ensures that the resilience of the platform never limits the pace at which the organization wants to move or the speed at which the business can progress. Concretely, at Schuberg Philis, we insist on proving a project’s resilience by conducting production acceptance testing (PAT). Before having an environment go live, we simulate various failures and supply evidence of predictability. In an already live environment, we trust the predictability previously proven in the PAT. The environment is constantly tested against a disaster avoidance scenario, leaving nothing to surprise, whether in a routine yearly test or in an unexpected real-life situation.
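To make the idea of simulating failures and supplying evidence a little more tangible, here is a minimal Python sketch. It is purely illustrative and not a description of how a PAT is actually run at Schuberg Philis: the `Node` and `Platform` classes, the failover logic, and the `run_acceptance_test` function are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical service instance that can be switched off to simulate a failure."""
    name: str
    healthy: bool = True

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


@dataclass
class Platform:
    """Hypothetical platform that fails over to the next healthy node."""
    nodes: List[Node] = field(default_factory=list)

    def handle(self, request: str) -> str:
        for node in self.nodes:
            try:
                return node.handle(request)
            except ConnectionError:
                continue  # fail over to the next node in the list
        raise RuntimeError("no healthy node available")


def run_acceptance_test(platform: Platform) -> Dict[str, bool]:
    """Take each node down in turn and record whether requests are still served."""
    evidence: Dict[str, bool] = {}
    for node in platform.nodes:
        scenario = f"survives loss of {node.name}"
        node.healthy = False  # simulate the failure
        try:
            platform.handle("ping")
            evidence[scenario] = True
        except RuntimeError:
            evidence[scenario] = False
        finally:
            node.healthy = True  # restore the node before the next scenario
    return evidence


platform = Platform([Node("node-a"), Node("node-b")])
evidence = run_acceptance_test(platform)
```

In this toy setup, every scenario in `evidence` comes back as passed because a second node is always available to take over. A platform with only one node would fail its single scenario, which is exactly the kind of finding such a test is meant to surface before go-live rather than after.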

Testing – and the predictability it validates – checks off an Ops engineer’s priority for 100% uptime. But ultimately, the entire IT landscape comes under better control when Ops can be reintegrated into an organization. The enterprise becomes infused with a robust operational culture that doesn’t have to compete with a strong Dev-driven culture. After all, resilience is not only about making sure systems stay stable and secure. It also requires having the adaptability and flexibility to accommodate growth and evolution, especially in a turbulent era when time and IT talent are limited. DevOps teams can then strike a balance between maintaining velocity and ensuring operations run smoothly in the here and now.

Within an operationally balanced enterprise, trust between the technology and the business flourishes. Trust drives business progress. A holistic view of resilience thus ushers in more freedom for the business. It enables the organization to focus on developments that will generate value and to further innovation and possible new ways of working. At the same time, even though the whole tech environment may no longer be under the same roof, it enables platforms to stay predictable and organizations, therefore, to stay the course on their path of resilience.

Want to know more?

Get in touch with Kim van Wilgen.