“The risk of loss resulting from inadequate or failed internal processes, people and systems or from external events.” - Operational Risk, Wikipedia
In this section we’re going to start considering the realities of running software systems in the real world.
There is a lot to this subject, so this section is just a taster: we’re going to set the scene by looking at what constitutes an Operational Risk, and then look at the related discipline of Operations Management. Following this background, we’ll apply the Risk-First model and have a high-level look at the various mitigations for Operational Risk.
When building software, it’s tempting to take a very narrow view of the dependencies of a system, but Operational Risks are often caused by dependencies we don’t consider - i.e. the Operational Context within which the system is operating. Here are some examples:
This is a long laundry-list of everything that can go wrong due to operating in “The Real World”. Although we’ve spent a lot of time looking at the varieties of Dependency Risk on a software project, with Operational Risk we have to consider that these dependencies will fail in any number of unusual ways, and we can’t be ready for all of them. Preparing for this comes under the umbrella of Operations Management.
If we are designing a software system to “live” in the real world, we have to be mindful of the Operational Context we’re working in and craft our software and processes accordingly. This view of the “wider” system is the discipline of Operations Management.
“Operations management is an area of management concerned with designing and controlling the process of production and redesigning business operations in the production of goods or services. It involves the responsibility of ensuring that business operations are efficient in terms of using as few resources as needed and effective in terms of meeting customer requirements.” - Operations Management, Wikipedia
The diagram above is a Risk-First interpretation of Slack et al’s model of Operations Management. This model breaks down some of the key abstractions of the discipline:
The healthy functioning of the Transform Process is the domain of Operations Management. As the above diagram shows (again, modified from Slack et al.) this involves the following types of actions.
Let’s look at each of these actions in turn.
Since humans and machines have different areas of expertise, and because Operational Risks are often novel, it’s often not optimal to try and automate everything. A good operation will consist of a mix of human and machine actors, each playing to their strengths (see the table below).
The aim is to build a human-machine operational system that is Homeostatic. Homeostasis is the tendency of living things to maintain an equilibrium (for example, in body temperature or blood glucose levels), but it applies equally to systems at any scale. The key to homeostasis is to build systems with feedback loops, even though this leads to more complex systems overall. The diagram above shows some of the actions involved in these kinds of feedback loops within IT operations.
|Humans Are…|Machines Are…|
|---|---|
|Good at novel situations|Good at repetitive situations|
|Good at adaptation|Good at consistency|
|Expensive at scale|Cheap at scale|
|Reacting and Anticipating|Recording|
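To make the idea of a homeostatic feedback loop a little more concrete, here is a minimal sketch of a sense-compare-act cycle. Everything in it is an assumption made for illustration: the `Operation` class, the worker pool it models, the utilisation target of 0.7 and the tolerance of 0.1 are all hypothetical, and a real operation would sense far more than a single number. Note how the machine handles the repetitive adjustments, but hands over to a human once it reaches the limits of what it is allowed to do.

```python
from dataclasses import dataclass


@dataclass
class Operation:
    """Hypothetical operation: a pool of workers handling incoming load."""
    workers: int
    min_workers: int = 1
    max_workers: int = 20


def utilisation(load: float, op: Operation, capacity_per_worker: float = 10.0) -> float:
    """Sense: the fraction of total capacity currently in use."""
    return load / (op.workers * capacity_per_worker)


def regulate(op: Operation, measured: float, target: float = 0.7,
             tolerance: float = 0.1) -> str:
    """Compare the measurement with the target and act to restore equilibrium."""
    if measured > target + tolerance and op.workers < op.max_workers:
        op.workers += 1
        return "scale-up"
    if measured < target - tolerance and op.workers > op.min_workers:
        op.workers -= 1
        return "scale-down"
    if measured > target + tolerance:
        # Overloaded, but already at the maximum: a novel situation for a person.
        return "escalate-to-human"
    # Within tolerance (or under-used at the minimum pool size): nothing to do.
    return "steady"


def run(loads: list[float], op: Operation) -> list[str]:
    """Run one sense-compare-act cycle per load sample."""
    return [regulate(op, utilisation(load, op)) for load in loads]


if __name__ == "__main__":
    operation = Operation(workers=2)
    print(run([10, 30, 30, 30, 5], operation), operation.workers)
```

The loop adds complexity (a target, a tolerance, limits and an escalation path) compared with a fixed pool of workers, which is the trade-off described above.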
As we saw in Map and Territory Risk, it’s very easy to fool yourself, especially around Key Performance Indicators (KPIs) and metrics. Large organisations have Audit functions precisely to guard against their own failing internal processes and Agency Risk. Audits can cover software tools, processes, practices, quality and so on. Practices such as Continuous Improvement and Total Quality Management also figure here.
There are plenty of Hidden Risks within the operation’s environment. These change all the time in response to economic, legal or political change. In order to manage a risk, you have to uncover it, so part of Operations Management is to look for trouble.
In order to control an operation, we need targets and plans to control against. For a system to run well, it needs to carefully manage unreliable dependencies, and ensure their safety and availability. Taking humans as an example, it’s the difference between Hunter-Gathering (picking up food where we find it) and Agriculture (controlling the environment and the resources to grow crops).
As the diagram above shows, we can bring Planning to bear on dependency management, and this usually falls to the more human end of the operation.
While planning is a day-to-day operational feedback loop, design is a longer feedback loop changing not just the parameters of the operation, but the operation itself.
You might think that for an IT operation, tasks like Design belong within a separate “Development” function within an organisation. Traditionally, this might have been the case. However, separating Development from Operations implies Boundary Risk between these two functions. For example, the developers might employ different tools, equipment and processes from the Operations team, resulting in a mismatch when software is delivered.
No system can be perfect, and after it meets the real world, we will want to improve it over time. But Operational Risk includes an element of Trust & Belief Risk: we have a reputation and the good will of our customers to consider when we make improvements. Because reputation and good will are very hard to rebuild once lost, we should consider them before releasing software that might not live up to expectations.
So there is a tension between “you only get one chance to make a first impression” and “gilding the lily” (perfectionism). In the past I’ve seen this stated as pressure to ship vs pressure to improve.
A Risk-First re-framing of this (as shown in the diagram above) might be the balance between:
The “should we ship?” decision is therefore a complex one. In Meeting Reality, we discussed that it’s better to do this “sooner, more frequently, in smaller chunks and with feedback”. We can meet Operational Risk on our own terms by doing so:
|Meeting Reality…|Example practices|
|---|---|
|Sooner|Beta Testing, Soft Launches, Business Continuity Testing|
|More Frequently|Continuous Delivery, Sprints|
|In Smaller Chunks|Modular Releases, Microservices, Feature Toggles, Trial Populations|
|With Feedback|User Communities, Support Groups, Monitoring, Logging, Analytics|
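As an illustration of “in smaller chunks”, here is a minimal sketch of a Feature Toggle that exposes a change to a Trial Population only. The function names, the `new-checkout` feature and the choice of hashing user IDs into 100 buckets are all assumptions made for this example rather than a description of any particular toggle library.

```python
import hashlib


def in_trial_population(user_id: str, feature: str, rollout_percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets for a given feature.

    Hashing (rather than choosing at random on each request) means a given
    user sees the same behaviour every time while the percentage is unchanged.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent


def checkout_version(user_id: str, rollout_percent: int = 10) -> str:
    """Hypothetical call site: route a user to the new or the old code path."""
    if in_trial_population(user_id, "new-checkout", rollout_percent):
        return "new"
    return "old"


if __name__ == "__main__":
    for user in ("alice", "bob", "carol"):
        print(user, checkout_version(user, rollout_percent=50))
```

Because a user’s bucket never changes, raising `rollout_percent` only ever adds users to the trial, and setting it to 0 switches the feature off for everyone without a new release. This limits how much trust is at stake each time we ship.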
In a way, actions like Design and Improvement bring us right back to where we started from: identifying Dependency Risks, Feature Risks and Complexity Risks that hinder our operation, and mitigating them through actions like software development.
Our safari of risk is finally complete: it’s time to reflect on what we’ve seen in the next section, Staging and Classifying.