Coordination Risk

Risks due to the fact that systems contain multiple agents, which need to work together.

Part Of

Operational Risk

Reduced By Practices

Change Management: Specifically addresses coordinating change in a structured way.
Meeting: Ensures everyone is on the same page regarding project goals and progress.
Pair Programming: Enhances collaboration and coordination between developers.
Requirements Capture: Reduces coordination risks around deciding what should be built.
Retrospectives: Identifies and addresses historic coordination issues through regular reviews.
Stakeholder Management: Allows stakeholders to coordinate on their demands.
Terms Of Reference: Provides a clear framework for coordination among team members and stakeholders.
Version Control: Facilitates collaboration by allowing multiple developers to work on the codebase simultaneously.

Attendant To Practices

Approvals: Requires coordination among stakeholders to provide timely sign-off.
Contracts: Contracting work can often involve setting careful terms to minimise coordination risks.
Delegation: Increases the number of entities involved in project coordination.
Marketing: Marketing efforts often need to be coordinated with other parts of the business.
Meeting: Meetings usually happen at a particular time so involve coordinating schedules.
Pair Programming: Requires coordination around time, place, activity and skills.
Review: Synchronous reviews require effective coordination among team members.
Terms Of Reference: Requires alignment and coordination among all parties to agree to the terms.
User Acceptance Testing: Requires coordination between the development team and end users.

Whenever we have multiple agents working together we have Coordination Risk. This happens even where the agents' goals are aligned and they don't suffer from Communication Risk.

As in Agency Risk, we are going to use the term agent, which refers to anything in a system with agency to make decisions: that is, an agent has an Internal Model and can take actions based on it. Here, we work on the assumption that the agents are working towards a common Goal, even though in reality it's not always the case, as we saw in the section on Agency Risk.

Coordination Risk is the risk that agents can fail to coordinate to meet their common goal and end up making things worse. Coordination Risk is embodied in the phrase "Too Many Cooks Spoil The Broth": more people, opinions or agents often make results worse.

In this section, we'll:

look at some classic problems of coordination,
build up a model of Coordination Risk, describing exactly what coordination means and why we do it,
then, we're going to consider the general problem of decision making, and consider the problem of agency at several different levels (because of Scale Invariance),
and finally, we'll look at the CAP Theorem and how this is a general problem, rather than specific to software systems.

Worked Example

On an open source software project, the maintainers are often required to make architectural decisions about the way the software works. Initially, they let every maintainer make their own decisions, but the codebase started to get chaotic, with multiple competing abstractions. Not only that, but different maintainers would find that their work had been partly subsumed or rendered obsolete by someone else. In short, the project had devolved into competition and the maintainers were starting to fall out.

Coordination Risk after introducing voting

As shown in the above diagram, they tried to remedy this by instituting a governance process wherein the maintainers voted on key architectural issues together. However, the debates took up a lot of time and often ended in further argument and misunderstanding.

One alternative suggested to them was to decide on a "Benevolent Dictator for Life (BDFL)". This was a title originally conferred on Guido van Rossum, creator of the Python Language, but other famous examples exist such as Rich Hickey (Clojure Language) and Linus Torvalds (the Linux Kernel).

Coordination Risk after introducing a BDFL

As shown in the above diagram, this solution to coordination risk doesn't come with the downsides of increased communication risk and schedule risk (time spent debating and agreeing). However, power is concentrated in the hands of the BDFL, for better or worse. In open source projects there is a check on this power: the community is free to fork the open source project away from its out-of-control dictator and take a different path. An example of this happening is MariaDB, which was forked from MySQL after Oracle (who sell competing proprietary database software) bought Sun Microsystems and took control of the project.

Problems Of Coordination / Example Threats

Let's unpack this idea, and review some classic problems of coordination, none of which can be addressed without good communication. Here are some examples:

1. Merging Data, Processes and Ideas

If you are familiar with the source code control system, Git, you will know that this is a distributed version control system. That means that two or more people can propose changes to the same files without knowing about each other. This means that at some later time, Git then has to merge (or reconcile) these changes together. Git is very good at doing this automatically, but sometimes different people can independently change the same lines of code and these will have to be merged manually. In this case, a human arbitrator "resolves" the difference, either by combining the two changes or picking a winner.

Threat: Two teams work on overlapping functionality in parallel, leading to integration conflicts.
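
To make the merge decision concrete, here is a minimal sketch of a line-by-line three-way merge. It is not Git's actual algorithm: the function name `three_way_merge` is hypothetical, and it assumes all three versions have the same number of lines (edits change lines but never add or remove them). It only illustrates when a merge can be automatic and when a human arbitrator is needed.

```python
def three_way_merge(base, ours, theirs):
    """Merge two edited copies of `base`, line by line.

    Simplifying assumption (not how Git works internally): all three
    versions have the same number of lines.
    Returns (merged_lines, conflict_line_indexes).
    """
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this sketch only handles same-length versions")

    merged, conflicts = [], []
    for i, (b, o, t) in enumerate(zip(base, ours, theirs)):
        if o == t:
            merged.append(o)        # both sides agree (or neither changed)
        elif o == b:
            merged.append(t)        # only "theirs" changed this line
        elif t == b:
            merged.append(o)        # only "ours" changed this line
        else:
            merged.append(None)     # both changed it differently
            conflicts.append(i)     # a human has to pick or combine
    return merged, conflicts
```

Changes to different lines merge automatically; two different changes to the same line are reported as a conflict rather than silently picking a winner.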

2. Consensus

Group decisions (as in elections) are often made by voting. But holding a vote is itself a coordination issue and requires that everyone has been told the rules:

Where will the vote be held?
How long do you provide for the vote?
What do you do about absentees?
What if people change their minds in the light of new information?
How do you ensure everyone has enough information to make a good decision?

Threat: Coordination issues arise in situations where communication is limited, ill-specified or there is insufficient visibility.
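
The questions above are really the rules of a protocol, and writing them down removes ambiguity. The sketch below is one possible set of rules, all of them assumptions for illustration: absentees count against a quorum, a voter's latest ballot replaces any earlier one (so people can change their minds), and a tie produces no decision. The name `tally` and its parameters are hypothetical.

```python
from collections import Counter

def tally(ballots, electorate, quorum=0.5):
    """Decide a vote under explicit, pre-agreed rules.

    ballots:    list of (voter, choice) pairs in arrival order.
    electorate: set of everyone entitled to vote.
    quorum:     fraction of the electorate that must have voted.
    Returns the winning choice, or None if there is no valid decision.
    """
    latest = {}
    for voter, choice in ballots:
        if voter in electorate:       # ignore ballots from outsiders
            latest[voter] = choice    # a later ballot replaces an earlier one

    if not latest or len(latest) < quorum * len(electorate):
        return None                   # too many absentees: no decision

    counts = Counter(latest.values()).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None                   # tie: no decision
    return counts[0][0]
```

Different groups will want different rules; the point is only that each bullet above has to be answered somewhere before the vote starts.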

3. Factions and Silos

Sometimes, it's hard to coordinate large groups at the same time and "factions" can occur. That the world isn't a single big country is probably partly a testament to this: countries are frequently separated by geographic features that prevent the easy flow of communication (and force). We can also see this in distributed systems, with the "split brain" problem. This is where a subset of the total system becomes disconnected (usually due to a network failure) and you end up with two smaller networks with different knowledge. We'll address this in more depth later.

Threat: Larger coordinating units break down and work independently, perhaps along cultural, geographic or functional boundaries.
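
One common mitigation for split brain in distributed systems (an illustration, not something this section prescribes) is a majority quorum: a partition may only act if it can see strictly more than half of the cluster. The helper name `has_quorum` is hypothetical.

```python
def has_quorum(visible_nodes, cluster_size):
    """True if this partition can see a strict majority of the cluster.

    At most one partition can ever satisfy this, so two disconnected
    halves cannot both keep making decisions independently.
    """
    return len(set(visible_nodes)) > cluster_size // 2


# A 5-node cluster splits into {a, b, c} and {d, e}.
majority_side = has_quorum({"a", "b", "c"}, cluster_size=5)   # can carry on
minority_side = has_quorum({"d", "e"}, cluster_size=5)        # must stop

# An even split of a 4-node cluster leaves neither side with a majority.
even_split = has_quorum({"a", "b"}, cluster_size=4)
```

The trade-off is that the minority side (or both sides, in an even split) stops making progress until the network heals.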

4. Resource Allocation and Contention

Ensuring that the right people are doing the right work, or the right resources are given to the right people is a coordination issue. On a grand scale we have Logistics and Economic Systems. On a small scale the office's room booking system solves the coordination issue of who gets a meeting room using a first-come-first-served booking algorithm.
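
A first-come-first-served booking rule like the one described above can be sketched in a few lines. The class `RoomBookings` and its method names are hypothetical, and times are plain numbers (for example, hours) to keep the example small.

```python
class RoomBookings:
    """First-come-first-served room booking: the earliest request wins."""

    def __init__(self):
        self._bookings = {}   # room name -> list of (start, end, who)

    def book(self, room, start, end, who):
        """Try to reserve `room` for [start, end). Returns True on success."""
        if start >= end:
            raise ValueError("start must be before end")
        slots = self._bookings.setdefault(room, [])
        for s, e, _ in slots:
            if start < e and s < end:   # the two time ranges overlap
                return False            # someone got there first
        slots.append((start, end, who))
        return True
```

The algorithm is simple precisely because it ignores priority: whoever asks first gets the room, regardless of who needs it more.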

Threat:

5. Deadlock

Deadlock refers to a situation in which multiple parallel processes are running, but processing stops and no-one can make progress because the resources each process needs are being reserved by another process. This is a specific issue in Resource Allocation, but it's one we're familiar with in the software industry. Compare with Gridlock, where traffic can't move because other traffic is already occupying the space it wants to move into.

Threat: Coordination issues involving time need to be carefully thought through to avoid deadlocks.
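
As an illustration, consider two threads moving money in opposite directions between two accounts. If each thread locks its source account first, each can end up holding one lock while waiting forever for the other. A standard way to avoid this is to agree a global lock order. The sketch below assumes hypothetical `Account` and `transfer` names and orders locks by account id.

```python
import threading

class Account:
    def __init__(self, account_id, balance):
        self.account_id = account_id
        self.balance = balance
        self.lock = threading.Lock()

def transfer(src, dst, amount):
    """Move `amount` from src to dst without risking deadlock."""
    if src is dst:
        return  # nothing to do, and locking the same lock twice would hang
    # Always take the lock of the lower-numbered account first, so two
    # opposite transfers can never each hold one lock and wait for the other.
    first, second = sorted((src, dst), key=lambda a: a.account_id)
    with first.lock, second.lock:
        src.balance -= amount
        dst.balance += amount

def run_demo(rounds=1000):
    """Run opposite transfers concurrently; returns the final balances."""
    a, b = Account(1, 100), Account(2, 100)
    t1 = threading.Thread(target=lambda: [transfer(a, b, 1) for _ in range(rounds)])
    t2 = threading.Thread(target=lambda: [transfer(b, a, 1) for _ in range(rounds)])
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return a.balance, b.balance
```

Because both threads agree on the order in which resources are reserved, the demo always finishes, and the equal and opposite transfers leave both balances where they started.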

6. Race Conditions

Race Conditions are where we can't be sure of the result of a calculation, because it is dependent on the ordering of events within a system. For example, two separate threads writing the same memory at the same time (one ignoring and over-writing the work of the other) is a race.

Threat: Coordination issues involving time need to be carefully thought through to avoid races.
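
The lost-update race described above can be sketched with a shared counter. The names `SharedCounter` and `hammer` are hypothetical. The unsafe method separates the read from the write, so another thread can write in between and have its work overwritten; the safe method uses a lock so the read and write happen as one step.

```python
import threading

class SharedCounter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def unsafe_increment(self):
        current = self.value        # read
        self.value = current + 1    # write: may overwrite another thread's update

    def safe_increment(self):
        with self._lock:            # only one thread at a time in here
            self.value += 1

def hammer(method_name, threads=4, per_thread=10_000):
    """Call the named increment method from several threads at once."""
    counter = SharedCounter()

    def work():
        increment = getattr(counter, method_name)
        for _ in range(per_thread):
            increment()

    workers = [threading.Thread(target=work) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return counter.value
```

`hammer("safe_increment")` always returns `threads * per_thread`. `hammer("unsafe_increment")` may return less, and how much less depends on thread scheduling, which is exactly what makes races hard to reproduce.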

7. Scaling

Amdahl's Law and Gunther's Universal Scalability Law both draw attention to the fact that the more agents need to coordinate, the more time needs to be spent on coordination. These laws were originally drawn from observations on computer hardware, but they apply generally to problems of coordination. While Amdahl's Law shows the diminishing returns of adding extra agents, Gunther's Law goes further to model how performance can get worse with extra agents involved - something we see when our computers thrash or when roads get really busy. This also explains Brooks's Law - "Adding manpower to a late software project makes it later."

Threat: The more agents involved in coordinating, the harder and more time consuming it becomes.
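
Both laws have simple closed forms, sketched below. The function and parameter names are our own; the coefficient values in the usage note are made up for illustration and not measured from any real system.

```python
def amdahl_speedup(n, parallel_fraction):
    """Amdahl's Law: speedup from n agents when only `parallel_fraction`
    of the work can be shared out; the rest stays serial."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)

def usl_capacity(n, contention, coherency):
    """Gunther's Universal Scalability Law: relative capacity of n agents.

    contention: cost of queueing for shared resources (grows with n).
    coherency:  cost of keeping agents consistent with each other
                (grows with the number of pairs, n * (n - 1)).
    """
    return n / (1.0 + contention * (n - 1) + coherency * n * (n - 1))
```

With `parallel_fraction=0.9`, `amdahl_speedup` can never exceed 10 however many agents are added, which is the diminishing return. With illustrative values `contention=0.05` and `coherency=0.01`, `usl_capacity` peaks at around ten agents and then falls, so `usl_capacity(50, 0.05, 0.01)` is lower than `usl_capacity(10, 0.05, 0.01)`: adding agents has made things worse.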

Anecdote Corner

Coordination Risk generally focuses on the problems inherent in trying to get more coordination. But at the other end of the spectrum, crypto-currency systems like Bitcoin are predicated on the idea that the participants of the system are not coordinating but are competing - and this keeps the currency running.

However, if participants coordinated, they could perform what is known as a 51% Attack - effectively taking control of the currency via a majority share of the activity. This happened in 2019 when two mining conglomerates banded together to reorganise the blockchain and change the transaction history of Bitcoin Cash, a fork of Bitcoin. Although this could have been done for nefarious purposes, they actually coordinated in order to fix some erroneous transaction state for the good of the network.

This was in their interests, as fixing these issues increased the value of, and trust in, Bitcoin Cash and therefore the value of the mining conglomerates' holdings too. But it could have easily gone the other way, with a nefarious party stealing from the network. That it didn't happen is down to the economic incentives of the miners involved: they don't want to damage the reputation and therefore the value of the currency that they are mining.

The fact that mining consortia needed to band together to fix issues in the network demonstrates a central issue with a distributed system like Bitcoin: as the protocol is designed on the basis of competition, change is very hard to coordinate and effect.
