I am not a fan of risk registers for telcos. It is good to think about risks, and to implement plans to mitigate risks. The problem with telcos is that very many of their risks are so interwoven with the excruciatingly specific detail of how technology works that the risk register is either going to be impossibly long or else it will just be an arbitrary subset of all the most significant risks. To give an example, I believe most of us would agree that it would be a major risk to put someone in a position where they could make a single mistake that stops the entire network from working. But do you think any network operator has this risk near the top of their register?
Incorrect firewall configuration forces us to ask other Tier 1 networks to disable all peering with our network.
Perhaps they should, because that is what happened at CenturyLink this weekend. They crashed their network so hard that it caused a 3.5 percent drop in global internet traffic, according to the very knowledgeable people who run Cloudflare. It also caused me some bemusement, because I rarely wake to find that corporations have left me warnings about portions of the internet becoming unreachable overnight.
If you want to know the details of what caused the CenturyLink outage then you should take a look at Cloudflare’s analysis, which seemingly serves as the basis of every news article on this topic. Put simply, a flaw in an instruction to reconfigure all of CenturyLink’s firewalls meant all their routers blocked the receipt of new configuration rules, creating a situation which was difficult for the network to recover from. CenturyLink’s team suddenly had to clear all the flawed instructions from across their US network, and they made that simpler by telling other Tier 1 networks to temporarily stop peering with CenturyLink, so the routers did not have to deal with additional configuration instructions from peers as well.
AS1299 have temporarily disabled all IPv4 peering with CenturyLink AS3356 on their request.
— Telia Carrier (@TeliaCarrier) August 30, 2020
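To make the failure mode concrete, here is a minimal sketch of one common safeguard against it: a staged rollout that applies a new filter rule to a small number of canary devices first, and only continues if those devices can still receive configuration updates. It is a toy model, not a description of CenturyLink's tooling. The names `BlockRule`, `Device`, `staged_rollout` and `CONTROL_PORT` are all hypothetical, and the idea that updates arrive on a single port is an assumption made purely for illustration.

```python
from dataclasses import dataclass, field

# Illustrative assumption: devices receive configuration updates on one port.
# 179 is the TCP port used by BGP; any port would do for this toy model.
CONTROL_PORT = 179


@dataclass(frozen=True)
class BlockRule:
    """A toy filter rule that drops traffic to an inclusive range of ports."""
    first_port: int
    last_port: int

    def blocks(self, port: int) -> bool:
        return self.first_port <= port <= self.last_port


@dataclass
class Device:
    """A hypothetical network device holding a list of active rules."""
    name: str
    rules: list = field(default_factory=list)

    def accepts_updates(self) -> bool:
        # The device can be reconfigured only if no rule blocks the control port.
        return not any(rule.blocks(CONTROL_PORT) for rule in self.rules)


def staged_rollout(rule: BlockRule, devices: list, canary_count: int = 1) -> bool:
    """Apply `rule` to a few canary devices, then to the rest only if the
    canaries can still receive updates. Returns True if the rule was rolled
    out everywhere, False if it was withdrawn from the canaries."""
    canaries, remainder = devices[:canary_count], devices[canary_count:]

    for device in canaries:
        device.rules.append(rule)

    if not all(device.accepts_updates() for device in canaries):
        # In this toy model we can simply remove the rule. In a real network
        # a locked-out device may need out-of-band access to be recovered.
        for device in canaries:
            device.rules.remove(rule)
        return False

    for device in remainder:
        device.rules.append(rule)
    return True
```

With a fleet of five such devices, `staged_rollout(BlockRule(8000, 8100), fleet)` would reach every device, while `staged_rollout(BlockRule(100, 200), fleet)` would stop at the canary, because that range covers the assumed control port. The point of the sketch is the shape of the control, not the detail: the mistake still happens, but it lands on one device instead of all of them.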
For five hours CenturyLink’s network carried almost no traffic. Those who could switch to other networks did so. And for those customers whose only connection to the internet was via CenturyLink, the only consolation was that CenturyLink said they were sorry.
We are able to confirm that all services impacted by today’s IP outage have been restored. We understand how important these services are to our customers, and we sincerely apologize for the impact this outage caused.
— CenturyLink (@CenturyLink) August 30, 2020
Managing risk is a team effort, and this outage exemplifies why a Chief Risk Officer could never have all the knowledge to take hands-on responsibility for everything that can go wrong at a technical level. But it also illustrates why telcos need somebody independent of operational teams to evaluate risk. It may not have been on the risk register before, but this week some operators will be thinking about whether a single mistake could take their networks down too.