Huge CenturyLink Outage Shows Risk Lies in the Details

I am not a fan of risk registers for telcos. It is good to think about risks, and to implement plans to mitigate risks. The problem with telcos is that very many of their risks are so interwoven with the excruciatingly specific detail of how technology works that the risk register is either going to be impossibly long or else it will just be an arbitrary subset of all the most significant risks. To give an example, I believe most of us would agree that it would be a major risk to put someone in a position where they could make a single mistake that stops the entire network from working. But do you think any network operator has this risk near the top of their register?

"Incorrect firewall configuration forces us to ask other Tier 1 networks to disable all peering with our network."

Perhaps they should, because that is what happened at CenturyLink this weekend. They crashed their network so hard that it caused a 3.5 percent drop in global internet traffic, according to the very knowledgeable people who run Cloudflare. It also caused me some bemusement, because I rarely wake to find that corporations have left me warnings about portions of the internet becoming unreachable overnight.

If you want to know the details of what caused the CenturyLink outage then you should take a look at Cloudflare’s analysis, which seemingly serves as the basis of every news article on this topic. Put simply, a flaw in an instruction to reconfigure all CenturyLink’s firewalls meant all their servers blocked the receipt of new configuration rules, creating a situation which was difficult for the network to recover from. CenturyLink’s team suddenly had to clear all the flawed instructions from across their US network, and they made that simpler by telling other Tier 1 networks to temporarily stop peering with CenturyLink, so the servers did not have to deal with additional configuration instructions from peers as well.
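The failure mode described above, a configuration change that blocks the very channel used to deliver later configuration changes, is the kind of thing a pre-deployment check could catch. The following is only a minimal sketch of such a check, not a description of CenturyLink's tooling. The `Rule` model, the first-match evaluation, the default-allow behaviour and the choice of TCP port 179 (the port used by BGP) as the control channel are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Rule:
    """Hypothetical firewall rule; not any vendor's real format."""
    action: str            # "allow" or "drop"
    protocol: str          # e.g. "tcp"
    port: Optional[int]    # None means the rule matches any port


# Assumption: configuration updates arrive over this protocol and port.
CONTROL_CHANNEL = ("tcp", 179)


def first_match(rules: List[Rule], protocol: str, port: int) -> Optional[Rule]:
    """Return the first rule matching the traffic, assuming first-match wins."""
    for rule in rules:
        if rule.protocol == protocol and rule.port in (None, port):
            return rule
    return None


def blocks_control_channel(rules: List[Rule]) -> bool:
    """True if the rule set would drop traffic on the control channel."""
    protocol, port = CONTROL_CHANNEL
    match = first_match(rules, protocol, port)
    # Assumption: traffic that matches no rule is allowed.
    return match is not None and match.action == "drop"


def safe_to_deploy(rules: List[Rule]) -> bool:
    """Refuse any change that would cut off the path for future changes."""
    return not blocks_control_channel(rules)


if __name__ == "__main__":
    flawed = [Rule("drop", "tcp", None)]
    guarded = [Rule("allow", "tcp", 179), Rule("drop", "tcp", None)]
    print("flawed rule set safe to deploy:", safe_to_deploy(flawed))
    print("guarded rule set safe to deploy:", safe_to_deploy(guarded))
```

A check this simple would not cover every way a network can lock itself out, which is rather the point of this article: the risk lives in details that only the engineers know well enough to encode.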

For five hours CenturyLink’s network carried almost no traffic. Those who could switch to other networks did so. Customers whose only connection to the internet was via CenturyLink had just one consolation: CenturyLink said they were sorry.

Managing risk is a team effort, and this outage exemplifies why a Chief Risk Officer could never have all the knowledge to take hands-on responsibility for everything that can go wrong at a technical level. But it also illustrates why telcos need somebody independent of operational teams to evaluate risk. It may not have been on the risk register before, but this week some operators will be thinking about whether a single mistake could take their networks down too.

Eric Priezkalns
Eric is the Editor of Commsrisk.

Eric is also the Chief Executive of the Risk & Assurance Group (RAG), a global association of professionals working in risk management and business assurance for communications providers.

Previously Eric was Director of Risk Management for Qatar Telecom and he has worked with Cable & Wireless, T‑Mobile, Sky, Worldcom and other telcos. He was lead author of Revenue Assurance: Expert Opinions for Communications Providers, published by CRC Press. He is a qualified chartered accountant, with degrees in information systems, and in mathematics and philosophy.