A report from the Federal Communications Commission, the US comms regulator, states that over 250 million calls from other telcos failed to reach subscribers of T-Mobile USA when the operator suffered a 12-hour outage on June 15th. At least 23,621 calls to 911 emergency services also failed during the outage.
The outage was initially caused by a mistake in the configuration of one of two routers installed that day. This led to a cascade of further problems, exacerbated by a previously undetected mistake in routing and an error in the software for registering mobile devices to the network. Engineers then made another mistake whilst troubleshooting the network’s problems, which caused an outage that had been confined to the Atlanta market to spread nationwide as handsets attempted to connect to Voice over Wi-Fi and VoLTE services. Handsets then switched to 3G and 2G, only to face further congestion as those networks struggled to cope. This led to the failure of 911 calls: although handsets do not need to register to complete an emergency call, the 3G and 2G nodes were so overwhelmed that they were unable to process some of those calls.
The FCC noted several deficiencies in T-Mobile’s approach to business continuity. Its first and most obvious recommendation was that operators should audit the resilience of their networks.
Network operators should periodically audit the physical and logical diversity called for by the design of their network segment(s) and take appropriate measures as needed. The router that dropped signaling traffic and precipitated this outage could never have provided functional diversity for the link that failed because the router was not provisioned to process the signaling traffic that the failed link carried. Further, T-Mobile could have prevented the outage if it had audited its network during the new router integration to ensure that the traffic destined for the failed link would redirect to a router that was able to pass it. If the backup route had operated as it was designed, a nationwide outage would likely not have occurred.
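The kind of audit described above can be illustrated with a short sketch. It is not taken from the FCC report or from T-Mobile's tooling; the link names, router names, traffic types and the `audit_backup_diversity` function are all hypothetical. The idea is simply to check, for every link, that the router designated as its backup is provisioned for every type of traffic the link carries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """A network link and the router expected to take over if it fails (hypothetical model)."""
    name: str
    traffic_types: frozenset  # kinds of traffic the link carries, e.g. {"signaling"}
    backup_router: str        # router that should absorb the traffic on failure


def audit_backup_diversity(links, router_capabilities):
    """Return (link name, missing traffic types) for each link whose backup
    router is not provisioned for everything the link carries."""
    findings = []
    for link in links:
        # A router absent from the inventory is treated as supporting nothing.
        supported = router_capabilities.get(link.backup_router, frozenset())
        missing = link.traffic_types - supported
        if missing:
            findings.append((link.name, sorted(missing)))
    return findings


# Illustrative inventory: router-b cannot process signaling traffic,
# so it offers no real diversity for a link that carries signaling.
links = [
    Link("atlanta-primary", frozenset({"signaling", "voice"}), "router-b"),
    Link("atlanta-secondary", frozenset({"voice"}), "router-a"),
]
router_capabilities = {
    "router-a": frozenset({"signaling", "voice"}),
    "router-b": frozenset({"voice"}),
}

findings = audit_backup_diversity(links, router_capabilities)
for link_name, missing in findings:
    print(f"{link_name}: backup cannot carry {', '.join(missing)}")
```

Running a check like this whenever a router is added or reconfigured would flag a backup path that exists on paper but cannot pass the traffic it is meant to protect.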
The report also recommended that upgrades and changes of procedure be validated in a test environment before they are attempted in the field.
The report with the FCC’s analysis and recommendations can be found on the FCC’s website.