Global Web Service Blackout: Incident Review

Global Failure

On Sunday morning, August 30, 2020, starting at approximately 10:00 GMT, numerous web services suddenly became unavailable to users worldwide due to a major internet outage.

The SIM-Networks monitoring system detected the connectivity issue as soon as it began. Automatic failover to backup links did not rectify the situation. Soon, we started receiving reports of similar issues at many other providers, along with alerts from our customers.

The problem occurred on the side of the Tier 1 internet provider CenturyLink (which acquired Level 3 in 2017). Over the course of the day, the worldwide service disruption gradually subsided.

Please find below a brief review of the incident.

CenturyLink and the domino effect

According to early reports from a growing number of sources (Qrator, Post-Gazette, Twitter, Techquila, CNN, ThousandEyes, etc.), on August 30, 2020, routing issues that arose on the side of CenturyLink/Level3, one of the largest ISPs and Internet bandwidth providers in the world, caused cascading outages of numerous popular services, such as Cloudflare, Hulu, the PlayStation Network, Xbox Live, Feedly, Discord and many others. A web resource availability monitoring service also reported that, besides CenturyLink and the services above, many other global service providers experienced critical connectivity issues on that date. Among them were:

  • Google
  • Facebook
  • Twitter
  • YouTube
  • Netflix
  • Instagram
  • TikTok
  • AT&T
  • Spectrum
  • Comcast
  • Cox

and the list goes on. You can find details and statistics on the incident in Cloudflare’s «Analysis of Today’s CenturyLink/Level(3) Outage».

What happened?

"Today we saw a widespread Internet outage online that impacted multiple providers. Cloudflare's automated systems detected the problem and routed around them, but the extent of the problem required manual intervention as well."

- John Graham-Cumming,
Cloudflare’s chief technology officer

Graham-Cumming stated that CenturyLink was primarily responsible for the outage, which rendered Cloudflare and many of its customers effectively unavailable.

CenturyLink confirmed that the outage impacted content delivery networks (CDNs) and stated that "all services had been restored as of 15:12 GMT".

The issue originated from one of CenturyLink’s data centers in Mississauga, Canada, due to an offending BGP FlowSpec announcement. In the incident summary, CenturyLink provided the following details:

“Summary: On August 30, 2020 10:04 GMT, CenturyLink identified an issue to be affecting users across multiple markets. The IP Network Operations Center (NOC) was engaged, and initial research identified that an offending flowspec announcement prevented Border Gateway Protocol (BGP) from establishing across multiple elements throughout the CenturyLink Network. The IP NOC deployed a global configuration change to block the offending flowspec announcement, which allowed BGP to begin to correctly establish. As the change propagated through the network, the IP NOC observed all associated service affecting alarms clearing and services returning to a stable state.”

BGP (Border Gateway Protocol) is the Internet protocol that governs how data packets are routed between networks: routers use it to exchange information on which routes are available and reachable.

FlowSpec (or BGP flow specification) is a BGP extension designed to propagate traffic filtering and security rules to many peer BGP routers at once.

With the aid of FlowSpec, participants of the global network can use BGP routers to propagate firewall rules across their own networks. Announcing FlowSpec rules enables internet service providers to respond almost immediately to security threats, such as DDoS attacks and BGP hijacks.
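To make the idea more concrete, here is a minimal Python sketch of what a FlowSpec-style rule conceptually carries: a set of match criteria plus an action. The names `Packet`, `FlowRule` and `apply_rules`, as well as the example prefix and port, are hypothetical and chosen only for illustration. Real FlowSpec rules are encoded inside BGP messages and enforced by router hardware, not by code like this.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass(frozen=True)
class Packet:
    dst_ip: str
    protocol: str   # e.g. "udp" or "tcp"
    dst_port: int


@dataclass(frozen=True)
class FlowRule:
    """Hypothetical, simplified stand-in for a FlowSpec rule."""
    dst_prefix: str                  # match: destination prefix
    protocol: Optional[str] = None   # match: protocol (None means any)
    dst_port: Optional[int] = None   # match: destination port (None means any)
    action: str = "discard"          # what to do with matching traffic

    def matches(self, pkt: Packet) -> bool:
        if ip_address(pkt.dst_ip) not in ip_network(self.dst_prefix):
            return False
        if self.protocol is not None and pkt.protocol != self.protocol:
            return False
        if self.dst_port is not None and pkt.dst_port != self.dst_port:
            return False
        return True


def apply_rules(rules, pkt):
    """Return the action of the first matching rule, or "forward" if none match."""
    for rule in rules:
        if rule.matches(pkt):
            return rule.action
    return "forward"


# Illustrative rule: discard UDP traffic to port 53 aimed at a victim prefix.
rules = [FlowRule(dst_prefix="203.0.113.0/24", protocol="udp", dst_port=53)]
attack = Packet("203.0.113.10", "udp", 53)
normal = Packet("203.0.113.10", "tcp", 443)
```

In this sketch, `apply_rules(rules, attack)` yields the rule's action while `apply_rules(rules, normal)` falls through to plain forwarding. The point is that a single announced rule changes how every receiving router treats matching traffic, which is why a faulty rule can spread just as quickly as a useful one.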

Based on statistics showing a surge in BGP updates, Cloudflare assumed that CenturyLink had probably detected a cyberattack or some other abuse aimed at its network. To counter it, the system announced a set of new routes, which were then dropped because of the offending FlowSpec rule. As a result, some of CenturyLink’s routers terminated their BGP sessions while others kept transmitting invalid routes to other connected Tier 1 providers. This caused further cascading failures in the networks of those providers.

The vicious circle can be illustrated as follows:

1. The router receives a BGP update with routes and rules, including the offending rule that blocks BGP itself.

2. The router accepts the offending rule and terminates the BGP session.

3. Because rules received via BGP do not persist after the session terminates, the offending rule disappears and the router tries to re-establish the BGP session.

4. Step 1 then repeats, and the same happens on all CenturyLink routers.
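The loop described in the steps above can be replayed with a small Python simulation. This is only a toy model under stated assumptions: the `Router` class, the `simulate` function and the rule names `"drop-udp-53"` and `"block-bgp"` are hypothetical, and the `blocked_after` parameter stands in for the global configuration change that the CenturyLink NOC deployed to block the offending announcement.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    name: str
    session_up: bool = False
    active_rules: set = field(default_factory=set)
    resets: int = 0

    def establish_session(self):
        self.session_up = True

    def receive_update(self, rules, offending):
        # Steps 1-2: the router accepts every rule carried in the update.
        self.active_rules |= set(rules)
        if self.active_rules & offending:
            # The offending rule blocks BGP itself, so the session drops.
            self.session_up = False
            self.resets += 1
            # Step 3: rules learned over BGP vanish together with the session.
            self.active_rules.clear()


def simulate(router, update, offending, max_rounds, blocked_after=None):
    """Replay the loop; from round `blocked_after` on, filter out offending rules.

    Returns the round in which the session stayed up, or None if it never did.
    """
    for round_no in range(max_rounds):
        if blocked_after is not None and round_no >= blocked_after:
            update = [r for r in update if r not in offending]
        router.establish_session()
        router.receive_update(update, offending)
        if router.session_up:
            return round_no
    return None


update = ["drop-udp-53", "block-bgp"]   # hypothetical rule names
offending = {"block-bgp"}

stuck = Router("r1")
stuck_result = simulate(stuck, update, offending, max_rounds=5)

fixed = Router("r2")
fixed_result = simulate(fixed, update, offending, max_rounds=5, blocked_after=3)
```

Without a filter, the router in this model resets its session in every round and never stabilises. Once the offending rule is filtered out, the next session attempt succeeds and the remaining rules stay in place, which mirrors the recovery described in CenturyLink's summary.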

Tier 1 operators (or top-level ISPs) are global internet backbone network operators that reach the rest of the Internet through peering connections with each other. Tier 1 operators do not have to purchase IP transit, so they enjoy free access to the entire Internet. Tier 2 operators rely on both peering connections and purchased transit. Tier 3 operators rely on purchased transit only.

The critical situation forced CenturyLink to take an unprecedented step: it asked all other Tier 1 internet providers to disconnect from CenturyLink and ignore its outgoing traffic. This measure enabled CenturyLink to amend its global configuration and clean up the BGP route tables within about 5 hours.
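The three-tier distinction above boils down to two yes/no questions, which the following Python sketch spells out. The function name `classify_tier` and its parameters are hypothetical, and the model is a deliberate simplification: in practice the tier labels are informal and not assigned by any authority.

```python
def classify_tier(has_peering: bool, buys_transit: bool) -> str:
    """Map the two properties from the definition above to a tier label."""
    if has_peering and not buys_transit:
        return "Tier 1"   # peering only, no purchased transit
    if has_peering and buys_transit:
        return "Tier 2"   # both peering and purchased transit
    if buys_transit:
        return "Tier 3"   # purchased transit only
    return "unconnected"  # neither: not reachable from the Internet
```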
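As a quick sanity check of the five-hour figure, the duration can be computed from the two timestamps in CenturyLink's statements quoted earlier (issue identified at 10:04 GMT, services restored as of 15:12 GMT). The sketch below assumes both times fall on August 30, 2020 and treats GMT as UTC.

```python
from datetime import datetime, timezone

# Timestamps taken from CenturyLink's statements; GMT is treated as UTC here.
identified = datetime(2020, 8, 30, 10, 4, tzinfo=timezone.utc)
restored = datetime(2020, 8, 30, 15, 12, tzinfo=timezone.utc)

duration = restored - identified
hours, remainder = divmod(int(duration.total_seconds()), 3600)
minutes = remainder // 60

print(f"{hours} h {minutes} min")  # prints "5 h 8 min"
```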

Web service dropout worldwide

The service availability data by geographical location for August 30, 2020, gathered by Qrator's BGP collectors, reflects the global scale of the incident:

[Figure: Global Failure]

This incident, which dramatically disrupted the operations of numerous organizations and businesses worldwide, is an important warning. It could have had far more severe consequences had it interfered with critical infrastructure (power grids, water supply and treatment, medical care, transportation, etc.).

New information keeps coming in, adding details to the picture. We are waiting for a post-mortem from CenturyLink itself. The final assessment and conclusions are still to be made.

Author: Pavlo Bereza
