Uncaught bug in Fastly software triggered global service outage, company says
Tuesday’s disruption of numerous popular websites using the services of US cloud computing firm Fastly has been traced to a software bug that slipped into a recent update and was triggered by a user.
The websites of multiple news outlets, the British government, and services like Amazon and Spotify were among those affected by the hour-long outage on Tuesday. Fastly, whose servers were the source of the problem, says it traced the issue to a specific software bug, which its quality control engineers had failed to identify and fix ahead of a May update.
The bug was triggered by a customer, who was not identified. The user made a “valid” configuration change on Tuesday, starting a chain reaction that “caused 85 percent of our network to return errors,” Fastly Vice President Nick Rockwell said in a blog post.
“Even though there were specific conditions that triggered this outage, we should have anticipated it. We provide mission critical services, and we treat any action that can cause service issues with the utmost sensitivity and priority,” the executive said.
Fastly said it detected the issue within a minute of it appearing and managed to restore 95% of its network within 49 minutes. A permanent software patch fixing the problem was ready for deployment around five hours later. Rockwell promised to conduct a full analysis of the situation and figure out “why we didn’t detect the bug during our software quality assurance and testing processes.”
The outage caused by the Fastly glitch was one of several such large-scale incidents to have happened in the last several years. In February 2017, a human error made by an Amazon employee during a debugging process led to a cascading server shutdown and disrupted its AWS services for hours. In July 2020, a large portion of Cloudflare services went down for about 30 minutes due to a configuration error in a segment of the backbone network connecting Newark and Chicago.