‘We deeply apologize for the impact this outage has had,’ Google Cloud says.
Google Cloud and Cloudflare have apologized for the June 12 cloud outage that also appeared to take down multiple popular websites and applications, including Spotify and Discord.
The Mountain View, Calif.-based cloud and artificial intelligence giant said that it will address the failure of Google Cloud’s customer monitoring infrastructure during the outage, which left users without incident signals or a way to understand how the outage was affecting their businesses.
“We deeply apologize for the impact this outage has had,” according to the Google incident report. “Google Cloud customers and their users trust their businesses to Google, and we will do better. We apologize for the impact this has had not only on our customers’ businesses and their users but also on the trust of our systems. We are committed to making improvements to help avoid outages like this moving forward.”
[RELATED: Cloudflare Blames Google Cloud For Mass Services Outages]
Google Cloud Outage
Dawn Sizer, CEO of Mechanicsburg, Pa.-based solution provider 3rd Element Consulting, said that her clients noticed some commercial websites weren’t working but did not experience any interference with their business.
“A blip,” Sizer said. “That’s what it boiled down to.”
Thomas Kurian, CEO of Google Cloud, posted on social media network X that “we regret the disruption this caused our customers.”
Cloudflare also experienced broad service outages on June 12 and published a report on what happened. Although a Cloudflare spokesperson pointed multiple media outlets to Google Cloud as the source of the failure during the outage, the official Cloudflare report says only that the infrastructure that failed is backed in part “by a third-party cloud provider” that experienced an outage the same day.
The vendor confirmed that the outage was not caused by an attack or other security event and that no data was lost. Its Magic Transit, Magic WAN, DNS, Cache, proxy, WAF and related services didn’t experience direct impacts.
“We’re deeply sorry for this outage: this was a failure on our part,” according to Cloudflare’s report. “While the proximate cause (or trigger) for this outage was a third-party vendor failure, we are ultimately responsible for our chosen dependencies and how we choose to architect around them. … This was a serious outage, and we understand that organizations and institutions that are large and small depend on us to protect and/or run their websites, applications, zero trust and network infrastructure. Again we are deeply sorry for the impact and are working diligently to improve our service resiliency.”
Google Cloud Crash Loop
Google’s report on the incident puts the start time at 10:51 a.m. Pacific on June 12 and the end time at 6:18 p.m. the same day.
The issue dates back to a new feature added to Service Control, the core binary in the check system that makes sure application programming interface (API) requests to Google Cloud endpoints are authorized and have the appropriate policies.
On June 12, Google inserted a policy change into the regional Spanner tables used for policies, and unintended blank fields in that policy data sent Service Control into a crash loop. Google site reliability engineers (SREs) found the root cause within 10 minutes and turned off the policy serving path. The rollout that disabled the path completed within 40 minutes of the incident start time, and Google started seeing recovery across its regions.
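Google’s report doesn’t include the failing code path, but a minimal Go sketch, with a hypothetical Policy type and checkRequest function, shows how an unvalidated blank field in replicated policy data can become a crash loop: every restart re-reads the same bad row and fails again.

```go
package main

import "fmt"

// Policy is a hypothetical stand-in for a policy row read from the
// regional Spanner tables; Quota can arrive blank (nil) after a bad write.
type Policy struct {
	Name  string
	Quota *int
}

// checkRequest assumes Quota is always populated. With a blank field the
// dereference panics; the task restarts, re-reads the same bad row and
// panics again, producing a crash loop.
func checkRequest(p Policy) bool {
	return *p.Quota > 0 // panics when Quota is nil
}

func main() {
	bad := Policy{Name: "projects/example"} // Quota left blank
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("task crashed:", r)
		}
	}()
	checkRequest(bad)
}
```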
In larger regions, such as us-central1, which includes Iowa, Service Control task restarts overloaded the underlying infrastructure because Service Control lacked the randomized exponential backoff needed to avoid the problem. The us-central1 region took almost three hours to fully resolve, and Google throttled task creation to minimize the infrastructure impact. Because the Cloud Service Health infrastructure itself failed during the outage, the first public incident report didn’t appear on Google’s status page for about an hour.
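Google’s report doesn’t show Service Control’s actual retry logic, but randomized exponential backoff is a standard pattern, and a short illustrative Go sketch (the retryWithJitter helper and its delay values are assumptions, not Google’s) shows how jittered retries keep a fleet of restarting tasks from hitting a backend in lockstep:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// retryWithJitter retries op with capped, randomized exponential backoff so
// that thousands of restarting tasks don't retry the backend in lockstep.
func retryWithJitter(op func() error, attempts int) error {
	base := 100 * time.Millisecond
	maxDelay := 10 * time.Second
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		backoff := base << i // exponential growth: 100ms, 200ms, 400ms, ...
		if backoff > maxDelay {
			backoff = maxDelay
		}
		sleep := time.Duration(rand.Int63n(int64(backoff))) // full jitter
		fmt.Printf("attempt %d failed (%v); retrying in %v\n", i+1, err, sleep)
		time.Sleep(sleep)
	}
	return err
}

func main() {
	_ = retryWithJitter(func() error { return fmt.Errorf("backend overloaded") }, 5)
}
```

The key detail is the random sleep: without it, every recovering task retries at the same moment and the recovery itself becomes a load spike.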
Moving forward, Google plans to modularize Service Control’s architecture to isolate its functionality and “fail open,” that is, default to an accessible state if a future failure happens.
Google will also audit all systems that consume globally replicated data; require changes to critical binaries to be protected by feature flags and disabled by default; improve static analysis and testing practices so that systems correctly handle errors and fail open when needed; ensure randomized exponential backoff across its systems; improve automated and human external communications to keep users informed; and make sure monitoring and communication infrastructure remains operational even when Google Cloud and its primary monitoring products are down, according to the vendor.
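The report doesn’t spell out what the hardened check will look like, but the pattern Google describes, new logic behind a default-off feature flag that fails open on bad data, can be sketched roughly as follows (the quotaCheckEnabled flag, lookupPolicy and authorize functions are hypothetical names, not Google code):

```go
package main

import (
	"errors"
	"fmt"
)

// quotaCheckEnabled is a hypothetical feature flag: new policy logic ships
// disabled by default and is enabled gradually, region by region.
var quotaCheckEnabled = false

var errBlankPolicy = errors.New("policy row has blank fields")

// lookupPolicy stands in for the regional policy read; here it simulates
// the kind of bad data that triggered the June 12 crash loop.
func lookupPolicy(project string) (int, error) {
	return 0, errBlankPolicy
}

// authorize fails open: if the flagged check can't run, the API request is
// allowed through instead of crashing or rejecting all traffic.
func authorize(project string) bool {
	if !quotaCheckEnabled {
		return true // flag off by default: legacy behavior only
	}
	quota, err := lookupPolicy(project)
	if err != nil {
		fmt.Println("policy check degraded, failing open:", err)
		return true // fail open rather than crash-loop or deny everything
	}
	return quota > 0
}

func main() {
	fmt.Println("request allowed:", authorize("projects/example"))
}
```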
Cloudflare Works To Improve Resiliency
Cloudflare, meanwhile, said in its own report that its Workers KV key-value data store saw the failure of underlying storage infrastructure that is backed in part by Google Cloud. Workers KV “is a critical dependency for many Cloudflare products and relied upon for configuration, authentication and asset delivery across the affected services.”
Cloudflare put its outage at about two-and-a-half hours. Workers KV saw about 90 percent of requests fail, although data stored in Workers KV was not affected.
The incident started June 12 at 17:52 Coordinated Universal Time (UTC), and the impact ended the same day at 20:28. Moving forward, Cloudflare plans to remove singular dependencies on third-party storage infrastructure and improve recovery.
Cloudflare will improve Workers KV storage infrastructure redundancy, implement “short-term blast radius remediations” to make each product resilient to a loss of service caused by a single point of failure, and implement tooling to progressively re-enable namespaces during storage infrastructure incidents.
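Cloudflare hasn’t published the redesigned read path, but the general idea of removing a singular storage dependency can be sketched as a fallback chain (readPrimary, readSecondary and lastKnownGood here are hypothetical stand-ins, not Cloudflare APIs):

```go
package main

import (
	"errors"
	"fmt"
)

var errProviderDown = errors.New("primary storage provider unavailable")

// readPrimary and readSecondary stand in for two independent storage
// backends; in this sketch the primary is down, as it was on June 12.
func readPrimary(key string) (string, error)   { return "", errProviderDown }
func readSecondary(key string) (string, error) { return "config-v42", nil }

// lastKnownGood is a locally cached copy of critical configuration.
var lastKnownGood = map[string]string{"auth-keys": "config-v41"}

// getConfig avoids a single point of failure: try the primary store, then an
// independent replica, then fall back to the last value successfully read.
func getConfig(key string) (string, error) {
	if v, err := readPrimary(key); err == nil {
		return v, nil
	}
	if v, err := readSecondary(key); err == nil {
		lastKnownGood[key] = v
		return v, nil
	}
	if v, ok := lastKnownGood[key]; ok {
		return v, nil
	}
	return "", fmt.Errorf("no copy of %q available", key)
}

func main() {
	v, err := getConfig("auth-keys")
	fmt.Println(v, err)
}
```

For this to help, the fallback source has to be independent of the primary provider, which is the gap Cloudflare says it is working to close.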
“This list is not exhaustive,” Cloudflare said in the report. “Our teams continue to revisit design decisions and assess the infrastructure changes we need to make in both the near (immediate) term and long term to mitigate the incidents like this going forward.”