Google explains what caused Monday’s multi-service outage


Google began the week with a major outage that took down Gmail, Drive, and all other Workspace apps. As promised, Google has now provided a detailed explanation of the outage and says it will take steps to prevent future incidents.

At a high level, the issue stemmed from ongoing work to update Google’s account authentication system. As that effort continued, parts of the previous system were “left in place.” Those old components erroneously reported usage as 0, and a grace period initially kept that error from having any impact.

That grace period eventually expired, and automated systems responded to the erroneous reading as if it were real. Since usage appeared to be 0, capacity for the identity management system was reduced. While safety checks were in place, they were not designed to cover this specific problem.
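
To make that chain of events concrete, here is a minimal Python sketch of how an automated quota adjuster with a grace period could behave. It is an illustration only, not Google's actual system: the names `QuotaPolicy`, `next_quota`, and `guarded_quota`, along with the 30-day grace period, the 1.5x headroom, and the 50% maximum drop, are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class QuotaPolicy:
    """Hypothetical automated quota adjuster (not Google's real system)."""
    grace_period_days: int   # how long enforcement of a new reading is deferred
    headroom: float = 1.5    # quota granted as a multiple of reported usage

    def next_quota(self, current_quota: float, reported_usage: float,
                   days_since_change: int) -> float:
        # Inside the grace period, the existing quota is left untouched.
        if days_since_change < self.grace_period_days:
            return current_quota
        # Once it expires, quota follows reported usage, even if the report is wrong.
        return reported_usage * self.headroom


def guarded_quota(current_quota: float, proposed_quota: float,
                  max_drop: float = 0.5) -> float:
    """Illustrative safety check: cap how far quota may fall in a single step."""
    floor = current_quota * (1 - max_drop)
    return max(proposed_quota, floor)


policy = QuotaPolicy(grace_period_days=30)

# Usage is wrongly reported as 0, but the grace period hides the problem.
during_grace = policy.next_quota(1000.0, 0.0, days_since_change=10)
print(during_grace)  # 1000.0

# After the grace period expires, the bad reading is acted on.
after_grace = policy.next_quota(1000.0, 0.0, days_since_change=30)
print(after_grace)  # 0.0

# A check that limits single-step reductions would soften the cut.
print(guarded_quota(1000.0, after_grace))  # 500.0
```

In this toy model the faulty reading does nothing for weeks and then removes all capacity at once, which matches the delayed failure described above. The guard at the end shows one kind of check that could limit such a drop; Google's explanation does not say its safeguards worked this way.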

The issue started affecting users at 3:47 a.m. PT, and engineers were alerted a minute later. “Workspace apps were down for the duration of the event” because they rely on the affected infrastructure to verify that users are logged in, authenticated, and authorized to view content like emails and documents.

The root cause and a possible fix were identified at 04:08, which led to quota enforcement being disabled in a single datacenter at 04:22. This rapidly improved the situation, and the same mitigation was applied to all datacenters at 04:27, which returned error rates to normal levels after 04:33.

The company plans to review, improve, and evaluate its systems to prevent similar issues. Google ended its explanation with an apology:

We apologize for the impact this event had on our customers and their businesses. We take any event that affects the availability and reliability of our customers seriously, especially events that spread across many regions.

The full technical explanation is available here.
