GitHub Reports Four Major Incidents Affecting Services in July 2024

James Ding  Aug 16, 2024 10:42  UTC 02:42


GitHub faced a challenging month in July 2024, with four major incidents leading to degraded performance across several of its services, according to The GitHub Blog.

Incident Breakdown

July 5 (lasting 97 minutes)

On July 5, from 16:31 to 18:08 UTC, GitHub's Webhooks service experienced performance degradation due to a configuration change that removed authentication from background job requests, causing these requests to be rejected. The incident led to delays in Webhooks deliveries, with an average delay of 24 minutes and a maximum of 71 minutes. A secondary issue from 18:21 to 21:14 UTC further affected GitHub Actions runs on pull requests, adding delays to job delivery.

To prevent future occurrences, GitHub has updated dashboards, improved health checks, and introduced new alerts for similar issues. The company is also working on better workload isolation to minimize the impact of such incidents.

July 13 (lasting 19 hours and 26 minutes)

On July 13, from 00:01 to 19:27 UTC, GitHub Copilot services were significantly degraded. The error rate for Copilot code completions reached 1.16%, while the error rate for GitHub Copilot Chat peaked at 63%. The issue was traced back to a resource cleanup job run by a partner service, which mistakenly targeted essential resources. GitHub mitigated the impact while those resources were being restored.

GitHub is now collaborating with partner services to implement safeguards against future incidents and enhance traffic rerouting processes for quicker mitigation.

July 16 (lasting 149 minutes)

On July 16, from 00:30 to 03:07 UTC, Copilot Chat was degraded and rejected all requests, with an error rate close to 100%. The issue was triggered during routine maintenance, when GitHub services were disconnected from a dependent service and then overwhelmed it as they reconnected.

To address this, GitHub is improving its reconnection and circuit-breaking logic to prevent similar disruptions in the future.
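GitHub has not published the details of these changes. As a rough illustration of the general techniques, the sketch below pairs exponential backoff with random jitter, so that clients disconnected at the same moment do not all reconnect at once, with a simple circuit breaker that pauses calls to a failing dependency. All names and parameters (backoff_delays, CircuitBreaker, base, cap, cooldown) are hypothetical and are not taken from GitHub's code.

```python
import random


def backoff_delays(attempts, base=0.5, cap=30.0, rng=None):
    """Return one randomized wait (in seconds) per reconnection attempt.

    Uses "full jitter": each wait is drawn uniformly from
    [0, min(cap, base * 2**attempt)], which spreads out reconnections
    from clients that lost their connection at the same time.
    """
    rng = rng or random.Random()
    return [rng.uniform(0.0, min(cap, base * 2 ** n)) for n in range(attempts)]


class CircuitBreaker:
    """Stop calling a failing dependency until a cool-down has passed."""

    def __init__(self, failure_threshold=3, cooldown=10.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self, now):
        # Closed circuit: always allow. Open circuit: allow trial calls
        # again only once the cool-down has elapsed.
        if self.opened_at is None:
            return True
        return now - self.opened_at >= self.cooldown

    def record(self, success, now):
        # A success closes the circuit and resets the failure count.
        if success:
            self.failures = 0
            self.opened_at = None
            return
        # Enough consecutive failures open (or re-open) the circuit.
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now
```

In this sketch the caller passes the current time explicitly, which keeps the breaker easy to test. A real client would typically read a monotonic clock and sleep for each delay before retrying.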

July 18 (lasting 231 minutes)

On July 18, starting at 22:38 UTC, network issues within an upstream provider led to degraded experiences across Actions, Copilot, and GitHub Pages services. Up to 50% of Actions workflow jobs were stuck in the queuing state, and users faced issues with enabling Actions or registering self-hosted runners. The problem was caused by an unreachable backend resource in the central US region.

GitHub mitigated the issue by updating the replication configuration, which allowed successful requests while one region was unavailable. The company is now enhancing its replication and failover workflows to better handle such situations and reduce recovery time.
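The report does not describe the replication configuration itself. As a simplified illustration of how requests can keep succeeding while one region is unavailable, the sketch below tries each region in order and returns the first successful read. The function read_with_failover, the fetch callable, and the region names are assumptions made for the example and do not reflect GitHub's actual setup.

```python
def read_with_failover(key, regions, fetch):
    """Try each region in order and return the first successful read.

    `fetch(region, key)` is a hypothetical callable that raises
    ConnectionError when the region cannot be reached.
    """
    errors = {}
    for region in regions:
        try:
            return region, fetch(region, key)
        except ConnectionError as exc:
            # Remember why this region failed, then try the next one.
            errors[region] = str(exc)
    raise ConnectionError(f"all regions failed: {errors}")


def example_fetch(region, key):
    # Stand-in backend for the example: one region is unreachable.
    if region == "central-us":
        raise ConnectionError("unreachable")
    return f"{key}@{region}"


region, value = read_with_failover("cfg", ["central-us", "east-us"], example_fetch)
```

Here the read skips the unreachable region and is served by the next one in the list. The failure is only surfaced to the caller when every region is down.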

Future Mitigation Steps

In response to these incidents, GitHub is taking multiple steps to improve its service resilience. These include updating dashboards, enhancing health checks, implementing new alerts, collaborating with partner services, and improving reconnection and circuit-breaking logic. The company is also focused on better workload isolation and enhancing replication and failover workflows.

For real-time updates on status changes and post-incident recaps, users are encouraged to follow GitHub's status page and the GitHub Engineering Blog.


