
GitHub's Reliability Journey: Addressing Growth and Incidents

2026-05-03 18:04:33

GitHub recently experienced two significant availability incidents that impacted users. In response, the engineering team has shared insights into the challenges and ongoing improvements. This Q&A covers key questions about what happened, why, and how GitHub is working to ensure a more resilient platform.

What caused the recent GitHub availability incidents?

The two incidents were triggered by a combination of rapid growth and system complexity. While the specific root causes differed, both stemmed from exponential increases in repository creation, pull request activity, API usage, and automation driven by agentic development workflows. That activity accelerated sharply from December 2025 onward, outpacing existing capacity planning. GitHub’s legacy architectures, such as storing webhooks in MySQL and relying on a monolithic Ruby codebase, struggled to keep up: as queues deepened, cache misses converted into database load, indexes fell behind, and failures cascaded. GitHub called the incidents unacceptable and apologized for the disruption. Since then, the team has prioritized understanding these failure modes and implementing both short-term fixes and long-term architectural changes to prevent recurrence.

GitHub's Reliability Journey: Addressing Growth and Incidents
Source: github.blog

Why is GitHub planning for 30 times today's capacity?

GitHub originally began a plan in October 2025 to increase capacity 10X to improve reliability and failover. By February 2026, however, the growth trajectory made it clear that a 10X increase would be insufficient. The primary driver is the rapid shift in how software is built, specifically the rise of agentic development—automated agents that create repositories, submit pull requests, and run workflows at unprecedented rates. Nearly every metric—from repository creation to API calls—shows exponential growth. To ensure the platform can handle this future, GitHub redesigned its infrastructure to support 30X current capacity. This means not just scaling hardware but also rethinking software architecture, caching strategies, and dependency management to efficiently handle massive workloads without sacrificing availability.

How does the rise of agentic development affect GitHub's infrastructure?

Agentic development workflows, which became prominent in late 2025, place stress on multiple systems simultaneously. Unlike human-driven development, agents operate at high speed and volume, creating a continuous flood of repository creations, pull requests, and API calls. This rapid activity does not stress one service in isolation; a single automated PR can trigger Git storage, mergeability checks, branch protection, GitHub Actions, search, notifications, permissions, webhooks, APIs, background jobs, caches, and databases. At scale, small inefficiencies compound: queues deepen, cache misses become database load, indexes fall behind, and retries amplify traffic. One slow dependency can affect several product experiences. GitHub’s engineering team recognized that the platform must be designed to degrade gracefully under such pressure, with isolated services and minimized blast radius.

What systems are stressed when a pull request is processed?

A single pull request touches a surprising number of subsystems. First, it involves Git storage for the repository data. Then mergeability checks run against branch protection rules. GitHub Actions may be triggered for CI/CD. Search indexes are updated for cross-references, notifications are sent to watchers, permissions are verified, webhooks fire to integrations, APIs serve status updates, and background jobs process tasks like status checks. All of these depend on caches and databases. At high traffic, any inefficiency in this chain can create cascading delays. For example, a cache miss turns into a database query, which may then contend with other queries, causing indexes to fall behind. Retries from timeouts add further load. This complexity underscores why GitHub’s reliability efforts focus on reducing hidden coupling and ensuring one subsystem’s pressure doesn’t bring down the whole platform.
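The "cache miss turns into a database query" mechanism follows from the common cache-aside pattern. A minimal Go sketch, assuming a simple in-memory cache in front of a database (the types and keys are hypothetical, not GitHub's code), shows why a cold or evicted cache translates directly into database load:

```go
package main

import (
	"fmt"
	"sync"
)

// store simulates a database whose query count we can observe.
type store struct {
	mu      sync.Mutex
	queries int
	data    map[string]string
}

func (s *store) query(key string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.queries++ // every cache miss lands here as real database load
	return s.data[key]
}

// cacheAside reads through an in-memory cache; on a miss it falls
// back to the database and populates the cache for later readers.
type cacheAside struct {
	mu    sync.Mutex
	cache map[string]string
	db    *store
}

func (c *cacheAside) get(key string) string {
	c.mu.Lock()
	if v, ok := c.cache[key]; ok {
		c.mu.Unlock()
		return v // hit: no database round trip
	}
	c.mu.Unlock()
	v := c.db.query(key) // miss: becomes a database query
	c.mu.Lock()
	c.cache[key] = v
	c.mu.Unlock()
	return v
}

func main() {
	db := &store{data: map[string]string{"pr:1": "open"}}
	c := &cacheAside{cache: map[string]string{}, db: db}
	for i := 0; i < 1000; i++ {
		c.get("pr:1")
	}
	fmt.Println(db.queries) // 1: only the first read reached the database
}
```

With a warm cache, 1000 reads cost one query; if the cache is cold or entries are evicted under pressure, all 1000 reads hit the database at once, which is exactly the amplification that lets one subsystem's slowness cascade.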

What are GitHub's top priorities for improving reliability?

GitHub’s priorities are clear: availability first, then capacity, then new features. This means that before scaling or adding functionality, the team focuses on making the platform resilient. Specific actions include reducing unnecessary work, improving caching at multiple levels, isolating critical services (like Git and Actions) from other workloads, removing single points of failure, and moving performance-sensitive code out of the Ruby monolith into more efficient systems (like Go). This is fundamentally distributed systems work: reducing hidden coupling, limiting blast radius, and ensuring graceful degradation when one subsystem is under pressure. The team acknowledges that the recent incidents highlight where more work is needed, but they are making progress quickly by addressing the highest risk items first.


What short-term actions did GitHub take to address bottlenecks?

In the short term, GitHub resolved several bottlenecks that emerged faster than anticipated. They moved webhooks out of MySQL to a different backend, reducing database load. The user session cache was redesigned to avoid frequent re-authentication overhead. Authentication and authorization flows were restructured to minimize database queries. Additionally, GitHub leveraged its migration to Azure to rapidly provision more compute, standing up substantially more capacity to absorb the growth. These actions provided immediate relief while longer-term architectural changes were being developed. The team also accelerated the migration of performance-sensitive code from the Ruby monolith into Go, which offers better concurrency and lower latency for critical paths. These steps were part of a broader effort to buy time for deeper reliability improvements.

How is GitHub isolating critical services and minimizing blast radius?

After addressing immediate bottlenecks, GitHub focused on isolating critical services like Git and GitHub Actions from other workloads. This isolation helps minimize the blast radius when a problem occurs. The team began with a careful analysis of dependencies and traffic tiers to understand what needed to be separated. They identified single points of failure and worked to reduce them. For each dependency, they assessed risk and prioritized fixes accordingly. The goal is to ensure that an issue with, say, notifications or search does not impact Git operations or Actions execution. By decoupling services and limiting the interconnectedness, GitHub makes the platform degrade gracefully: one subsystem under pressure won’t cascade into a full outage. This work is ongoing, with the highest risk items being addressed first.

What is GitHub's strategy regarding cloud migration and multi-cloud?

Even before the recent incidents, GitHub was in the process of migrating from its smaller custom data centers into the public cloud. The growth surge accelerated this move. Now, the team is also working on a path to multi-cloud to avoid dependency on a single cloud provider. This strategy involves not just moving workloads but also designing for portability and resilience across cloud environments. Multi-cloud will reduce the risk of a provider-level outage affecting GitHub availability. In combination with service isolation and capacity scaling, the cloud migration provides the elasticity needed to handle sudden traffic spikes from agentic development. The overall direction is to create a more robust infrastructure that can scale dynamically while maintaining high availability for all users.

Deploying OpenAI’s GPT-5.5 on Microsoft Foundry: A Step-by-Step Guide for Enterprise Teams Beyond Basic JSON Formatters: Discover a Tool That Repairs, Validates, and Analyzes Your Data How to Launch Bitcoin Banking Services with Galoy's All-in-One Platform Your Step-by-Step Guide to Accessing the 9to5Mac Daily Podcast and Catching Apple's Q2 Earnings Report Best-Ever Prices on Birdfy Smart Feeders Just in Time for Mother's Day: Up to $100 Off