How Meta Uses AI Agents to Boost Hyperscale Efficiency: Q&A

Last updated: 2026-05-01

Meta's infrastructure serves over 3 billion users, so even tiny performance changes carry massive energy impacts. To tackle this, the company built a unified AI agent platform that automates finding and fixing performance issues across its global data centers. This Q&A explores how the Capacity Efficiency Program leverages artificial intelligence to save megawatts of power and free engineers from manual regression investigations.

What is Meta's Capacity Efficiency Program and why is it needed?

The Capacity Efficiency Program is Meta's initiative to optimize performance and reduce power consumption across its hyperscale infrastructure. At this scale, even a 0.1% performance regression can waste enough electricity to power thousands of homes. The program operates on two fronts: defense (detecting and fixing regressions that slip into production) and offense (proactively finding opportunities to improve efficiency). Without automation, engineers would spend countless hours manually investigating each issue—a bottleneck that limits how many improvements can be deployed. By encoding domain expertise into AI agents, Meta aims to create a self-sustaining engine that handles the long tail of efficiency problems without proportionally growing the team.

Source: engineering.fb.com

How do unified AI agents automate efficiency at Meta?

Meta built a unified platform where AI agents combine encoded domain expertise from senior efficiency engineers with standardized tool interfaces. These agents are composed of reusable skills that can be stacked to handle complex workflows. On the defense side, agents automatically investigate regressions flagged by FBDetect, root-cause them to a specific pull request, and even prepare mitigation fixes. On the offense side, they scan for optimization opportunities and generate ready-to-review pull requests. This reduces manual investigation time from ~10 hours to ~30 minutes, and for many cases, the path from opportunity to pull request is fully automated. The platform scales MW delivery across more product areas without requiring additional headcount.
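The article doesn't show Meta's internal APIs, but the "stackable skills" idea can be sketched in plain Python: each skill is a step that enriches a shared investigation record, and a pipeline runs them in order. All names here (`Regression`, `root_cause`, the `PR-1234` culprit) are hypothetical illustrations, not Meta code.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Regression:
    """A flagged regression plus the findings accumulated so far."""
    service: str
    cpu_delta_pct: float
    findings: dict = field(default_factory=dict)

# A "skill" is just a step that takes the record and returns it enriched.
Skill = Callable[[Regression], Regression]

def root_cause(reg: Regression) -> Regression:
    # Placeholder: a real skill would bisect recent pull requests.
    reg.findings["culprit_pr"] = "PR-1234"  # hypothetical PR id
    return reg

def draft_mitigation(reg: Regression) -> Regression:
    # Placeholder: a real skill would generate a candidate fix.
    reg.findings["mitigation"] = f"revert {reg.findings['culprit_pr']}"
    return reg

def run_pipeline(reg: Regression, skills: list[Skill]) -> Regression:
    """Stack skills in order, mirroring the defense workflow."""
    for skill in skills:
        reg = skill(reg)
    return reg

result = run_pipeline(Regression("feed-ranker", 0.4), [root_cause, draft_mitigation])
print(result.findings["mitigation"])
```

Because every skill shares the same signature, new expertise can be added as one more function in the list rather than a change to the workflow itself.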

What is FBDetect and how does it help with regressions?

FBDetect is Meta's in-house regression detection tool. It monitors production resource usage and catches thousands of performance regressions each week. When a regression is identified, the tool triggers an investigation workflow. Previously, engineers would manually analyze the regression, trace it to a code change, and deploy a fix. Now, AI agents expedite this process by automating the root-cause analysis and, in many cases, applying the fix automatically. Faster resolution means fewer wasted megawatts compounding across the fleet over time. FBDetect forms the backbone of the program's defense strategy, but the AI agent platform is what turns detection into rapid resolution.
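FBDetect's internals aren't public, but the core idea of flagging small relative increases in resource usage can be shown with a minimal threshold check. The 0.1% cutoff echoes the article's "even a 0.1% regression matters" point; the service names and numbers are made up.

```python
def detect_regressions(baseline: dict, current: dict, threshold_pct: float = 0.1) -> list:
    """Flag services whose resource usage rose by more than threshold_pct percent."""
    flagged = []
    for service, base in baseline.items():
        cur = current.get(service, base)
        delta_pct = (cur - base) / base * 100
        if delta_pct > threshold_pct:
            flagged.append((service, round(delta_pct, 2)))
    return flagged

baseline = {"feed": 100.0, "ads": 200.0, "search": 150.0}
current = {"feed": 100.05, "ads": 201.0, "search": 149.0}
print(detect_regressions(baseline, current))  # only "ads" exceeds the 0.1% cutoff
```

In a real system each flagged entry would trigger the investigation workflow described above rather than just being printed.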

How does the offensive side work with AI-assisted opportunity resolution?

The offensive side of Meta's efficiency program focuses on proactively finding performance optimizations before they become problems. AI agents scan the codebase and infrastructure for opportunities to improve energy efficiency—for example, tweaking algorithms, reducing unnecessary computations, or optimizing data structures. These agents encode the same systematic thinking that senior efficiency engineers use. They generate pull requests with proposed changes, which are then reviewed by human engineers. This approach allows the program to handle a growing volume of wins that would never be addressed manually. Every half (six-month period), the AI-assisted opportunity resolution expands to more product areas, compounding the power savings across Meta's massive fleet.
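The scan-rank-propose loop on the offense side can be illustrated with a toy ranker that orders candidate optimizations by estimated savings and emits ready-to-review PR titles. The opportunity list, MW estimates, and title format are all invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Opportunity:
    description: str
    estimated_mw: float

def rank_and_draft(opportunities: list, top_n: int = 2) -> list:
    """Rank by estimated savings and emit PR titles (hypothetical format)."""
    ranked = sorted(opportunities, key=lambda o: o.estimated_mw, reverse=True)
    return [f"[efficiency] {o.description} (~{o.estimated_mw} MW)" for o in ranked[:top_n]]

opps = [
    Opportunity("cache recomputed embeddings", 1.2),
    Opportunity("drop redundant logging", 0.3),
    Opportunity("batch small RPCs", 2.0),
]
for title in rank_and_draft(opps):
    print(title)
```

Capping the output at `top_n` mirrors the practical constraint that human reviewers, not the scanner, are the bottleneck for proposed changes.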

How much power has this program saved?

Meta's Capacity Efficiency Program has recovered hundreds of megawatts (MW) of power—enough to power hundreds of thousands of American homes for a year. These savings come from both preventing regressions (defense) and implementing efficiency improvements (offense). By compressing manual investigations from hours to minutes, the AI agents ensure that every identified opportunity is captured and deployed quickly. The automated pipeline scales efficiently: as the fleet grows and new product areas are added, the AI platform keeps MW delivery rising without requiring a proportional increase in human engineers. This is critical for Meta's sustainability goals and cost management at hyperscale.
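A quick sanity check shows why "hundreds of MW" maps to "hundreds of thousands of homes." Assuming a rough average U.S. household consumption of about 10,700 kWh per year (an assumption here, not a figure from the article):

```python
# Hedged back-of-envelope check: homes covered per MW of continuous power.
AVG_US_HOME_KWH_PER_YEAR = 10_700            # assumed average; not from the article
avg_home_kw = AVG_US_HOME_KWH_PER_YEAR / (365 * 24)  # ~1.22 kW continuous draw
homes_per_mw = 1_000 / avg_home_kw           # ~800 homes per MW
recovered_mw = 300                           # illustrative "hundreds of MW"
print(f"~{homes_per_mw:.0f} homes/MW -> ~{recovered_mw * homes_per_mw:,.0f} homes")
```

At roughly 800 homes per MW, 300 MW corresponds to about a quarter-million homes, consistent with the article's claim.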

How long does it take to investigate regressions manually vs with AI?

Before AI agents, a typical regression investigation took an experienced engineer around 10 hours of manual effort. This included analyzing performance data, root-causing the regression to a specific code change, and developing a mitigation fix. With the AI-powered platform, that same investigation is compressed to approximately 30 minutes—a 20x speedup. In many fully automated cases, the entire path from identifying an efficiency opportunity to generating a ready-to-review pull request requires no human intervention. This dramatic reduction in investigation time means that issues are resolved before they can significantly impact power consumption, and engineers can focus on higher-value innovation rather than repetitive debugging.

What is the end goal of this AI-driven efficiency engine?

The ultimate vision is a self-sustaining efficiency engine where AI handles the entire lifecycle of performance improvements: detecting issues, diagnosing root causes, generating fixes, and deploying them. While humans still review critical changes, the goal is to automate the long tail of small, repetitive optimizations that individually save little but collectively save megawatts. By continuously learning from new data and human feedback, the AI agents become more capable over time. This allows Meta to scale its infrastructure sustainably without proportionally increasing the team focused on capacity efficiency. The engine is designed to keep delivering MW savings across a growing number of product areas, making efficiency a built-in property of the development process.

How do these AI agents encode domain expertise?

Meta's AI agents are built from reusable, composable skills that capture the decision-making processes of senior efficiency engineers. These skills are encoded into the agent platform using a standardized tool interface, which allows different agents to collaborate on complex tasks. For example, one skill might know how to analyze CPU usage patterns, while another understands memory allocation bottlenecks. By combining these skills, the agent can mimic a human expert's systematic investigation workflow. The platform also ingests historical data from previous regressions and optimizations, so the agents improve over time. This encoding of tribal knowledge ensures that even as senior engineers move to other roles, their expertise remains embedded in the automation infrastructure.
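A standardized tool interface like the one described can be sketched with a structural `Protocol`: every skill exposes the same `run` method over a shared context, so a CPU-analysis skill and a memory-analysis skill compose without knowing about each other. The class names and findings below are hypothetical, not Meta's actual skills.

```python
from typing import Protocol

class Skill(Protocol):
    """Standardized interface every skill implements, so agents can compose them."""
    name: str
    def run(self, context: dict) -> dict: ...

class CpuAnalysis:
    name = "cpu_analysis"
    def run(self, context: dict) -> dict:
        # Hypothetical: inspect CPU samples and record a hotspot.
        context["hotspot"] = "serialize_loop"
        return context

class AllocAnalysis:
    name = "alloc_analysis"
    def run(self, context: dict) -> dict:
        # Hypothetical: flag an allocation bottleneck near the recorded hotspot.
        context["bottleneck"] = f"allocations in {context['hotspot']}"
        return context

def investigate(skills: list, context: dict) -> dict:
    """Run skills in sequence over a shared context, expert-workflow style."""
    for skill in skills:
        context = skill.run(context)
    return context

report = investigate([CpuAnalysis(), AllocAnalysis()], {"service": "feed"})
print(report["bottleneck"])  # allocations in serialize_loop
```

Encoding each piece of tribal knowledge as one more class behind this interface is what lets the expertise outlive any individual engineer's tenure on the team.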