We pushed Atmos Pro to its limits by running a single PR that affected over 2,000 components — the kind of monorepo-scale change that stress tests every layer of the system. The goal was to validate that dispatch, GitHub API interaction, and status reconciliation all hold up under real-world load. They didn't, at first. So we fixed them.
The original dispatch architecture processed all workflow dispatches sequentially in a single step. At 2,000+ stacks, that exceeded serverless function timeouts and left hundreds of stacks permanently stuck. GitHub's secondary rate limits compounded the problem by silently blocking PR comment updates after dozens of rapid edits.
We redesigned the dispatch pipeline around a fan-out architecture. A thin coordinator step now emits one event per stack, and independent workers handle each dispatch with their own timeout budget, retries, and error isolation. Per-repository concurrency limits prevent GitHub API rate limiting, and adaptive comment debounce scales update frequency based on deployment size — faster feedback for small changes, throttled updates for large ones.
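The coordinator's job can be sketched in a few lines. This is an illustrative minimal version, not the production code: `Stack`, `DispatchEvent`, and the `emit` callback are hypothetical names standing in for whatever queue or event bus the real pipeline publishes to. The point is that the coordinator only emits one lightweight event per stack, so its work is O(n) event emission rather than n slow API calls.

```typescript
// Hypothetical shapes standing in for the real pipeline's types.
interface Stack {
  id: string;
  repo: string;
}

type DispatchEvent = { stackId: string; repo: string };

// The thin coordinator: one event per stack, no API calls, no long-running
// work. In production, `emit` would publish to a queue that workers consume.
function fanOut(stacks: Stack[], emit: (e: DispatchEvent) => void): number {
  for (const stack of stacks) {
    emit({ stackId: stack.id, repo: stack.repo });
  }
  return stacks.length;
}

// Example: two stacks produce two independent events.
const emitted: DispatchEvent[] = [];
const count = fanOut(
  [
    { id: "vpc-use1", repo: "acme/infra" },
    { id: "eks-use1", repo: "acme/infra" },
  ],
  (e) => emitted.push(e)
);
```

Because the coordinator finishes almost immediately, it can no longer hit a serverless timeout no matter how many stacks a PR touches; all of the slow, failure-prone work moves into the workers.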
The reconciliation system now catches stacks in every state, including those that were never dispatched because of a crash or a lost event. A periodic sweep resolves stale runs by checking their actual status on GitHub and updating the PR comment accordingly.
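The sweep logic can be sketched as a pure function over stored run records. This is a simplified model under stated assumptions: the `StoredRun` shape, the `never_dispatched` status, the 10-minute staleness window, and the `fetchActualStatus` lookup (a stand-in for a GitHub API call) are all illustrative, not the actual Atmos Pro schema.

```typescript
type RunStatus = "pending" | "success" | "failure" | "never_dispatched";

interface StoredRun {
  stackId: string;
  status: RunStatus;
  updatedAt: number; // epoch ms of the last status write
}

// Assumed staleness window: a run pending longer than this gets re-checked.
const STALE_AFTER_MS = 10 * 60 * 1000;

function reconcile(
  runs: StoredRun[],
  now: number,
  // Stand-in for the GitHub API lookup; null means no run was ever created.
  fetchActualStatus: (stackId: string) => RunStatus | null
): StoredRun[] {
  return runs.map((run) => {
    if (run.status !== "pending" || now - run.updatedAt < STALE_AFTER_MS) {
      return run; // fresh or already terminal: leave untouched
    }
    const actual = fetchActualStatus(run.stackId);
    // A missing run on GitHub means the dispatch event was lost in flight.
    return { ...run, status: actual ?? "never_dispatched", updatedAt: now };
  });
}

// Example: one stale run that completed on GitHub, one fresh run,
// one stale run whose dispatch event never arrived.
const now = 1_700_000_000_000;
const reconciled = reconcile(
  [
    { stackId: "vpc", status: "pending", updatedAt: now - 11 * 60 * 1000 },
    { stackId: "eks", status: "pending", updatedAt: now },
    { stackId: "rds", status: "pending", updatedAt: now - 11 * 60 * 1000 },
  ],
  now,
  (stackId) => (stackId === "vpc" ? "success" : null)
);
```

Treating GitHub as the source of truth and the database as a cache to be repaired makes the sweep idempotent: running it twice on the same data changes nothing the second time.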
Processing 2,000 stacks now completes in roughly 100 seconds with a concurrency limit of 20 workers. Each worker independently dispatches a single workflow, updates the database, and triggers a debounced comment refresh. If any individual dispatch fails, only that stack is affected — every other stack continues without interruption.