Mastering Multi-Agent AI Collaboration at Scale

As AI systems become more complex, engineering teams face a critical challenge: enabling multiple autonomous agents to work together harmoniously at scale. In a recent podcast, Intuit's Chase Roossin and Steven Kulesza shared insights on tackling what they call the hardest engineering problem today—multi-agent coordination. This Q&A explores key strategies, pitfalls, and best practices for building robust, scalable multi-agent systems.

What makes multi-agent collaboration the toughest engineering challenge?

Coordinating multiple AI agents at scale introduces complexity that single-agent systems avoid. Each agent may have its own goals, data sources, and decision-making logic, leading to conflicts, resource contention, and unpredictable emergent behaviors. Engineers must design synchronization protocols to prevent deadlocks and race conditions, and ensure agents can communicate reliably even under high load. Unlike the components of a traditional distributed system, AI agents are non-deterministic: their outputs vary based on context and learning, so failures are harder to reproduce and debug. The number of potential pairwise interactions grows as O(n²) for n agents, which quickly overwhelms naive orchestration. At Intuit, the team discovered that the biggest risk is agent interference, where one agent's actions inadvertently disrupt another's state. Without careful design, the system degrades into chaos, making scalability impossible.
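To make the O(n²) growth concrete, here is a small arithmetic sketch. It counts distinct agent pairs when every agent may talk to every other, and compares that with a hub topology in which each agent holds a single connection to a shared bus. The function names are illustrative, and the hub comparison is an assumption added for contrast, not a figure from the podcast.

```python
def pairwise_links(n: int) -> int:
    """Distinct agent pairs when every agent may talk to every other."""
    return n * (n - 1) // 2


def bus_links(n: int) -> int:
    """Connections when each agent talks only to one shared bus."""
    return n


if __name__ == "__main__":
    for n in (5, 50, 500):
        print(f"{n} agents: {pairwise_links(n)} pairs vs {bus_links(n)} bus links")
```

With 50 agents there are already 1,225 possible pairs, and with 500 there are 124,750, while the hub count grows only linearly.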

Mastering Multi-Agent AI Collaboration at Scale — Source: stackoverflow.blog

How can engineers design agents that play nice together?

The foundation lies in clear boundaries and contracts. Each agent should have a well-defined scope and explicit input/output schemas. Intuit engineers recommend using event-driven architectures where agents publish events to a central message bus and subscribe to relevant topics. This decouples agents and allows independent scaling. They also emphasize idempotent operations: if the same event is processed twice, the state remains consistent. Another key is graceful degradation—agents must handle missing data or delayed responses without cascading failures. For example, an agent that requests a credit score should continue working if the score service is temporarily down, perhaps by using a cached value. Finally, chaos engineering helps uncover hidden dependencies by randomly introducing failures in a staging environment.
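Two of the ideas above, idempotent event handling and falling back to a cached credit score, can be sketched in a few lines. This is a minimal in-memory illustration, not Intuit's implementation: `Event`, `LedgerAgent`, and `ScoreClient` are hypothetical names, and a real system would keep the set of seen event IDs and the cache in durable storage.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass(frozen=True)
class Event:
    event_id: str      # unique per logical event; redeliveries reuse the same ID
    customer_id: str
    amount: int


@dataclass
class LedgerAgent:
    """Hypothetical agent that applies each event at most once."""
    balances: dict = field(default_factory=dict)
    seen: set = field(default_factory=set)

    def handle(self, event: Event) -> bool:
        # Idempotency: a redelivered event is recognised by its ID and skipped.
        if event.event_id in self.seen:
            return False
        current = self.balances.get(event.customer_id, 0)
        self.balances[event.customer_id] = current + event.amount
        self.seen.add(event.event_id)
        return True


class ScoreClient:
    """Hypothetical wrapper that falls back to the last known score."""

    def __init__(self, fetch: Callable[[str], int]):
        self._fetch = fetch   # assumed to raise ConnectionError when down
        self._cache = {}

    def get(self, customer_id: str) -> Optional[int]:
        try:
            score = self._fetch(customer_id)
        except ConnectionError:
            # Graceful degradation: serve the cached score, or None if absent.
            return self._cache.get(customer_id)
        self._cache[customer_id] = score
        return score
```

Returning `None` when no cached score exists forces the caller to decide what to do without a score, rather than hiding the outage behind a made-up value.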

What are the most common pitfalls when scaling multi-agent systems?

Three architectural pitfalls stand out. First, tight coupling: when agents directly call each other's APIs, a change in one breaks others. Second, shared mutable state: multiple agents updating the same database without locking leads to data corruption. Third, synchronization bottlenecks, such as a single orchestrator that becomes a single point of failure. Beyond these, Chase Roossin notes that teams often underestimate the need for observability. Without distributed traces and centralized logging, it's nearly impossible to understand why an agent behaved incorrectly. Another trap is over-engineering upfront: trying to design a perfect coordination protocol before understanding actual agent behaviors. Start with a simple mediator pattern and iterate. Finally, testing multi-agent interactions is notoriously hard; use simulation and property-based testing to validate invariants.
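One common guard against the shared-mutable-state pitfall is optimistic concurrency control: each record carries a version number, and a write succeeds only if the version has not changed since the writer read it. The sketch below is an in-memory stand-in for a database row with a version column. `VersionedStore`, `VersionConflict`, and `increment` are illustrative names, and the podcast does not say which locking strategy Intuit uses.

```python
import threading


class VersionConflict(Exception):
    """Raised when a write is based on a stale read."""


class VersionedStore:
    """In-memory stand-in for a database row with a version column."""

    def __init__(self, value=0):
        self._lock = threading.Lock()   # guards only the compare-and-set step
        self._value = value
        self._version = 0

    def read(self):
        with self._lock:
            return self._value, self._version

    def write(self, new_value, expected_version):
        with self._lock:
            if self._version != expected_version:
                raise VersionConflict(
                    f"expected v{expected_version}, found v{self._version}"
                )
            self._value = new_value
            self._version += 1


def increment(store, amount, max_retries=1000):
    """Read-modify-write that retries when another writer got there first."""
    for _ in range(max_retries):
        value, version = store.read()
        try:
            store.write(value + amount, version)
            return
        except VersionConflict:
            continue   # someone else updated the row; re-read and try again
    raise RuntimeError("too much contention")
```

An agent that writes from a stale read gets an explicit `VersionConflict` instead of silently overwriting another agent's update.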

How do you manage communication between AI agents at scale?

Intuit's approach uses asynchronous messaging with guaranteed delivery. Each agent communicates via a durable queue (like Apache Kafka) so messages aren't lost even if the receiving agent is down. They also implement backpressure mechanisms: if an agent falls behind, upstream agents slow down to prevent system overload. For conflict resolution, agents use a leader-election protocol to decide who handles ambiguous requests. Steven Kulesza emphasizes the importance of message schemas: defining messages in Protocol Buffers or Avro, and evolving those schemas carefully, helps preserve backward compatibility. When agents need to negotiate (e.g., scheduling tasks), they use a pattern similar to two-phase commit, but with timeouts to avoid indefinite blocking. For large-scale systems, event sourcing helps replay agent interactions for debugging and auditing.
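The backpressure idea can be illustrated in-process with a bounded queue: when the queue is full, the producer is suspended until the consumer catches up. This is only an analogue using Python's `asyncio.Queue`, not how Kafka or Intuit's pipeline applies backpressure, and the function names and the capacity of 5 are assumptions for the example.

```python
import asyncio


async def producer(queue, n_messages):
    for i in range(n_messages):
        # put() suspends while the queue is full, which slows the producer
        # to the consumer's pace instead of letting work pile up.
        await queue.put(i)
    await queue.put(None)  # sentinel: no more messages


async def consumer(queue, processed, max_depth):
    while True:
        # Record the deepest backlog observed, to show that it stays bounded.
        max_depth[0] = max(max_depth[0], queue.qsize())
        item = await queue.get()
        if item is None:
            break
        await asyncio.sleep(0)  # stand-in for real work; yields to the producer
        processed.append(item)


async def run_pipeline(n_messages=50, capacity=5):
    queue = asyncio.Queue(maxsize=capacity)
    processed, max_depth = [], [0]
    await asyncio.gather(
        producer(queue, n_messages),
        consumer(queue, processed, max_depth),
    )
    return processed, max_depth[0]


if __name__ == "__main__":
    items, depth = asyncio.run(run_pipeline())
    print(f"processed {len(items)} messages, max backlog {depth}")
```

The backlog never exceeds the queue's capacity, so a slow consumer costs the producer throughput rather than costing the system memory.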

What role does orchestration play in multi-agent systems?

Orchestration is the brain of the system, but it must be lightweight. A central orchestrator can coordinate workflows across agents, but it should not be a monolithic controller. Instead, the orchestrator should define high-level goals and leave execution details to agents. This is known as goal-based orchestration. For example, an orchestrator might say 'process customer onboarding' and each agent decides how to fulfill its part. In contrast, choreography (agents react to events without a central director) works well for simple flows but becomes chaotic at scale. The sweet spot is a hybrid approach: use an orchestrator for critical path coordination and allow peer-to-peer communication for non-critical tasks. Intuit also uses circuit breakers in the orchestrator to stop invoking a failing agent and trigger fallback plans.
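A circuit breaker of the kind mentioned above can be sketched as a small wrapper around agent calls. After a number of consecutive failures it stops invoking the agent and goes straight to a fallback, then allows a trial call once a cool-down has passed. The class name, threshold, and cool-down values are illustrative assumptions; the podcast does not describe Intuit's actual breaker.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the circuit is open and no fallback was supplied."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after      # cool-down in seconds
        self._clock = clock                 # injectable so tests can control time
        self._failures = 0
        self._opened_at = None              # None means the circuit is closed

    def call(self, agent_fn, *args, fallback=None):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                # Open: skip the failing agent entirely.
                if fallback is not None:
                    return fallback(*args)
                raise CircuitOpenError("agent temporarily disabled")
            # Half-open: cool-down elapsed, let one trial call through.
        try:
            result = agent_fn(*args)
        except Exception:
            self._failures += 1
            half_open_failed = self._opened_at is not None
            if half_open_failed or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()   # open, or re-open, the circuit
            if fallback is not None:
                return fallback(*args)
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

An orchestrator would typically keep one breaker per downstream agent, for example `breaker.call(score_agent, transaction, fallback=default_plan)`, where both callables are placeholders.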

Can you share an example of a multi-agent system at Intuit?

One example is Intuit's fraud detection pipeline. Multiple AI agents analyze transaction data: one agent checks for stolen cards, another for account takeover patterns, a third for unusual spending behaviors. These agents run in parallel on streaming data. A coordinator agent aggregates their outputs and decides whether to block a transaction. The key challenge was ensuring that the coordinator didn't become a bottleneck. They solved it by using a scatter-gather pattern with a timeout: if an agent doesn't respond within 200 ms, the coordinator uses a default score. Agents share a feature store to avoid redundant computations, and they communicate via a shared log. This design now handles millions of transactions daily with 99.99% uptime. The team learned that agent specialization works best: each agent is an expert in one domain, and the coordinator trusts but verifies.
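The scatter-gather step with a per-agent timeout can be sketched with `asyncio`. The 200 ms budget comes from the description above; the default score of 0.5, the blocking threshold of 0.8, and the rule of blocking on the highest score are assumptions made for illustration, as are all function names.

```python
import asyncio

TIMEOUT_SECONDS = 0.2    # the 200 ms budget described above
DEFAULT_SCORE = 0.5      # assumed neutral score for an agent that stays silent
BLOCK_THRESHOLD = 0.8    # assumed decision threshold


async def score_with_timeout(agent, transaction, timeout=TIMEOUT_SECONDS):
    """Ask one agent for a risk score, substituting a default if it is slow."""
    try:
        return await asyncio.wait_for(agent(transaction), timeout)
    except asyncio.TimeoutError:
        return DEFAULT_SCORE


async def coordinate(agents, transaction, timeout=TIMEOUT_SECONDS):
    # Scatter: query every agent concurrently.
    # Gather: collect the scores in the same order as `agents`.
    scores = await asyncio.gather(
        *(score_with_timeout(agent, transaction, timeout) for agent in agents)
    )
    # Assumed aggregation rule: block if any single agent is highly suspicious.
    return max(scores) >= BLOCK_THRESHOLD, scores
```

Because the agents run concurrently, the coordinator's latency is bounded by the timeout rather than by the sum of the agents' response times, which is what keeps it from becoming the bottleneck.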
