A Deep Survey of Agent Async Communication Solutions

I. Where the Problem Comes From

Multi-Agent systems are becoming the standard architecture for AI applications. Behind a single user request, there may be an Orchestrator, several Worker Agents, data retrieval Agents, and tool-calling Agents all collaborating in parallel. This is not the future — it's what Claude Code, AutoGen, CrewAI, and LangGraph users are building every day.

In these systems, Agents need to communicate. Task distribution, result delivery, state synchronization, anomaly alerts — all of these are communication problems.

But Agents have a property that is fundamentally different from services: Agents are ephemeral and their online status is unpredictable. A Worker Agent that completes a task is destroyed. An Agent on an edge device rises and falls with network connectivity. A human approval node in the workflow may not come online for hours.

In multi-Agent scenarios, this property leaves holes in every communication approach built on the assumption that "the other party is online."

The problem is real. Let's look at how the industry is responding.


II. The Current Landscape: Three Layers of Effort

Layer One: Protocol Standards

This is where the most noise is right now. Google, IBM, and the Linux Foundation were all highly active here in 2025.

A2A (Agent2Agent, Google)

Google released A2A in April 2025 and donated it to the Linux Foundation in June of the same year; more than 100 technology companies support it. It is positioned as the standard for Agent-to-Agent communication, solving interoperability between Agents built on different frameworks by different vendors.

Technically, A2A is based on HTTP/JSON-RPC 2.0, with support for SSE streaming and webhook callbacks. For long-running tasks, A2A's approach is: the client can disconnect while the task continues executing, and the server pushes status updates to a webhook the client registered; alternatively, the client polls for the task state once it reconnects.

This approach solves the semantic layer: how Agents describe capabilities, how they hand off tasks, how they express task state. But the fundamental transport hasn't changed. It is still short-lived HTTP connections, with async behavior achieved through webhook registration, server-side push, and client polling. If the receiver is offline, the webhook can't be delivered, and the message is either lost or piles up on the sender side. This is best-effort delivery, not store-and-forward.
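The failure mode described above can be made concrete with a small in-memory sketch. Nothing here is A2A's actual API: Receiver, BestEffortSender, and max_backlog are hypothetical names, and the HTTP call is replaced by a plain method call so the example runs without a network. It only illustrates what "lost or piles up on the sender side" means for a sender that has no broker behind it.

```python
from collections import deque


class Receiver:
    """Hypothetical webhook endpoint that is sometimes offline."""

    def __init__(self):
        self.online = False
        self.received = []

    def post(self, msg):
        # Stands in for an HTTP POST to a registered webhook URL.
        if not self.online:
            raise ConnectionError("receiver offline")
        self.received.append(msg)


class BestEffortSender:
    """Tries each delivery once; failures pile up in a bounded in-memory backlog."""

    def __init__(self, max_backlog=2):
        self.backlog = deque()
        self.max_backlog = max_backlog
        self.dropped = []

    def send(self, receiver, msg):
        try:
            receiver.post(msg)
            return True
        except ConnectionError:
            # No broker to hand the message to: keep it locally,
            # and drop the oldest entry once the backlog is full.
            if len(self.backlog) >= self.max_backlog:
                self.dropped.append(self.backlog.popleft())
            self.backlog.append(msg)
            return False
```

With the receiver offline, three sends leave two messages stuck on the sender and one dropped. If the sender is itself an ephemeral Agent that exits, the backlog disappears with it.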

ACP (Agent Communication Protocol, IBM BeeAI)

IBM's positioning is "Slack + Email + Jira for Agents." ACP is REST-based, async by default with sync as an option, and supports an Agent registry and cross-platform interoperability. It is now hosted by the Linux Foundation.

ACP's REST-first design makes integration simpler, but it inherits REST's fundamental constraint: if the other party is offline when the caller sends a request, the request fails. ACP defines async message semantics at the protocol level, but it doesn't solve offline delivery at the transport layer — that's still left to the application to handle.

ANP (Agent Network Protocol)

Focused on Agent discovery and communication across the open internet, using DID-based identity and JSON-LD graph structures for decentralized Agent addressing. More macro in scope, targeting cross-organization and cross-platform Agent interoperability. The transport layer similarly has no native offline delivery support.

Common limitation at this layer: All three protocols solve semantic problems — how Agents describe themselves, how they hand off tasks, how they declare capabilities. None of them natively solve "what happens when the other party is offline" at the transport layer.


Layer Two: Infrastructure

Some teams recognized the transport problem and started reaching for existing MQ infrastructure.

NATS JetStream

Currently the closest existing infrastructure to what Agent async communication needs. JetStream adds persistence on top of Core NATS pub/sub — messages are written to a stream, subscribers receive them when they come online, replayed from the point they were last consumed. Store-and-forward is native, not bolted on.
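The replay behavior described above can be modeled in a few lines. This is an in-memory toy, not the NATS client API: Stream, publish, fetch, and ack are hypothetical names chosen for illustration. It shows the core idea of a persisted log plus a per-consumer cursor, which is what lets a subscriber that was offline pick up where it left off.

```python
class Stream:
    """Toy model of a persisted stream with durable consumer cursors."""

    def __init__(self):
        self._log = []       # append-only message log
        self._cursors = {}   # consumer name -> highest acknowledged sequence

    def publish(self, msg):
        """Append a message and return its 1-based sequence number."""
        self._log.append(msg)
        return len(self._log)

    def fetch(self, consumer, batch=10):
        """Return up to `batch` unacknowledged (sequence, message) pairs."""
        start = self._cursors.get(consumer, 0)
        return list(enumerate(self._log[start:start + batch], start=start + 1))

    def ack(self, consumer, seq):
        """Record that `consumer` has processed everything up to `seq`."""
        self._cursors[consumer] = max(self._cursors.get(consumer, 0), seq)
```

Messages published while a consumer is offline stay in the log. When the consumer returns, fetch starts after its last acknowledged sequence, and a second consumer with its own name sees the full log independently.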

But NATS is a general-purpose messaging system with no Agent concept. Using NATS JetStream for Agent communication requires designing mailbox semantics yourself, managing stream lifecycle, handling priority routing, and implementing capability discovery. Every team is building their own wrapper, each one different, none interoperable.

AWS SQS / Azure Service Bus

Enterprise-grade options. AWS Bedrock documentation explicitly recommends SQS for Agent async decoupling: Bedrock Agent → SQS → Lambda → target Agent. This gives message persistence plus consumer-rate control matched to the target Agent's processing capacity.

This works, but it's purely using traditional MQ as a pipe with no Agent-native design. No mailbox concept, no Agent addressing semantics, no capability discovery, no support for human-in-the-loop workflows. And it's locked to the cloud provider — complex to deploy on-premise or at the edge.

Kafka

Some teams have tried Kafka for Agent async communication. Kafka's persistence is strong, but its design assumptions conflict with the Agent scenario: topics are typically pre-created by operators, partitions need to be planned out, and consumer group semantics are oriented toward batch data consumption rather than point-to-point mailboxes. Agents are disposable; Kafka assumes resources are long-lived and carefully maintained. The design philosophies are fundamentally opposed.

Common limitation at this layer: Infrastructure capabilities are there, but Agent-layer semantics need to be wrapped by every team. No standards, no interoperability, constant wheel-reinvention.


Layer Three: Tooling

This layer has the closest conceptual proximity to "Agent mailboxes," but all the solutions are local tools, not systemic infrastructure.

mcp_agent_mail (GitHub, 2025)

A Git + SQLite-based Agent mail coordination system. Each Agent has an independent inbox, with support for urgent-only filtering, timestamp-based filtering, persistent archiving, and file-lock conflict avoidance. Specifically designed for parallel multi-Agent programming workflows (Claude Code + Codex CLI collaboration).

Mailbox semantics are solid, but the architecture is a local tool: SQLite for storage, Git for versioning, exposed to Agents via an MCP server. No network protocol layer, no real-time push — Agents communicate by reading and writing SQLite, not via broker delivery.
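To make "Agents communicate by reading and writing SQLite" concrete, here is a minimal sketch of a per-Agent inbox with urgent-only and timestamp filters, using Python's standard sqlite3 module. The table layout and the function names (open_inbox, send, read_inbox) are assumptions made for illustration, not mcp_agent_mail's actual schema or interface.

```python
import sqlite3


def open_inbox(path=":memory:"):
    """Open (or create) the shared message store. Schema is hypothetical."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS messages (
               id        INTEGER PRIMARY KEY AUTOINCREMENT,
               recipient TEXT NOT NULL,
               sender    TEXT NOT NULL,
               body      TEXT NOT NULL,
               urgent    INTEGER NOT NULL DEFAULT 0,
               sent_at   REAL NOT NULL
           )"""
    )
    return db


def send(db, sender, recipient, body, sent_at, urgent=False):
    """Write one message into the recipient's inbox."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT INTO messages (recipient, sender, body, urgent, sent_at) "
            "VALUES (?, ?, ?, ?, ?)",
            (recipient, sender, body, int(urgent), sent_at),
        )


def read_inbox(db, recipient, urgent_only=False, since=0.0):
    """Return (sender, body) pairs for one Agent, oldest first."""
    sql = "SELECT sender, body FROM messages WHERE recipient = ? AND sent_at >= ?"
    if urgent_only:
        sql += " AND urgent = 1"
    sql += " ORDER BY id"
    return db.execute(sql, (recipient, since)).fetchall()
```

The recipient does not need to be running when send is called; it simply queries its rows the next time it starts. What this design cannot do is notify a running Agent that something has arrived, which is the missing real-time push.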

agent-message-queue (GitHub, 2026)

Maildir-style file queue. Each Agent has an independent mailbox, crash-safe atomic writes, isolated session support. Lighter than mcp_agent_mail, but equally a local filesystem implementation with no network protocol layer.
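The crash-safe write that Maildir-style queues rely on is the classic "write to tmp, then rename" pattern. The sketch below shows that pattern in general terms; it is not agent-message-queue's code, and make_mailbox, deliver, and collect are hypothetical helper names. It assumes tmp and new live on the same filesystem, which is what makes the rename atomic.

```python
import os
import uuid
from pathlib import Path


def make_mailbox(root, agent):
    """Create the tmp/new/cur directories for one Agent's mailbox."""
    box = Path(root) / agent
    for sub in ("tmp", "new", "cur"):
        (box / sub).mkdir(parents=True, exist_ok=True)
    return box


def deliver(box, body):
    """Write the message under tmp/, then atomically move it into new/."""
    name = uuid.uuid4().hex
    tmp_path = box / "tmp" / name
    with open(tmp_path, "w", encoding="utf-8") as f:
        f.write(body)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes are on disk before the rename
    final_path = box / "new" / name
    os.replace(tmp_path, final_path)  # readers never see a half-written file
    return final_path


def collect(box):
    """Read every message in new/ and move it to cur/ once read."""
    bodies = []
    for path in sorted((box / "new").iterdir()):
        bodies.append(path.read_text(encoding="utf-8"))
        os.replace(path, box / "cur" / path.name)
    return bodies
```

A crash during deliver leaves at most a stray file in tmp/, never a partial message in new/. Because file names here are random, collect makes no ordering promise; a real queue would likely encode a timestamp or sequence number in the name.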

agenticmail (GitHub, 2026)

A more radical direction: give each Agent a real email address and phone number. Agents communicate via SMTP/IMAP, supporting async mode (call_agent sends an email when done). 62 MCP tools covering email send/receive, SMS, and task coordination.

The intuition is right — email systems are inherently async and naturally solve offline delivery. But the architecture takes a roundabout path: using real SMTP/IMAP for Agent communication means high latency, heavy infrastructure, and unsuitability for millisecond-level Agent coordination.

Common limitation at this layer: Mailbox semantics are there, but all solutions are local tools — local filesystems, or borrowed real email infrastructure. None are a purpose-built network-level broker for Agents, and none provide the dual-track mechanism of real-time push plus persistent fallback.


III. The Root Deficiency of Current Approaches

Looking across all three layers, a shared blind spot emerges:

Every approach solves a local problem. No one has treated "Agent async communication" as a complete infrastructure problem to be designed from scratch.

Specifically, no existing solution simultaneously satisfies all four of the following:

1. Native offline delivery (store-and-forward). Not webhook polling. Not application-layer retries. A transport-layer native guarantee: once a message is sent, the sender's job is done regardless of whether the recipient is online; the recipient will receive it when it comes online.

2. Agent-native addressing semantics. Not a topic. Not a queue. Not an exchange. Concepts that are natural to an Agent: mailbox, inbox, broadcast channel. Developers writing Agents shouldn't need to think about underlying MQ resource management.

3. Lightweight, works out of the box. No pre-created topics. No partition planning. No operational expertise required. When an Agent needs to communicate, one line to request a mailbox, get an address, send a message.

4. Real-time push + persistence, both tracks. Real-time push when online, persistent waiting when offline. Both states are transparent to the sender: send it and forget it.
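The four requirements above can be sketched together in one toy broker. This is an illustration under stated assumptions, not a description of any existing product: MailboxBroker and its methods are hypothetical names, storage is in memory (a real broker would persist to disk), and "push" is modeled as a direct callback rather than a network connection.

```python
import uuid


class MailboxBroker:
    """Toy broker: push to online Agents, hold messages for offline ones."""

    def __init__(self):
        self._stored = {}   # address -> list of messages awaiting delivery
        self._online = {}   # address -> callback of the connected Agent

    def create_mailbox(self, name):
        """One call to get an address; no topics or partitions to plan."""
        address = f"{name}-{uuid.uuid4().hex[:8]}"
        self._stored[address] = []
        return address

    def send(self, address, msg):
        """Fire and forget: the sender never checks whether the peer is online."""
        if address not in self._stored:
            raise KeyError(f"unknown mailbox: {address}")
        handler = self._online.get(address)
        if handler is not None:
            handler(msg)                        # real-time track
        else:
            self._stored[address].append(msg)   # store-and-forward track

    def connect(self, address, handler):
        """Agent comes online: drain the backlog first, then receive pushes."""
        backlog = self._stored[address]
        while backlog:
            handler(backlog.pop(0))
        self._online[address] = handler

    def disconnect(self, address):
        """Agent goes offline; later messages are held in its mailbox."""
        self._online.pop(address, None)
```

The sender's code path is identical in both states, which is the point of requirement 4. Ordering, acknowledgements, redelivery after a handler failure, and durable storage are all left out of this sketch.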

The current state: A2A and ACP get part of point 2 (semantics), not point 1. NATS JetStream gets point 1, not points 2 or 3. The tooling layer gets points 2 and 3, but only on a local filesystem: no network-level delivery (point 1) and no real-time push (point 4).

No solution covers all four.


IV. Real Need or Invented Problem?

A critical question: is Agent async communication a genuine need or an overhyped one?

From adoption scale: Gartner predicts that 40% of enterprise applications will integrate AI Agents by 2026. McKinsey reports that multi-Agent systems deliver 3x the ROI of single-Agent systems. When Google launched A2A, it had 50+ enterprise partners including Salesforce, ServiceNow, and SAP — companies running multi-Agent systems in production, for whom Agent communication is a real engineering problem.

From engineering practice: Multiple independent Agent mailbox implementations have appeared on GitHub (mcp_agent_mail, agent-message-queue, agenticmail), all built in 2025–2026 by developers solving their own real problems. When people independently build the same thing, it's direct evidence of a genuine pain point.

From architectural evolution: AWS, Google, and Microsoft all published official documentation and architecture guides in 2025 on "how to use message queues for Agent async communication." Cloud providers don't write docs for fake problems.

The reality of the need is not in question. But there's one thing worth being honest about: "good enough but inconvenient" doesn't necessarily drive developers to actively seek new solutions — unless the new solution is orders of magnitude better, or the maintenance cost of the stitched-together solution reaches a breaking point.

The growth in Agent system scale will accelerate that breaking point. When a system runs dozens of Agents and the custom Redis mailbox starts producing bugs; when edge device Agents start going offline and the polling solution starts missing messages — that's when demand for a purpose-built solution will genuinely explode. That window is opening, but it hasn't fully opened yet.


V. Conclusion

This is a real need, not an invented one.

It can be validated from three angles:

First, adoption scale. Gartner predicts 40% of enterprise applications will integrate AI Agents by 2026. Google launched A2A with 50+ enterprise partners. AWS, Google, and Microsoft all published official architecture guides on Agent async communication in 2025. Cloud providers don't write docs for fake problems; large companies don't put their names behind fake problems.

Second, engineering practice. Multiple independent Agent mailbox implementations (mcp_agent_mail, agent-message-queue, agenticmail) emerged on GitHub in 2025–2026. Developers independently building the same thing is the most direct evidence that the pain point is real.

Third, solution limitations. No current approach simultaneously satisfies: native store-and-forward offline delivery, Agent-native addressing semantics, lightweight out-of-the-box usability, and dual-track real-time push plus persistence. A2A and ACP have made progress at the semantic layer but rely on webhook stitching for the transport. NATS JetStream has sufficient transport-layer capability but requires every team to wrap Agent semantics themselves. Tooling solutions have mailbox semantics but no network-level broker. Three layers, each solving a piece; the space between them is a real engineering gap.

One thing to acknowledge honestly: "good enough but inconvenient" doesn't automatically drive teams to switch. Most teams today are making do with Redis/SQS/NATS patchwork, and it holds in the short term. As Agent system scale grows, the maintenance cost of stitched solutions will rise to a breaking point — that's when demand for a dedicated solution will genuinely explode. The window is opening, but it isn't fully open yet.

One dynamic to watch: A2A's transport layer evolution. Today A2A relies on webhooks and polling for async — an obvious weakness. If the A2A community adds native store-and-forward capability at the protocol level, this gap gets filled. There's no sign of that happening today, but it's worth watching over the next 12–24 months.

Summary: The problem exists. The gap is real. The need is not manufactured. This is an engineering problem that the industry is currently routing around with temporary solutions, not yet solved head-on in a systematic way.
