mq9 as a Service Registry: Some Thoughts
I've been going back and forth on whether mq9 should build a service registry. My conclusion has shifted with each round of thinking, so this post lays out that process and keeps it on record.
Where the Question Comes From
mq9's goal is to be the most usable middleware for Agent async communication. By async communication, I mean the sender and receiver don't need to be online at the same time: messages are stored when sent and processed when the other side comes online, which feels more like a mail system. If both parties have to be online simultaneously, that's real-time communication. Async is a fundamental need in the Agent era, and it's what mq9 is positioned for.
The dominant protocol for Agent-to-Agent communication right now is A2A. mq9 doesn't replace A2A — it carries it. A2A defines the protocol content (Task, Message, Artifact, state machines), and mq9 is the transport layer. The specific design was covered in mq9 Carrying A2A: Some Design Thoughts — the core idea is using a mailbox abstraction plus a URI scheme (mq9://) so the A2A protocol itself doesn't need a single line changed.
But once you try to run the full communication path end-to-end, you hit a problem that can't be worked around: if Agent A wants to send an A2A message to Agent B, how does A know B's mailbox address?
The A2A protocol itself is aware of this. The spec requires every Agent to expose /.well-known/agent.json to publish its AgentCard publicly, but it also acknowledges this only solves "how to read an Agent Card once you know the domain" — not "how to find the domain in the first place." From the spec directly:
The mechanism for this registration is outside the scope of the A2A protocol itself. The current A2A specification does not prescribe a standard API for curated registries.
The A2A working group explicitly left this to the ecosystem.
That gap is what mq9 needs to answer. Without a solution, mq9's support for A2A communication is a half-finished job: the transport layer is complete, but users have to solve "finding the other party" before they can even use it. How smooth that discovery step is determines whether the mq9 end-to-end path feels smooth or not.
Round One: Don't Build It
My earliest judgment was to not build it. The reasoning was straightforward: infrastructure projects are most vulnerable to scope creep. There are already mature solutions — etcd, Consul, Nacos, ZooKeeper. Reinventing the wheel in mq9 would be pointless. Get the communication layer right, and let users choose their own registry.
The concrete plan was to guide users toward existing registries. Agents register their mailbox address into etcd on startup, and look it up when needed. It works.
But thinking it through more carefully, this approach has a few uncomfortable rough edges for A2A scenarios.
Why etcd Isn't the Answer
I went and looked at the design assumptions behind etcd, Consul, and similar registries, and found they're solving a different problem.
The core problem for etcd and its kind is: "I need to call service X — which IP:Port is it deployed on, and which replicas are healthy?" A few characteristics of that design:
- Strong consistency — Raft-based, guaranteeing full replica agreement
- Small data size — each record is just an IP:Port plus some tags
- High watch frequency — clients hold long connections to subscribe to changes
- Health-check oriented — the core question is "is it up or not"
The question A2A Agents need answered is different. When Agent A is looking for Agent B to send an A2A task, it's not asking "what is B's IP:Port?" — it's asking "which Agent can handle task X?" That's a capability-matching problem, not an address lookup problem.
Getting concrete about the data:
- What gets registered isn't an IP:Port but an A2A AgentCard — a rich capability description with name, description, skills, examples, tags, etc.
- Strong consistency isn't needed; eventual consistency is fine (registration info doesn't change frequently)
- High-frequency watching isn't needed; on-demand queries are sufficient
- The granularity of health checks is also different — an Agent being "online" and "able to do this thing" are two separate questions
The more important difference is query pattern. etcd's queries are precise — the key path is explicit, and you either find it or you don't. Agent discovery in A2A scenarios is inherently fuzzy — "I'm looking for a translation assistant," described in natural language, matched by semantic similarity. The skills.description field in an AgentCard is natural language text, and the A2A protocol actively encourages Agents to describe their capabilities in natural language.
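To make the contrast concrete, here is a minimal sketch of the two query styles. It rests on assumptions: a `HashMap` stands in for an etcd-style key space, hand-written vectors stand in for real embeddings, and the function names are hypothetical rather than part of mq9 or etcd.

```rust
use std::collections::HashMap;

/// Exact lookup, the etcd-style query: the key either exists or it does not.
fn exact_lookup<'a>(kv: &'a HashMap<String, String>, key: &str) -> Option<&'a str> {
    kv.get(key).map(|v| v.as_str())
}

/// Cosine similarity between two equal-length vectors (0.0 if either is all zeros).
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Fuzzy lookup: score every candidate against the query vector and keep the best `limit`.
/// There is no "not found" here, only better and worse matches.
fn semantic_lookup<'a>(
    index: &'a [(String, Vec<f32>)],
    query: &[f32],
    limit: usize,
) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&str, f32)> = index
        .iter()
        .map(|(name, vector)| (name.as_str(), cosine(query, vector)))
        .collect();
    // Highest similarity first.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(limit);
    scored
}
```

An exact lookup for a slightly different key returns nothing, while the semantic lookup still ranks the closest capability first. That is the behavior a "find me a translation assistant" query needs.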
These two types of systems differ in storage engine, consistency model, and query pattern. You can store AgentCards in etcd — it'll work — but fundamentally you're using it as a KV store, wasting its strong consistency guarantees while gaining nothing on the vector retrieval side.
This means etcd is a dead end. Not because etcd is bad — it's because it's solving a different problem.
What About an Existing Vector Database?
Once I was clear that etcd wouldn't work, the next question was: what about using an existing vector database? Deploy a Qdrant instance, have Agents register into it.
Qdrant solves the retrieval problem, but it introduces several new ones:
- Another external component to operate — more operational complexity
- Qdrant is a general-purpose vector database, so you'd need to implement the A2A AgentCard registration semantics yourself (lifecycle, TTL, heartbeats, expiry cleanup, AgentCard format parsing)
- A lot of glue code between the mq9 protocol layer and Qdrant — mq9 receives an AgentCard on registration and forwards it to Qdrant; on DISCOVER, mq9 queries Qdrant to get IDs and then returns the original AgentCard
- The developer experience is fragmented: go to Qdrant to find an Agent, get the mailbox address, switch back to the mq9 client to send the message
That last point matters most. The A2A communication path should flow seamlessly from discovery to sending. Splitting discovery and communication into two separate components, with a mapping step required each time, is hostile to developers.
At this point I started reappraising the "mq9 builds its own registry" option.
What Is the Industry Doing?
Judging whether something is worth building requires looking at what others are doing.
The A2A working group's GitHub Discussion #741 has been running for several months. The core questions the community is debating haven't converged: centralized directory vs. distributed Agent Card, whether to standardize metadata, how to implement trust models, whether search should use metadata filtering or vector retrieval. My read is that the A2A spec won't produce a registry standard in the near term.
Existing open-source projects in the space:
- awslabs/a2a-agent-registry-on-aws: S3 Vectors + Bedrock implementing a semantic registry for A2A AgentCards — tightly AWS-bound
- Agent Name Service (ANS): IETF draft, targeting "DNS for AI Agents," PKI certificate mechanism, positioned as a formal standard
- Solo.io's agentregistry: Commercial-grade governance platform, focused on approval workflows and compliance
- agentregistry-dev/agentregistry: More of a unified catalog for AI artifacts
What these all have in common: they're either bound to a specific cloud platform, aimed at comprehensive enterprise governance, or working toward a formal standard. Nobody is building something lightweight, embeddable, and paired with an A2A communication channel.
That's a gap. But does a gap automatically mean you should fill it? There are a few more questions worth pushing on.
Questions Worth Pressing
First question: is A2A Agent discovery actually a real problem?
If only a handful of people run into it, building something won't get any traction. I looked at real deployment cases from early A2A adopters (LangChain's A2A adapter, ServiceNow's Agent platform, etc.) and found that "how do Agents find other Agents" is a problem nearly every team has to solve. The current ad-hoc approaches in the wild:
- Hardcode AgentCard URLs in client code
- Maintain an Agent list in a Notion document
- Internal wiki with a diagram
- Home-rolled Redis to maintain a mailbox mapping
None of these are elegant, but they're all in use. That tells you the problem is real — there just isn't a good tool for it yet.
Second question: does mq9 have a unique value proposition here?
If what mq9 produces is indistinguishable from Qdrant plus a custom schema, then there's no point. But mq9 has one unique angle here: it lives in the same component as the A2A communication layer.
Compare the two paths through the A2A communication flow:
- Separate approach: query Qdrant for AgentCard → parse out mailbox address → switch to mq9 client → send A2A message
- Integrated approach: query mq9 for Agent → mq9 client sends A2A message directly
The second eliminates the "switch" and the "mapping." In multi-turn collaboration, task delegation, and multi-Agent scenarios, that friction recurs with every new Agent discovery, every task routing decision, and every capability query. It accumulates into a significant difference in developer experience.
The deeper synergy is metadata sharing. mq9 knows an Agent's registration state (online/offline/overloaded), and can use that during communication — for example, when an Agent is offline, messages enter persistent storage (a natural mailbox capability), and are delivered in full when the Agent comes back online (which is mq9's core use case). If the registry and the communication layer are separate, the registry knows "Agent is offline" but the communication layer doesn't, so this kind of coordination becomes impossible.
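A minimal sketch of that coordination, assuming a toy in-memory broker where registration state and mailboxes live in the same struct. `Broker`, `AgentState`, and `Delivery` are hypothetical names for illustration, not mq9's actual types.

```rust
use std::collections::{HashMap, VecDeque};

#[derive(Debug, Clone, Copy, PartialEq)]
enum AgentState {
    Online,
    Offline,
}

#[derive(Debug, PartialEq)]
enum Delivery {
    Delivered,
    Stored,
}

/// Toy broker: registry state and mailboxes are kept side by side.
#[derive(Default)]
struct Broker {
    states: HashMap<String, AgentState>,
    mailboxes: HashMap<String, VecDeque<String>>,
}

impl Broker {
    /// Record the state the registry currently knows for an agent.
    fn set_state(&mut self, agent: &str, state: AgentState) {
        self.states.insert(agent.to_string(), state);
    }

    /// Deliver immediately if the registry says the agent is online;
    /// otherwise (offline or unknown) hold the message in its mailbox.
    fn send(&mut self, agent: &str, msg: &str) -> Delivery {
        match self.states.get(agent) {
            Some(AgentState::Online) => Delivery::Delivered,
            _ => {
                self.mailboxes
                    .entry(agent.to_string())
                    .or_default()
                    .push_back(msg.to_string());
                Delivery::Stored
            }
        }
    }

    /// Mark the agent online and drain everything stored while it was away, in arrival order.
    fn come_online(&mut self, agent: &str) -> Vec<String> {
        self.set_state(agent, AgentState::Online);
        self.mailboxes
            .remove(agent)
            .map(|queue| queue.into_iter().collect())
            .unwrap_or_default()
    }
}
```

Because `send` reads the same state the registry writes, no cross-component call is needed to decide between delivering and storing.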
Third question: how do you keep scope from creeping?
The earlier worry about scope creep was valid — if mq9 says "we're building the best registry," that means continuous investment in vector algorithms, query performance, cross-cluster consistency, and so on indefinitely.
The reframe that resolves this:
mq9 is not building the best registry. mq9 is building the registry that makes the A2A communication path work smoothly.
These two positions look similar but the difference is enormous. The first requires pursuing absolute best-in-class. The second only needs "good enough and smooth for A2A." The second has bounded scope.
The key to drawing that boundary: the mq9 registry serves the A2A async communication scenario and does not try to be a general-purpose AI registry. Every design decision comes back to one test: "does this make A2A communication smoother?" If yes, build it. If not, don't.
The Kafka Analogy
At this point in my thinking, I noticed this pattern has precedent in infrastructure.
Around Kafka you get Schema Registry (from Confluent), Kafka Connect, and Kafka Streams. What are those for?
- Schema registry isn't trying to compete with Avro tooling. It's there so Kafka users writing producers and consumers have a ready-made schema management solution.
- Kafka Connect isn't trying to compete with Airbyte or Fivetran in the data integration market. It's there so Kafka users can connect downstream systems without writing code each time.
- Kafka Streams isn't trying to compete with Flink in stream processing. It's there so Kafka users can do lightweight stream processing with something ready to hand.
Each one looks like it's "reinventing the wheel": there are more specialized products in schema management, data integration, and stream processing. But Confluent and the Kafka community built all of them anyway, and Kafka's overall usability improved so much that it became the de facto standard.
Why? Because the value of infrastructure isn't just "most powerful feature set" — it's "smoothest to use." Companion components don't need to be the strongest in their category; they need to be the smoothest when combined with the core component.
Applying this logic to mq9:
- mq9's async communication capability is the core — the goal is to be the most usable middleware for Agent async communication
- The primary protocol it carries right now is A2A
- The registry is a companion — it makes the A2A communication path complete
- A companion doesn't need to beat etcd or Qdrant; it needs to make A2A Agent discovery + communication the smoothest experience possible
- When users have more demanding needs (larger scale, higher precision), they can plug in external components
On the product side, this thinking lands as: mq9 ships a built-in, good-enough A2A AgentCard registry, while supporting plugin integration with external vector engines and embedding models.
What "Good Enough" Actually Means
Once the positioning is clear, the next question is what "good enough" concretely means. Vague words don't ship.
I tried to give "good enough" a quantified definition, scoped to A2A communication scenarios:
- Agent count: 10,000 active Agents per single broker
- Query latency: tag lookup <10ms, semantic retrieval <100ms (p99)
- Index scale: 1 million skill vectors
- Concurrent queries: 1,000 QPS
- Embedding quality: sufficient for "match AgentCard capability descriptions" — not targeting "complex intent understanding"
These numbers correspond to the typical scale of internal A2A deployments at enterprises. A mid-to-large company deploying dozens to hundreds of Agents internally is well within what mq9's built-in registry can handle.
What about beyond that range? Document it clearly:
- Agents exceed 10,000 → recommend external Qdrant cluster
- Need complex intent understanding → recommend external LLM-powered query understanding
- Need full-text + vector hybrid search → recommend Elasticsearch with vector plugin
- Cross-enterprise Agent discovery → wait for formal standards like ANS
Being explicit about what you're not good at isn't a weakness — it's honesty. Users see those numbers and immediately know whether their scenario fits. That's far healthier than "we can do everything."
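The documented boundaries can also be expressed as a small check. This is a sketch under assumptions: the `Deployment` struct and its field names are hypothetical, and only the four out-of-range cases listed above are encoded.

```rust
/// Agent-count ceiling for the built-in registry, taken from the numbers above.
const MAX_BUILTIN_AGENTS: u64 = 10_000;

/// Hypothetical description of a deployment; the field names are illustrative only.
#[derive(Debug, Default)]
struct Deployment {
    active_agents: u64,
    needs_intent_understanding: bool,
    needs_hybrid_search: bool,
    cross_enterprise: bool,
}

/// Returns the external options the post recommends for this deployment.
/// An empty list means the built-in registry is expected to be enough.
fn recommendations(d: &Deployment) -> Vec<&'static str> {
    let mut out = Vec::new();
    if d.active_agents > MAX_BUILTIN_AGENTS {
        out.push("external Qdrant cluster");
    }
    if d.needs_intent_understanding {
        out.push("external LLM-powered query understanding");
    }
    if d.needs_hybrid_search {
        out.push("Elasticsearch with vector plugin");
    }
    if d.cross_enterprise {
        out.push("wait for formal standards like ANS");
    }
    out
}
```

A company running a few hundred Agents with none of the special needs gets an empty list back, which is the "fits the built-in" answer.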
Technology Choices
With the positioning settled, next is technology selection. The requirements are clear:
- Embedded (no new process to run)
- Fully local (no external API calls)
- Rust ecosystem (consistent with RobustMQ)
- Good enough (not chasing maximum performance)
For the vector database: LanceDB — a pure Rust embedded vector database, the "SQLite of vector databases." Add it to Cargo.toml and data lives in a local directory.
For the embedding model: fastembed-rs — a pure Rust embedding library, ships with mainstream open-source models (BGE, Jina, E5), runs CPU inference via ONNX Runtime. Downloads the model on first startup, then runs entirely locally. No external API calls.
The combination means mq9's built-in registry is fully local, zero external dependencies, pure Rust — consistent with RobustMQ's overall philosophy of "single binary, zero external dependencies," just with a binary that's a few hundred MB larger (if models are bundled) or an extra local data directory.
How to Vectorize A2A AgentCards
Since mq9's registry stores A2A AgentCards, the vectorization strategy needs to be designed around the AgentCard structure.
Fields in an A2A AgentCard worth vectorizing:
- description (top-level): overall Agent capability description
- skills[].name: skill name
- skills[].description: detailed skill description
- skills[].examples: skill usage examples
Vectorization granularity: one vector per skill. The reasoning:
- The A2A protocol designed the skills field specifically so each Agent capability can be referenced individually
- User queries are typically task-driven ("I need translation," "I need code written"), which naturally maps to skill granularity
- Combining all skills into a single vector dilutes the semantics of individual capabilities
Concretely in the LanceDB schema, one row per skill, storing:
- Agent identifiers (agent_id, mailbox)
- Skill content (id, name, description, tags, examples)
- Complete raw AgentCard JSON (returned as-is on DISCOVER)
- Vector (embedding of the concatenated skill text)
- Timestamps and TTL
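A sketch of the one-row-per-skill expansion described above, under stated assumptions: the struct layouts are illustrative, the vector, timestamp, and TTL columns are left out, and joining name, description, and examples with newlines is just one plausible way to build the text that would be embedded.

```rust
/// Subset of an A2A skill entry used for indexing.
#[derive(Debug, Clone)]
struct Skill {
    id: String,
    name: String,
    description: String,
    tags: Vec<String>,
    examples: Vec<String>,
}

/// One row in the skill table (vector, timestamps, and TTL omitted in this sketch).
#[derive(Debug, Clone)]
struct SkillRow {
    agent_id: String,
    mailbox: String,
    skill: Skill,
    /// Complete AgentCard JSON, stored untouched so DISCOVER can return it as-is.
    raw_card: String,
    /// Concatenated skill text that the embedder would turn into a vector.
    text: String,
}

/// Concatenate the fields worth vectorizing for a single skill.
fn skill_text(skill: &Skill) -> String {
    let mut parts = vec![skill.name.clone(), skill.description.clone()];
    parts.extend(skill.examples.iter().cloned());
    parts.join("\n")
}

/// Expand one registration into one row per skill.
fn rows_for_card(agent_id: &str, mailbox: &str, skills: &[Skill], raw_card: &str) -> Vec<SkillRow> {
    skills
        .iter()
        .map(|skill| SkillRow {
            agent_id: agent_id.to_string(),
            mailbox: mailbox.to_string(),
            skill: skill.clone(),
            raw_card: raw_card.to_string(),
            text: skill_text(skill),
        })
        .collect()
}
```

Every row repeats the raw card, which costs some storage but lets a hit on any skill return the full AgentCard without a second lookup.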
The DISCOVER interface supports both tag-based retrieval and semantic retrieval. Tag lookup requires zero dependencies (no embedding needed); semantic retrieval uses vector matching. Both paths coexist in the same interface.
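Here is a rough sketch of how both paths could sit behind one call. Everything in it is an assumption made for illustration: the `Discover` enum, the row layout, pre-normalized vectors scored with a plain dot product, and the choice to collapse multiple skill hits into one card per Agent.

```rust
use std::collections::HashSet;

/// Minimal per-skill row for this sketch.
struct SkillRow {
    agent_id: String,
    tags: Vec<String>,
    vector: Vec<f32>,
    raw_card: String,
}

/// The two query styles behind a single DISCOVER entry point.
enum Discover {
    /// Exact tag match; needs no embedding at all.
    Tag(String),
    /// Nearest-neighbor match against an already-embedded query.
    Semantic { query_vector: Vec<f32>, limit: usize },
}

/// Dot product; equals cosine similarity when both vectors are unit length.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Returns raw AgentCards, at most one per agent, best match first.
fn discover<'a>(rows: &'a [SkillRow], query: &Discover) -> Vec<&'a str> {
    let ranked: Vec<&SkillRow> = match query {
        Discover::Tag(tag) => rows
            .iter()
            .filter(|r| r.tags.iter().any(|t| t == tag))
            .collect(),
        Discover::Semantic { query_vector, limit } => {
            let mut scored: Vec<(&SkillRow, f32)> = rows
                .iter()
                .map(|r| (r, dot(&r.vector, query_vector)))
                .collect();
            scored.sort_by(|a, b| b.1.total_cmp(&a.1));
            scored.into_iter().take(*limit).map(|(r, _)| r).collect()
        }
    };
    // Rows are per skill, so several hits can belong to the same agent:
    // keep only the first (best-ranked) hit for each agent.
    let mut seen = HashSet::new();
    ranked
        .into_iter()
        .filter(|r| seen.insert(r.agent_id.as_str()))
        .map(|r| r.raw_card.as_str())
        .collect()
}
```

The tag branch never touches a vector, which is what lets tag lookup keep working even when no embedding model is configured.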
The registry is intentionally aware of AgentCard structure — this "semi-parsing" is necessary. mq9's mailbox communication layer doesn't parse message content (business messages), but the registry must parse AgentCards to build the index. This is a consistent tradeoff: business messages are the user's business, mq9 shouldn't look at them; AgentCards are capability descriptions written for mq9, mq9 must parse them.
The Boundary Between Built-in and Replaceable
Next, the boundary between built-in and replaceable needs to be drawn clearly, otherwise the plugin architecture will become a mess.
Built into mq9 (protocol layer, not replaceable):
- Protocol layer (REGISTER, DISCOVER, UNREGISTER, REPORT)
- Agent lifecycle management (registration, heartbeat, deregistration, TTL expiry cleanup)
- raw_card storage (what gets stored on registration, what gets returned on DISCOVER)
- Tag-based exact retrieval (baseline capability, no vector dependency)
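The lifecycle part of that list (registration, heartbeat, deregistration, TTL expiry cleanup) can be sketched as a lease table. This is a simplified illustration, not mq9's implementation: `Lifecycle` and `Lease` are hypothetical names, and the current time is passed in explicitly so the logic is easy to test.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// A registration lease: the agent stays listed until `expires_at` unless it renews.
struct Lease {
    ttl: Duration,
    expires_at: Instant,
}

#[derive(Default)]
struct Lifecycle {
    leases: HashMap<String, Lease>,
}

impl Lifecycle {
    /// Register (or re-register) an agent with a TTL measured from `now`.
    fn register(&mut self, agent_id: &str, ttl: Duration, now: Instant) {
        self.leases
            .insert(agent_id.to_string(), Lease { ttl, expires_at: now + ttl });
    }

    /// Renew a live lease. Returns false for unknown or already-expired agents.
    fn heartbeat(&mut self, agent_id: &str, now: Instant) -> bool {
        if let Some(lease) = self.leases.get_mut(agent_id) {
            if lease.expires_at > now {
                lease.expires_at = now + lease.ttl;
                return true;
            }
        }
        false
    }

    /// Explicit deregistration. Returns true if the agent was registered.
    fn unregister(&mut self, agent_id: &str) -> bool {
        self.leases.remove(agent_id).is_some()
    }

    /// Remove every lease that has expired by `now`; returns the affected ids, sorted.
    fn sweep_expired(&mut self, now: Instant) -> Vec<String> {
        let mut expired: Vec<String> = self
            .leases
            .iter()
            .filter(|(_, lease)| lease.expires_at <= now)
            .map(|(id, _)| id.clone())
            .collect();
        expired.sort();
        for id in &expired {
            self.leases.remove(id);
        }
        expired
    }
}
```

None of this touches vectors, which is why it can stay inside mq9 regardless of which vector backend is plugged in.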
Replaceable via plugin (data plane backend):
- Vector engine (default LanceDB, can swap to Qdrant, Weaviate, Milvus)
- Embedding model (default fastembed, can swap to OpenAI, Cohere, Bedrock)
In terms of traits:
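```rust
// `Record`, `Filter`, and the single-parameter `Result` alias are assumed to be defined elsewhere in mq9.
trait VectorStore {
    /// Insert or update records.
    async fn upsert(&self, records: Vec<Record>) -> Result<()>;
    /// Return up to `limit` records nearest to `vector`, optionally filtered.
    async fn query(&self, vector: &[f32], filter: Option<Filter>, limit: usize) -> Result<Vec<Record>>;
    /// Delete every record matching `filter`.
    async fn delete(&self, filter: Filter) -> Result<()>;
}
trait Embedder {
    /// Embed a batch of texts, returning one vector per input.
    async fn embed(&self, texts: &[&str]) -> Result<Vec<Vec<f32>>>;
}
```

mq9 ships LanceDBStore and FastembedEmbedder implementations built in. Users can write QdrantStore, OpenAIEmbedder, and so on. The config file specifies which to use: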
```yaml
mq9:
  registry:
    vector_store:
      type: lancedb    # or qdrant, weaviate, external
    embedder:
      type: fastembed  # or openai, cohere, external
```

The key judgment in drawing this boundary: the vector plugin plays exactly one role, that of "semantic matching engine." Swapping the plugin doesn't lose any protocol-layer semantics or lifecycle management logic. A2A AgentCard format parsing, the Agent registration state machine, and TTL renewal are all managed by mq9 itself.
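One way the `type` values from that config could be validated before any backend is constructed, sketched with `FromStr`. The enum names and error messages are assumptions for illustration; the accepted strings are the ones listed in the config comments.

```rust
use std::str::FromStr;

/// Vector store backends named in the config comments.
#[derive(Debug, PartialEq)]
enum VectorStoreKind {
    LanceDb,
    Qdrant,
    Weaviate,
    External,
}

/// Embedder backends named in the config comments.
#[derive(Debug, PartialEq)]
enum EmbedderKind {
    Fastembed,
    OpenAi,
    Cohere,
    External,
}

impl FromStr for VectorStoreKind {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "lancedb" => Ok(Self::LanceDb),
            "qdrant" => Ok(Self::Qdrant),
            "weaviate" => Ok(Self::Weaviate),
            "external" => Ok(Self::External),
            other => Err(format!("unknown vector_store type: {other}")),
        }
    }
}

impl FromStr for EmbedderKind {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "fastembed" => Ok(Self::Fastembed),
            "openai" => Ok(Self::OpenAi),
            "cohere" => Ok(Self::Cohere),
            "external" => Ok(Self::External),
            other => Err(format!("unknown embedder type: {other}")),
        }
    }
}
```

Rejecting unknown values at load time keeps a typo in the config from silently falling back to the default backend.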
Tradeoffs for the Default Implementation
Should the built-in implementation be pushed to be as strong as possible?
My judgment: no. The reasoning:
- What users expect from the built-in implementation is "works, stable, doesn't break"
- Users who genuinely need performance or precision will switch to an external plugin
- Optimizing the default implementation to be best-in-class would consume massive tuning effort (quantization, index parameters, model comparison) — that's in conflict with the "good enough" positioning
Concrete strategy:
- Embedding model: BGE-small-en-v1.5 (130MB), not BGE-M3 (2GB)
- Vector index: none by default (queries fall back to a brute-force scan); build IVF_FLAT automatically once the data passes a threshold; no IVF_PQ
- No quantization by default; retain f32 precision
- Simple LRU cache for query vectors; no complex cache strategy
The default implementation is responsible for a "passing grade of 60." Let the ecosystem handle the rest. If someone needs 80 or 90, they know how to swap in a plugin.
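As an illustration of how simple "simple LRU cache for query vectors" can be, here is a sketch built only on the standard library. `QueryVectorCache` is a hypothetical name, the embedder is passed in as a closure, and the linear-time recency update is a deliberate shortcut that suits small caches.

```rust
use std::collections::{HashMap, VecDeque};

/// Small LRU cache from query text to its embedding.
struct QueryVectorCache {
    capacity: usize,
    map: HashMap<String, Vec<f32>>,
    /// Least recently used key at the front, most recently used at the back.
    order: VecDeque<String>,
}

impl QueryVectorCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, map: HashMap::new(), order: VecDeque::new() }
    }

    /// Return the cached vector for `query`, or compute it with `embed` and cache it.
    fn get_or_embed(&mut self, query: &str, embed: impl FnOnce(&str) -> Vec<f32>) -> Vec<f32> {
        if let Some(hit) = self.map.get(query) {
            let hit = hit.clone();
            self.touch(query);
            return hit;
        }
        let vector = embed(query);
        if self.capacity == 0 {
            return vector; // caching disabled
        }
        if self.map.len() >= self.capacity {
            // Evict the least recently used entry.
            if let Some(oldest) = self.order.pop_front() {
                self.map.remove(&oldest);
            }
        }
        self.map.insert(query.to_string(), vector.clone());
        self.order.push_back(query.to_string());
        vector
    }

    /// Move `query` to the most-recently-used position (linear scan).
    fn touch(&mut self, query: &str) {
        if let Some(pos) = self.order.iter().position(|q| q == query) {
            if let Some(key) = self.order.remove(pos) {
                self.order.push_back(key);
            }
        }
    }
}
```

A repeated query skips the embedding call entirely, which matters because CPU inference is usually the slowest step in semantic retrieval.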
A Few Hard Boundaries
It's clear what mq9's registry will do. It's equally important to be explicit about what it won't do — boundaries matter more to a product's fate than capabilities.
No Agent orchestration. The registry solves "discovery," not "coordination." How multiple Agents collaborate to complete an A2A Task is the domain of LangGraph, AutoGen, CrewAI. mq9 doesn't enter that space.
No Agent governance. Which Agents are allowed to register, whether AgentCards are truthful, whether Agent behavior is compliant — those are Agent governance problems. mq9 doesn't solve them. That's the domain of commercial platforms like Solo.io.
No LLM capabilities. mq9 uses embeddings for semantic matching, but mq9 itself is not an LLM, does not do intent understanding, does not do dialogue. Embeddings are a basic tool, not core intelligence.
No cross-enterprise Agent discovery. Cross-enterprise involves establishing trust, billing, SLA — a whole different tier of problems. Let formal standards like ANS handle that. mq9 stays scoped to intra-enterprise A2A communication.
No registry schema standards outside A2A. The current mq9 registry serves A2A AgentCards. If MCP extends Agent-to-Agent communication in the future, or a new protocol emerges, mq9's registry can extend to support it — but mq9 won't define the registration schema standard. Schema follows the upper-layer protocol.
With these boundaries drawn, mq9's position is clear: middleware for Agent async communication, with a registry that exists to make the A2A communication path complete — not to expand upward. That kind of restraint is what gives infrastructure products longevity.
Current Judgment
Working through all of the above, my answer to "should mq9 build a registry" has changed:
Yes. But built as "companion to A2A communication, good enough, replaceable with external components" — not as "building the best AI registry."
These two approaches look similar on the surface, but they produce very different products. The first leads mq9 to keep expanding into AI platform territory with scope spinning out of control. The second keeps mq9 scoped to the Agent async communication middleware layer — sustainable long-term.
mq9's overall goal is clear: the most usable middleware for Agent async communication. Currently carrying A2A primarily, with other protocols to follow. The registry is a necessary companion to that goal, not an independent pursuit.
Concrete path forward:
- PoC: implement a minimal registry using fastembed-rs + LanceDB, validate that the AgentCard registration and semantic discovery pipeline runs end-to-end
- Protocolize: write AGENT.REGISTER, AGENT.DISCOVER, and related interfaces into the mq9 protocol, reflecting the A2A AgentCard structure
- Plugin abstraction: define the VectorStore and Embedder traits, with LanceDB and fastembed as built-in default implementations
- Documentation: clearly state the "good enough" boundary numbers, covering which scenarios call for external integration and which the mq9 built-in handles fine
Whether this is ultimately worth building comes down to user feedback. The current judgment is yes — but that needs to be confirmed once the PoC is running and actual feedback comes in.
