Infrastructure for the Agent Era: Reflections After Reading Two Articles by Huang Dongxu
I recently read two articles by Huang Dongxu — "How to Build Infrastructure That AI Agents Love" and "A Formal Introduction to My Recent Work: db9.ai." Together, these two pieces represent the deepest and most practically grounded thinking on Agent-era infrastructure that I have come across.
Huang Dongxu holds a unique position: he is the co-founder of TiDB with over ten years of experience building database infrastructure, and he has direct visibility into large amounts of production data on how Agents actually use databases. What he says is not prediction; it is what is already happening.
After reading them, I kept returning to one question: if these judgments are correct, what should RobustMQ look like?
This article is a record of my thinking.
A Shift That Is Already Happening
Huang Dongxu observed a data point: on TiDB Cloud, more than 90% of newly created clusters each day are created directly by AI Agents. This is not a future trend — it is today's reality.
The primary users of infrastructure software are shifting from human developers to AI Agents.
This shift looks gradual, but its impact on infrastructure design is fundamental. Human developers and AI Agents use software in completely different ways — different usage frequencies, different resource consumption patterns, different tolerance for errors, different learning capabilities. If you keep building infrastructure on the assumption that it is designed for humans, you will eventually hit a wall.
The question is not "whether to support Agents," but "when Agents become the primary users, what should infrastructure fundamentally look like?"
Mental Models: Agents Prefer Systems They Already Understand
Huang Dongxu's first core judgment is: when the user shifts from human to AI, what software truly exposes to the user is no longer UI and API, but the mental model behind it.
During training, LLMs were exposed to enormous amounts of code and engineering practice — countless repeated abstractions, repeated patterns, and repeated choices. These repetitions solidified into extremely strong priors: SQL, the file system, POSIX, Bash, the Kafka API. These things have not changed fundamentally in decades and have been deeply baked into the models.
Agents are not waiting for a smarter, more powerful system. They prefer "systems they already understand," and then extend those systems with glue code at a thousand times the efficiency of a human.
This judgment has an important corollary: inventing new abstractions is dangerous. New abstractions require a learning cost — already a burden for humans, even more so for Agents. An Agent's preferences are entirely determined by its training data, and a new concept that barely appears in the training corpus will be very awkward for the Agent to use, with a high error rate.
He is not optimistic about new frameworks like LangChain, for exactly this reason: they are too new. Even programmers are reluctant to learn them, let alone AI.
Good mental models must also be extensible. Take Linux VFS as an example — it allows you to introduce entirely new implementations without breaking the existing mental model. The semantics of cp, ls, and grep remain unchanged, but the underlying implementation can be anything: object storage, vector indexes, network file systems. The interface is stable; the implementation is replaceable. When Agents evolve systems at a thousand times the human rate, this design of "stable constraints + open extension" is especially important.
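The "stable interface, replaceable implementation" pattern is easy to see in code. Here is a minimal Python sketch; the `FileLike` interface and `InMemoryFS` backend are hypothetical names for illustration, not real VFS code. A tool like `cp` is written once against the interface, and the backend underneath can be swapped for object storage, a vector index, or anything else:

```python
from typing import Protocol

class FileLike(Protocol):
    """The stable contract: callers depend only on these three operations."""
    def read(self, path: str) -> bytes: ...
    def write(self, path: str, data: bytes) -> None: ...
    def list(self, prefix: str) -> list[str]: ...

class InMemoryFS:
    """One interchangeable implementation; could equally be object storage."""
    def __init__(self) -> None:
        self._files: dict[str, bytes] = {}
    def read(self, path: str) -> bytes:
        return self._files[path]
    def write(self, path: str, data: bytes) -> None:
        self._files[path] = data
    def list(self, prefix: str) -> list[str]:
        return sorted(p for p in self._files if p.startswith(prefix))

def copy(fs: FileLike, src: str, dst: str) -> None:
    """'cp' written once against the interface; works on any backend."""
    fs.write(dst, fs.read(src))

fs = InMemoryFS()
fs.write("/tmp/a.txt", b"hello")
copy(fs, "/tmp/a.txt", "/tmp/b.txt")
```

The interface stays stable while implementations come and go, which is exactly what lets an Agent keep its existing mental model while the system underneath evolves.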
Interface Design: Natural Language for Exploration, Symbols for Convergence
A good Agent-friendly interface must simultaneously satisfy three conditions: it can be described in natural language, it can be solidified in symbolic logic, and it can deliver deterministic results.
Natural language describes intent — the core here is not "whether you support natural language input," but whether your interface is itself suited to being expressed in natural language. Graphical interfaces are hard to describe accurately in natural language; command lines, APIs, and SQL are naturally suited to it.
Symbolic logic solidifies execution — once intent is clear, it must be compressed into a definite, stable, and inferable form. Code, SQL, configuration files — once these intermediate representations are generated, they no longer depend on contextual interpretation and can be reused, audited, and automatically verified.
Huang Dongxu makes an interesting point: the best symbolic representation of logic is code. Not because it saves tokens, but because of cognitive density — code uses very few symbols to describe a process that can be executed an infinite number of times.
Natural language navigates the exploration space; symbols converge it. Between the two, there is a clear conversion point where ambiguity is fully eliminated. A well-designed system must clearly answer: at what moment does a fuzzy intent become a definite execution?
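The conversion point can be made concrete with a toy example. Here the fuzzy-to-definite step is shown as a hardcoded string (in practice an LLM would perform that compression); the point is that once the symbolic form exists, execution no longer depends on context and is deterministic, reusable, and auditable:

```python
import sqlite3

# The "conversion point": a fuzzy intent has (somehow, e.g. via an LLM)
# been compressed into a definite symbolic form. From here on, nothing
# depends on contextual interpretation: the SQL can be stored and re-run.
INTENT = "how many users signed up?"   # natural language: explores
SQL = "SELECT COUNT(*) FROM users"     # symbolic logic: converges

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER)")
db.executemany("INSERT INTO users VALUES (?)", [(1,), (2,), (3,)])
count = db.execute(SQL).fetchone()[0]  # same answer every time
```

The moment ambiguity is eliminated is the assignment to `SQL`; everything after it is pure symbolic execution.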
How Agents Work: Disposable, Parallel, Trial-and-Error
The way Agents work is fundamentally different from human developers.
Human developers carefully plan resources, meticulously maintain systems, and try to avoid destructive operations. Agents are different: they create resources very quickly, try something, discard it if it doesn't work, and start over in another direction. They run multiple branches in parallel, and once one succeeds, the rest are immediately abandoned. This trial-and-error loop is thousands of times faster than a human's.
Huang Dongxu calls this kind of workload "disposable" — instance lifetimes may be only seconds to minutes, creation and destruction frequencies are extremely high, and quantities grow explosively.
This means the traditional infrastructure assumptions must be overturned:
- Traditional assumption: a cluster is precious and needs careful maintenance
- Agent assumption: instances are cheap, lifetimes are short, and quantities are explosive
If infrastructure is still operating on traditional assumptions, Agents will be forced back into "use resources carefully" mode, and the advantages of parallel exploration and rapid trial-and-error will be completely erased.
Virtualization: Looks Like Exclusive, Actually Shared
Disposable workloads bring an unavoidable problem: it is impossible to provide a real physical instance for every Agent.
Huang Dongxu's judgment is that some form of virtualization must be introduced. Each Agent feels it has an independent environment it can do whatever it wants with, but at the resource level it is highly shared. "Looks like exclusive, actually shared" — this is not an optimization item; it is a prerequisite for supporting Agents at scale.
db9 is the practical embodiment of this idea: each Agent feels it has its own Postgres + file system, but underneath the resources are virtualized and shared, at 1/100th the cost of traditional solutions. Agents can create tables, drop tables, and run experiments without worrying about affecting anyone else.
What db9 Is: A Reference for the Storage Layer
db9 is the product Huang Dongxu built according to his own theory — his answer to "what should storage infrastructure for the Agent era look like."
The core design fuses the file system and SQL into the same storage layer. These two are the mental models Agents understand best, but they are completely separate in traditional systems. db9 bridges them: files cp-ed in can be queried directly with SQL; data written by SQL can be read directly as files. Two ancient mental models, unified storage, each interface preserved.
Key characteristics:
- Instant on: supports anonymous databases, no registration required
- Never pauses: instances are never shut down even with no traffic for extended periods
- Flexible lifecycle: can be short or long; Agents don't need to manage it manually
- Extremely low cost: virtualization supports hundreds of billions of tenants at 1/100th the cost of traditional solutions
The positioning is not "a better database," but "storage infrastructure for the Agent era."
Judgments on Future Infrastructure
Three core judgments on future infrastructure distilled from the two articles:
Invisible to Agents. Agents don't need to know about clusters, nodes, replicas, or partitions. The better the infrastructure, the less you feel its presence. Like HTTP — you send a request to a URL without needing to know what's behind it, only knowing "send it, and there will be a response."
Tiered and optional, not forced. Requirements for infrastructure vary enormously across scenarios. Good design starts lightest and upgrades on demand, with the interface remaining unchanged. Storage from memory to persistent, security from no auth to full ACL — users choose freely by scenario, and the system follows. Forcing a particular mode is a constraint on users, not a virtue in design.
Extremely low cost, massive scale. Agents make long-tail demand economically viable — things that didn't add up before are now worth doing. Infrastructure must be able to support explosive scale at this level while driving unit costs to an extreme low. This requires virtualization, multi-tenancy, and automatic lifecycle management — all three are essential.
Implications for RobustMQ
After reading these two articles, I have a clearer picture of RobustMQ. Not an adjustment in direction, but a deeper understanding of why the current direction is right — and what is still missing.
The Direction Is Right, and the Timing Is Right
MQTT, Kafka, NATS, AMQP — all four protocols are mental models that Agents have deeply internalized and that appear extensively in LLM training data. RobustMQ has not invented any new concepts: native SDKs, native commands, native semantics — learning cost is zero.
This perfectly aligns with what Huang Dongxu calls "adhering to ancient but repeatedly validated mental models." It wasn't intentional; it's because these protocols are themselves correct abstractions, validated over ten to twenty years.
RobustMQ is preparing the 2027 answer — when Agents become the primary users of communication infrastructure, what should the messaging layer look like?
Invisibility Is the Ultimate Goal
Huang Dongxu says "the better the infrastructure software, the less you feel its presence." What does this mean for RobustMQ?
Ultimately, an Agent's perception of RobustMQ should come down to just two things: an address and a message.
send(address, message)
subscribe(address, callback, mode)

How many nodes are in the cluster, what storage engine is used, which partition the message is in, whether replicas are in sync — all of this should be absorbed, invisible to the Agent.
Today's Kafka makes engineers aware of too much: partition, offset, consumer group, rebalance. MQTT makes engineers aware of QoS, retain, and will. These concepts are already complex enough for human engineers; for Agents they are unnecessary noise. RobustMQ's goal is to absorb all of this, leaving only a clean address and a message.
This is not feature reduction — it's internalizing complexity into the system rather than exposing it to users. Agents always see only the simplest interface; the system handles all the details behind the scenes.
Lightweight Is a Mandatory Answer
The disposable workloads Huang Dongxu describes pose a specific question for RobustMQ: how low is the cost of creating and destroying a topic?
A traditional Kafka topic is heavy — creation requires allocating partitions, writing metadata, assigning a leader, and syncing replicas. This was designed for the assumption that "a topic is a precious long-lived resource." Agents assume the exact opposite: channels are temporary, disposable, and may have a lifetime of just a few messages.
NATS gives the right answer: a subject is an address, not a resource. It doesn't need to be created, managed, or destroyed. Publishing makes it exist; no subscribers and it disappears. Agents can use them freely — generate a temporary subject with a job_id, let it disappear when the task is done, without ever thinking about cleanup.
RobustMQ's NATS protocol support is already heading in this direction. Going further, TTL and automatic lifecycle management need to become core capabilities, not optional features. A message's lifecycle should be bindable to business semantics — when a task is done, the related messages are automatically cleaned up, with no Agent intervention required.
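A toy sketch of "subject is an address, not a resource" combined with per-message TTL. The `EphemeralBus` here is a hypothetical in-memory illustration, not the RobustMQ API; it only shows the semantics: nothing is created up front, and a subject vanishes along with its last live message.

```python
import time

class EphemeralBus:
    """Toy pub/sub where a subject is just an address: nothing to create,
    and messages expire on their own via a per-message TTL (hypothetical)."""
    def __init__(self) -> None:
        self._messages: dict[str, list[tuple[float, str]]] = {}

    def publish(self, subject: str, payload: str, ttl: float) -> None:
        # Publishing makes the subject exist; the TTL bounds its life.
        self._messages.setdefault(subject, []).append(
            (time.monotonic() + ttl, payload))

    def read(self, subject: str) -> list[str]:
        now = time.monotonic()
        live = [(exp, p) for exp, p in self._messages.get(subject, [])
                if exp > now]
        if live:
            self._messages[subject] = live
        else:
            self._messages.pop(subject, None)  # subject vanishes with its messages
        return [p for _, p in live]

bus = EphemeralBus()
job_subject = f"jobs.{12345}.progress"  # derived from a job_id, never pre-created
bus.publish(job_subject, "step 1 done", ttl=0.2)
```

Binding the TTL to business semantics (the job's lifetime) is what removes cleanup from the Agent's responsibilities entirely.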
Tiered Semantics Is the Real Differentiator
Persistence is not binary — it is a spectrum. Agents have different needs in different scenarios:
- No persistence: temporary task communication, messages are processed immediately on arrival, missing them is fine, lowest latency
- Temporary persistence: critical instruction delivery, the Agent may be briefly offline, messages must not be lost, but auto-cleaned after task completion
- Medium-term persistence: IoT data reporting, needs to be retained for a period for analysis and replay, rolling by time window
- Long-term persistence: data pipelines, audit logs, needs long-term retention and support for replay at any point in time
These four needs correspond to the same interface in RobustMQ, switched with a single parameter:
subscribe(address, callback, mode="realtime") // no persistence
subscribe(address, callback, mode="reliable") // temporary persistence
subscribe(address, callback, mode="replay") // long-term persistence

The three forms of the underlying Shard storage — Memory, RocksDB, File Segment — naturally correspond to these three levels. Agents start with the lightest option; when business complexity grows, they upgrade the semantics directly without migration, without switching brokers, with the interface unchanged.
This is a capability unique to RobustMQ. NATS cannot achieve Kafka-level persistence. Kafka cannot achieve NATS's zero-topic lightweight model. Only RobustMQ stretches this line from the lightest to the heaviest on the same storage layer, with no gaps in between.
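The "one interface, switchable tiers" idea can be sketched as a simple dispatch table. The mode names follow the snippet above, but the tier classes and the `subscribe` signature here are illustrative, not RobustMQ's actual API:

```python
# Hypothetical sketch: one subscribe() signature, three storage tiers behind it.

class MemoryTier:        # no persistence: lost on restart, lowest latency
    name = "memory"

class RocksTier:         # temporary/medium persistence
    name = "rocksdb"

class SegmentTier:       # long-term persistence with replay
    name = "file-segment"

TIER_FOR_MODE = {
    "realtime": MemoryTier,
    "reliable": RocksTier,
    "replay": SegmentTier,
}

def subscribe(address: str, callback, mode: str = "realtime"):
    """The interface never changes; only the tier behind it does."""
    tier = TIER_FOR_MODE[mode]()
    return (address, tier.name)

# Upgrading semantics is a one-parameter change, not a migration:
light = subscribe("tasks.build", print, mode="realtime")
heavy = subscribe("tasks.build", print, mode="replay")
```

The caller's code is identical in both cases; only the guarantee behind the address changes.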
Security follows the same tiered logic:
- No auth: development testing, internal systems, Agent self-declared tenant ID, fully trusted environment
- Lightweight token: only verifies legitimacy without complex permissions, suitable for most production scenarios
- Full ACL: fine-grained access control, enterprise-grade compliance needs
Users choose by scenario, default to the lightest, add a layer when security is needed, interface unchanged. This shares the same design philosophy as storage tiering — start lightest, upgrade on demand, always within the same system.
Every Agent Feels It Has Its Own Broker
db9 achieved "every Agent feels it has its own database." The corresponding question RobustMQ needs to answer is: does every Agent feel it has its own broker?
Multi-tenancy is the answer, but it's not just isolation at the protocol level. The real experience is: an Agent tells the system which tenant it is, creates subjects freely in its own tenant space, publishes messages, subscribes to data, and is completely unaware of other tenants' existence. The underlying storage and compute resources are shared, but isolation is complete.
Lightweight tokens provide minimum identity verification — not for complex access control, but to ensure "you are indeed the tenant you claim to be." Agents don't need to log in, register, or configure — they get a token and use it directly, with cost approaching zero.
This capability becomes very critical at scale with many Agents. A complex Agent system might run hundreds of sub-Agents simultaneously, each needing an independent communication space without interfering with others. Without lightweight tenant isolation, deployments at this scale would generate enormous management overhead.
Stay in the Communication Layer, Don't Overstep
RobustMQ does only one thing: reliably move messages from A to B.
Service discovery, Agent coordination, task orchestration — these are the responsibility of upper-layer frameworks, not the messaging layer. Just as TCP doesn't handle service registration and HTTP doesn't handle load balancing, infrastructure should not overstep into upper-layer concerns.
The $mq9.AI.API.* namespace is not RobustMQ doing service discovery — it provides a standardized subject space as a basic primitive. Upper-layer frameworks can build their own service discovery logic on this space; RobustMQ is only responsible for reliable message delivery.
The pipe belongs to RobustMQ; what flows through it and who is listening is decided by the upper layer. With this boundary clear, there is no feature creep, and RobustMQ won't become a system that does everything but does nothing well.
A Complementary Relationship with db9
db9 is the storage layer; RobustMQ is the communication layer. Together they form a complete Agent infrastructure stack.
- Agent message passing and task coordination → RobustMQ
- Agent state storage and context querying → db9
Huang Dongxu is building neural storage for Agents; RobustMQ is building neural communication for Agents. The two directions are complementary, not competitive. In a single task, an Agent might need both simultaneously — receiving instructions via RobustMQ, writing results to db9, and notifying downstream via RobustMQ.
If both are done well, they will be two very important layers in the Agent-era infrastructure stack.
The Core Proposition of Next-Generation Infrastructure: Not Volume, but Count
The core problem this generation of infrastructure solved is scale — large traffic volumes, high throughput, low latency, large data volumes. Kafka's core competitiveness is millions of messages per second; Redpanda's core competitiveness is lower latency and higher throughput than Kafka. You can keep competing on this dimension, but you're competing on the same axis — volume. On the "volume" dimension, current architectures are already good enough. Redpanda rewrote Kafka in C++ with several times the performance improvement, but in essence it's still an optimization within the same paradigm. The ceiling is clearly visible; no decisive advantage can be formed.
The core problem next-generation infrastructure will solve is count — not one user sending a hundred million messages, but a hundred million users each sending a few messages.
Agents are automated; robots are automated; IoT devices are automated. They don't need human intervention: they create connections themselves, send messages themselves, and disconnect themselves. The growth in user count is not linear but exponential. Every Agent, every device, every robot is an independent user that needs its own communication space. This "count" problem cannot be solved by optimizing current architectures; it requires fundamentally redesigning the assumptions of infrastructure.
"Volume" and "count" place completely different demands on architecture. The volume problem is optimizing the throughput path for individual messages, reducing latency, increasing concurrency. The count problem is supporting lightweight access for massive numbers of tenants, with extremely low per-tenant cost and automatic lifecycle management — it requires not a faster data path, but a lighter tenant model.
Kafka's architecture is designed for "volume" — partition is the core abstraction, fundamentally designed for high-throughput data pipelines. It has no concept of "tenant," no concept of "lightweight temporary channel," no concept of "use and discard." On the "count" battlefield, Kafka's and Redpanda's advantages actually become liabilities — every design decision they made assumed "resources are precious and need careful management," which is exactly the opposite of what the Agent era requires.
A system that supports a billion tenants with average throughput per tenant is more valuable in the Agent era than a system that supports ten thousand tenants with extremely high throughput per tenant.
High Reliability and Multiple Replicas Are Not That Important Anymore
Traditional infrastructure's obsession with high reliability comes from a deeply rooted assumption: every message is precious — once lost, it's gone forever. So three replicas, synchronous replication, Raft consensus — all the complexity is to ensure "no loss." This assumption held during the era when human developers hand-wrote code and carefully maintained systems.
But Agents change this assumption. Agents can retry, regenerate, and try a different direction. A temporary task message is lost — just resend it. A disposable exploration branch fails — open a new one. The "value density" of messages has decreased; paying the cost of three replicas for every message no longer makes sense. A large amount of Agent communication is temporary, exploratory, and repeatable — it simply doesn't need high reliability. Only a few critical paths — final task results, audit logs, critical cross-system events — truly need reliability guarantees.
What truly matters is not high reliability, but choice:
- Is this message worth three replicas? — No, use memory, lightest option, retry if lost
- Does this message need reliable delivery? — Yes, add a layer of persistence, guarantee no loss
- Does this message need long-term retention? — Yes, use File Segment, cold data archival
Different messages, different semantics, different costs. Not forcing all messages through the same high-reliability path, but letting Agents choose on demand and having the system execute based on configuration.
RobustMQ's storage tiers — Memory, RocksDB, File Segment — were originally designed for performance and cost trade-offs. But in this framework, their significance goes deeper: they provide choice over reliability for every message. This is not flexibility as a feature — it is the correct answer to the core proposition of next-generation infrastructure.
Closing Thoughts
NATS's "subject is an address, not a resource," unified storage's lightweight Shard, native multi-tenancy support, TTL automatic lifecycle management — these designs naturally point toward the "count" problem. Not doing the same thing better on the same dimension, but defining competition on a new dimension.
Huang Dongxu says: Agent-era infrastructure is not about inventing new things, but about recombining existing things in the right way, then driving costs to an extreme low and scale to an extreme high. RobustMQ's combination in the communication layer is: MQTT's connection model + Kafka's stream semantics + NATS's lightweight routing + unified storage layer. None of these are new inventions; all are validated mental models. The goal of the combination is not to be faster, but to be lighter, to support more users, and to scale larger.
This is the question that next-generation communication infrastructure truly needs to answer.
