RobustMQ: Next-Gen Communication Infrastructure Built for AI
RobustMQ is compatible with and extends the Kafka protocol, allowing existing training frameworks and Agent applications to connect with zero code changes. Through four architectural innovations—object storage direct access, million-scale lightweight Topics, shared subscription, and multi-mode storage—RobustMQ transforms the message queue from a passive data carrier into an intelligent data scheduling and caching layer for AI scenarios: GPUs no longer wait for data, and millions of Agents each have their own channel.
Introduction: What Kind of Communication Infrastructure Does the AI Era Need?
As AI develops rapidly, we keep asking: what scenarios should the next communication platform serve? After digging into AI training, Agent communication, and edge AI, I found an unexpected but sensible answer—existing communication infrastructure can’t meet AI-era needs, and RobustMQ can fill that gap.
Traditional middleware like Kafka was born in the big data era, designed for real-time log processing and stream computation. EMQ and other MQTT brokers focus on IoT device connectivity. But the AI era brings new demands: massive Agents need independent channels, GPU training needs extreme data loading performance, edge AI needs lightweight data transfer. These go beyond what existing communication middleware was built for and require a rethought solution.
RobustMQ’s positioning: next-generation unified communication infrastructure for AI, IoT, and big data. Kafka protocol keeps ecosystem compatibility; architectural innovation pushes past current limits, designed specifically for AI-era scenarios.
Part One: RobustMQ’s Architectural Innovations
Kafka’s Fundamental Bottlenecks
Kafka was created in 2011 with a file-system-based design: each Partition corresponds to a separate log file. This works well for big data but runs into four severe limits in AI scenarios.
First, Topic count. Each Kafka Partition needs two file handles (log + index); the OS caps total handles. In practice, a single Kafka Broker is advised to stay under about 4,000 Partitions. With 3 Partitions per Topic, that’s roughly 1,300 Topics at the limit. For AI, that’s a problem.
An AI Agent system might run 100,000 Agents, each needing input, output, and state sync channels—300,000 Topics. A multi-tenant training platform serving 1,000 customers with training, validation, and checkpoint streams—3,000 Topics. A team running 1,000 parallel experiments, each with data, gradient, and metric streams—3,000 Topics. Kafka would collapse in these cases, which are common in AI.
Second, concurrency limits. Kafka guarantees order within a Partition by design: one Partition can only be consumed by one consumer in a Consumer Group. Concurrency is capped by Partition count. For 8 GPUs in parallel training, you need at least 8 Partitions. Scaling to 64 GPUs means changing Topic config to add Partitions and possibly rebalancing, affecting training. Worse with multiple experiments: Experiment A uses 4 GPUs, B uses 8, C uses 16, all sharing one dataset Topic—no single Partition count fits all.
Third, single storage mode. Kafka only persists to disk; even ephemeral real-time data goes to disk. AI training gradient sync needs microsecond memory latency; training data benefits from memory cache for GPU access. Kafka can’t provide pure-memory storage or automatic hot-cold data tiering; history must stay on expensive local SSD, not auto-archive to cheap S3.
Fourth, single data source. Kafka expects data via Producers. AI training data usually already lives in S3 or other object storage. To use Kafka, you must manually import TBs from S3 to Kafka—slow (hours) and redundant (S3 + Kafka copies), doubling storage cost.
RobustMQ’s Four Architectural Innovations
RobustMQ builds on RocksDB as a unified storage layer and fundamentally addresses these four limits.
Innovation 1: Object Storage Data Source
RobustMQ supports "Topic pointing to object storage." When creating a Topic, you can specify an S3 or MinIO path as the data source. RobustMQ prefetches data from object storage into local cache (memory or SSD) and serves it via the Kafka protocol. No "import"—data lives in S3 once; RobustMQ acts as a fast AI training cache layer.
Innovation 2: Million-Scale Topics
RocksDB is a high-performance KV store; all Topics share one RocksDB instance, distinguished by key prefix. A key might be "topic_name:partition:offset," value is the content. Topic creation is purely a metadata operation—no new files, so millions of Topics are possible. Creating 100,000 Topics takes seconds; deletion is equally fast; file handle count stays constant and doesn’t grow with Topic count.
Innovation 3: Shared Subscription
RobustMQ borrows MQTT’s shared subscription and applies it to Kafka. When creating a Kafka consumer, use a special group_id format: "$shared/{group_name}". RobustMQ recognizes this and enables shared subscription. Multiple consumers can consume the same Partition concurrently; data is distributed by round-robin or smarter policy, order not guaranteed but concurrency is high.
This breaks Kafka’s "concurrency = Partition count" limit. One Topic with one Partition can be consumed by 90 consumers (90 GPUs). Scaling from 32 to 64 GPUs is just starting 32 more consumer processes—no Topic changes. All via standard Kafka SDK; only the group_id convention changes; zero code changes.
For AI training, this is huge. Training data often doesn’t need strict order (it’s often shuffled). Shared subscription gives high-concurrency unordered consumption. It also decouples storage sharding (Partition count, disk parallelism) from compute concurrency (consumer count, GPU count), so both can be tuned independently.
Innovation 4: Multi-Mode Storage
RobustMQ uses a pluggable storage engine. Each Topic can have its own mode: memory for microsecond latency (e.g., gradient sync); hybrid for memory buffer and periodic flush to disk; persistent for full RocksDB storage; tiered for hot data in memory, warm on SSD, cold on S3.
These four innovations aren’t isolated features—they stem from one design philosophy: communication infrastructure shouldn’t just "move data" but be the "intelligent scheduling hub for data flow." From that, RobustMQ evolves into an architecture that fits the AI era.
Part Two: Intelligent Cache Design for GPU Training Acceleration
GPU Utilization Crisis
AI training’s main challenge is GPU utilization. According to an AI infrastructure alliance 2024 survey, only 7% of orgs exceed 85% GPU utilization at peak; most are 40–70%. That means 30–60% of the time, expensive GPUs idle waiting for data.
Example: 64 NVIDIA A100 GPUs, 7-day training. Cloud cost ~$1,600/hour, ~$270K total. At 60% utilization, 40% (about $110K) is wasted on data wait. Improving data pipelines to 85% utilization saves cost and shrinks training from 7 days to 5.
GPU wait comes from data load speed falling behind compute. An A100 processes hundreds of GB/s, but S3 reads are typically 50–200 ms. In the training loop, CPU reads from storage, decodes images, augments data, transfers to GPU. If storage read takes 100 ms and GPU compute 40 ms, the GPU waits 60% of the time.
Limits of Traditional Approaches
Common approaches have clear limits.
Direct S3 reads: simplest, slowest. Each process fetches a batch per iteration; high, variable latency. Worse with concurrency: 128 GPUs → 128 processes hitting S3 → easy throttling (~5,500 GET/s per bucket). Multiple epochs repeat the same reads, multiplying S3 egress cost.
Local SSD cache: better. Download data to each training server’s local SSD before training, then read locally. Latency drops, but in multi-node training, 16 servers × 150 GB each = 2.4 TB downloaded; S3 egress multiplies. Scaling mid-training (e.g., more Spot instances): new nodes have empty cache, 5–10 minutes warm-up while GPUs idle.
PyTorch DataLoader: some prefetch, typically 2–4 batches, a few hundred MB. Each process has its own instance; on 8 GPUs per node, 8 processes each cache, wasting memory and no sharing.
Alluxio: full distributed cache, smart prefetch, multi-tier storage; real cases show 90%+ GPU utilization. But it uses POSIX; you mount S3 as a filesystem, setup is heavier. As a commercial product, the open-source version is limited; full features require a license.
RobustMQ’s Intelligent Cache
RobustMQ approaches from the Kafka protocol: standard Kafka SDK, zero code changes, but with an AI-training-optimized cache.
Three-Tier Cache
Memory: sub-millisecond access, for hot data in use. SSD: millisecond access, for warm data next in line. S3: source of truth, cold data loaded on demand. Data flows automatically between tiers; RobustMQ optimizes based on access patterns.
When creating a Topic, specify the object storage path. E.g., ImageNet at "s3://datasets/imagenet/train/" with 1000 tar files of 1 GB each. One command creates a Topic; RobustMQ scans the path, builds indexes. Each file becomes consumable records (e.g., 1000 images per tar → 1000 records).
Predictive Prefetch
RobustMQ prefetches in the background. With a cache size (e.g., 200 GB), it downloads the first 200 files to local cache—hottest in memory (e.g., first 10 files, 10 GB), next hottest on SSD (remaining 190 files, 190 GB). Prefetch continues; when training starts consuming, RobustMQ predicts what’s next and prefetches.
Preload isn’t passive "cache on access" but active "predict next access." By monitoring consumption rate (e.g., 100 records/s), it computes where consumption will be in 10 seconds and prefetches that range from S3. When the process reaches that position, data is already in memory; the GPU hardly waits.
Multi-Epoch Optimization
Especially for multi-epoch training. AI training often iterates over the same data (e.g., 10 epochs). RobustMQ detects this. In the first epoch, maybe 70% fits in cache, 30% comes from S3. It records access frequency and keeps the hottest data in cache. By the second epoch, that 70% is served from cache; latency drops from ~100 ms to ~2 ms. Third and fourth epochs improve further; cache hit rate can exceed 95%.
Eviction is also smarter. When cache is full, RobustMQ uses more than simple LRU. If it detects multi-epoch, it keeps data that will be accessed in the next epoch, not data just visited but not due soon. For first-time training, it prefers upcoming sequential blocks.
Shared Cache Across Nodes
In multi-node training, RobustMQ runs as a distributed cache cluster; all nodes share one cache. Data is downloaded from S3 once into the RobustMQ cache (e.g., 3 nodes × 500 GB = 1.5 TB) and served over the internal network. S3 egress drops dramatically; no concurrent throttling; new nodes read from cache immediately, no warm-up.
Training uses standard Kafka consumer API to read from RobustMQ. Data is in cache; latency goes from S3’s 100–200 ms to under 2 ms. Training code doesn’t care whether data is from S3 or cache.
With object storage as source, three-tier intelligent cache, predictive prefetch, and access-pattern-aware eviction, RobustMQ cuts training data access from 100–200 ms to under 2 ms, raises GPU utilization from 60% to 85%+, and shortens training by 25–40%. All transparent to the training program; standard Kafka SDK is enough.
Part Three: Million-Scale Topics for AI Agent Communication
Agent-Era Communication Challenges
AI Agents are becoming the dominant form of AI applications. From LangChain to CrewAI to AutoGPT, frameworks are evolving fast. A typical Agent system may have hundreds or thousands of Agents—retrieval, planning, code gen, verification—that need frequent communication.
The usual approach is shared channels with metadata to distinguish Agents. That breaks at scale: mixed data, hard to manage and monitor; coarse permissions; poor cost attribution; debugging one Agent affects others.
Ideal: each Agent has its own channel. Kafka’s Topic limit blocks that. 100,000 Agents × 3 Topics = 300,000 Topics; a Kafka cluster can’t handle it.
RobustMQ’s Million-Topic Ability
RobustMQ’s unified RocksDB layout enables million-scale Topics. All Topics share one RocksDB instance; keys are distinguished by prefix. Creating a Topic is adding metadata; no new files—seconds. Delete is similar—just clear a key range.
So each Agent can have its own Topics. 100,000 Agents × 3 Topics—RobustMQ handles 300,000 Topics. No upfront planning; Topics are created and deleted as Agents come and go.
Monitoring and management become clear. Per-Topic metrics: produce rate, consume lag, backlog. You know which Agent is slow or stuck. Permissions can be per-Topic; cost can be attributed by Topic.
For multi-tenant Agent platforms this matters. A SaaS platform might serve thousands of customers with dozens to hundreds of Agents each—tens or hundreds of thousands of Topics. RobustMQ supports this; Kafka doesn’t.
For training experiments, million Topics also help. One S3 dataset can back multiple Topic "views": by class (imagenet-class-cat, imagenet-class-dog, etc.), by shard (imagenet-shard-0001 to 1000), by resolution (imagenet-high-res, imagenet-low-res). Experiments pick the Topic combos they need; data reused, isolation preserved.
Part Four: Shared Subscription Breaks Concurrency Limits
Kafka’s Concurrency Model
Kafka’s design guarantees order within a Partition, which caps concurrency by Partition count. In a Consumer Group, one Partition → one consumer. With 3 Partitions, at most 3 consumers work; a 4th idles.
That’s fine when order matters. For AI training it’s a bottleneck. Training data often doesn’t need strict order; it’s often shuffled. With Kafka, 8 GPUs need at least 8 Partitions; 64 GPUs need 64 Partitions. Mid-training scale-up from 32 to 64 means changing Partition count, possible rebalance, affecting progress.
Worse with multiple experiments. A uses 4 GPUs, B uses 8, C uses 16; they share one dataset Topic. How many Partitions? 4 → C underuses; 16 → A overuses. No single config fits.
RobustMQ’s Shared Subscription
RobustMQ applies MQTT-style shared subscription to Kafka. Use group_id "$shared/{group_name}"; RobustMQ enables shared mode. Multiple consumers consume the same Partition; data is distributed round-robin or by policy; order not guaranteed, concurrency is.
Usage: start 8 training processes with group_id "$shared/training-gpu". RobustMQ sees 8 consumers, starts distributing. Consumer 1 gets 1, 9, 17...; Consumer 2 gets 2, 10, 18...; 8-way parallelism, 8× consumption speed. Code is standard Kafka SDK; only group_id changes.
Supports elastic scaling. Start with 32 GPUs, 32 consumers. Two hours later, scale to 64; start 32 more processes with the same group_id. RobustMQ detects them and starts distributing; no Topic changes, no rebalance. If Spot instances are reclaimed, RobustMQ re-assigns unacked data to remaining consumers.
Shared subscription decouples storage sharding and compute concurrency. Storage: few Partitions (e.g., 9 across 3 Brokers for disk parallelism). Consumption: any number of consumers (4, 32, 90). Example: 3 Brokers, 9 Partitions, 90 training processes; RobustMQ reads from 9 Partitions and serves 90 consumers.
Part Five: MQTT Unifies Edge-to-Cloud Data Paths
Rise of Edge AI
AI is moving from cloud to edge. Autonomous vehicles run perception on in-vehicle chips; industrial cameras do quality inspection on edge gateways; smart speakers run speech recognition locally. Edge AI data needs local real-time processing and cloud upload for model training, forming a closed loop.
Traditional setup: two systems. Edge–device with MQTT (light, fits unstable networks, low power). Cloud big data with Kafka (high throughput, persistence, mature ecosystem). You need an "MQTT → Kafka" bridge—extra complexity, latency, ops, and a possible single point of failure.
Harder case: same data needed via MQTT (e.g., edge gateway subscribes to vehicle sensor data) and via Kafka in the cloud (for training). Traditional flow: device → MQTT Broker → bridge → Kafka → training platform; each hop adds latency; data may be copied multiple times.
RobustMQ’s Dual-Protocol Unification
RobustMQ implements both MQTT and Kafka; they share one storage layer. Data can be published via MQTT and consumed via Kafka, or the reverse. True zero-copy protocol translation—one copy, two access views.
Example: connected vehicle data loop. Vehicles send sensor data (camera, lidar, GPS) via MQTT to RobustMQ. Edge peers (other vehicles, roadside) subscribe via MQTT for real-time processing. Cloud training consumes the same data via Kafka for model training. One transmission, serving both edge and cloud.
Industrial IoT inspection: cameras send images via MQTT to RobustMQ on an edge gateway. Edge AI subscribes via MQTT for real-time defect detection. Same images flow via Kafka to cloud data lake and time-series DB (e.g., GreptimeDB) for analysis and model tuning. One system, two protocols, full edge-to-cloud path.
RobustMQ also handles edge disconnection. When the edge network is unstable, local RobustMQ caches data; when connectivity returns, it syncs to the cloud RobustMQ cluster. Lightweight Rust enables deployment on 512 MB edge devices; MQTT and Kafka unification makes RobustMQ suitable for edge AI.
Part Six: The Broader AI Infrastructure Picture
Standard Upstream for Time-Series Databases
AI and IoT produce large amounts of time-series data: training loss curves, GPU utilization, sensor readings, device state. Often stored in time-series DBs (GreptimeDB, InfluxDB, TimescaleDB) for analysis and visualization. These need upstream data pipelines.
Typical setup: MQTT Broker + Kafka. IoT devices send to EMQ, which bridges to Kafka, which feeds the time-series DB. Two systems to deploy and operate; config is complex; data is transformed multiple times.
RobustMQ can be the direct upstream. Edge devices send via MQTT; the time-series DB consumes via Kafka. One system, both ends. Million Topics let each device or metric have its own Topic; clear organization, efficient queries.
Multi-Mode Storage for Different Scenarios
AI scenarios differ in latency and persistence. RobustMQ’s multi-mode storage adapts.
Gradient sync: microsecond latency, ephemeral—Memory mode. Training data: repeated reads, temporary—Hybrid (memory + SSD). Checkpoints: long-term—Persistent. IoT history: low access—Tiered, auto-archive to S3.
Topic-level storage config lets one RobustMQ cluster serve all these. Training teams pick a mode when creating Topics; no need for separate systems per scenario.
Seamless Kafka Ecosystem Integration
RobustMQ’s full Kafka protocol compatibility is key. Tools and systems built on Kafka work with RobustMQ unchanged.
Flink and Spark use RobustMQ as a source. Kafka Connect connectors import to DBs and warehouses. Existing training code using Kafka SDK can switch to RobustMQ by changing bootstrap_servers and gain million Topics, S3 source, intelligent cache, etc.
Compatibility lowers adoption cost. No new APIs, no rewrites, no migration. Deploy RobustMQ, change config, get beyond-Kafka capabilities. For AI companies already on Kafka, it’s a smooth upgrade.
Part Seven: Key Technical Choices
Why Rust
RobustMQ is implemented in Rust by design. Zero-cost abstraction and no GC matter for latency-sensitive communication infrastructure.
Java-based Kafka can suffer GC pauses under load; when the heap grows, pauses can reach tens to hundreds of ms, causing noticeable latency spikes. For AI training, that disrupts data loading rhythm and GPU utilization. Rust’s lack of GC keeps latency stable; P99 can stay under 5 ms even at 100K messages/s.
Rust’s zero-copy handling helps with large payloads. Using Bytes, data can stay reference-counted through receive, store, read, and send—no physical copy. For a 20 MB video or image, that saves multiple 20 MB copies, reducing memory bandwidth and improving throughput.
Memory safety is another factor. Communication infrastructure runs 24/7; leaks or bad pointers can cause long-running crashes. Rust’s compile-time checks avoid these, so RobustMQ can run months or years without restarts.
Storage-Compute Separation
RobustMQ separates compute (Broker) and storage. Brokers handle protocol, routing, connections—stateless. Storage handles persistence—RocksDB, object storage, or other engines. The layers are clearly split and can scale independently.
This fits AI’s elastic needs. Training clusters change size: more compute during the day, less at night. With Spot instances, nodes can be reclaimed; new nodes must join quickly. With separation, adding compute is starting new Broker processes—seconds to join. Scaling down is shutting processes; no data migration.
Storage scales separately. More data → add RocksDB instances or expand object storage. A Topic gets hotter → adjust cache; no restart.
Part Eight: How This Differs From Existing Solutions
Not "Better Kafka" but a Paradigm Shift
RobustMQ isn’t just a performance tweak (e.g., C++/Rust rewrite). It rethinks what communication infrastructure should be in the AI era.
Kafka’s model: "Data flows from Producer → queue → Consumer." Producers are the source; the queue is the only copy. That fits app-generated real-time streams but not AI training: training data already exists in object storage at TB scale and is read repeatedly.
RobustMQ’s model: "The communication layer can point at external storage and act as a smart cache." S3 is the source of truth; RobustMQ is the acceleration layer. Consumers use the Kafka protocol; they don’t know if data comes from S3 or cache, just that it’s fast. That fits AI training: data lives in object storage (cheap), accessed through cache (fast), no manual migration.
Similar to Redis vs Memcached: Memcached was a simple KV cache; Redis added rich structures and became a "data structure server," expanding use cases. RobustMQ extends from "communication middleware" to "data pipeline and cache layer"—from async data transfer to AI training acceleration, edge-to-cloud protocol translation, and time-series ingestion.
vs. Dedicated Cache Solutions
Alluxio and similar products mount object storage as a POSIX filesystem; training reads "local" files while Alluxio caches. Cases show 90%+ GPU utilization and ~35% training speedup. But: mounting adds setup; FUSE-based solutions have overhead; many frameworks prefer stream interfaces over files.
RobustMQ’s difference: it speaks Kafka protocol, which AI engineers already know. PyTorch and TensorFlow DataLoaders have Kafka integration; many platforms use Kafka for data pipelines. With RobustMQ, no architecture change—replace Kafka and gain object storage source, intelligent cache, million Topics.
RobustMQ is open source. Alluxio has an OSS version, but enterprise features (advanced cache, multi-cloud) cost money. RobustMQ plans to open all core features for the AI community.
Conclusion: Becoming AI-Era Communication Infrastructure Through Architectural Innovation
RobustMQ redefines the role of communication infrastructure in the AI era through four core designs. The premise: full Kafka protocol compatibility. Training frameworks, data pipelines, existing Kafka clients—no code changes to integrate. Not a reinvention, but architectural innovation on top of Kafka: Kafka-compatible and beyond.
First, object storage source and intelligent cache. Topics point directly to S3/MinIO; RobustMQ as cache layer uses three tiers (memory/SSD/S3) and predictive prefetch. Early epochs may load from S3; RobustMQ learns patterns; by epoch 2–3, 95%+ of accesses hit cache; latency drops from ~200 ms to ~2 ms. In multi-node training, a shared cache cluster lets all nodes use one download from S3, avoids duplication and throttling, and new nodes start with cache ready. Kafka can’t do this—Kafka Brokers must hold local copies and don’t natively use object storage as source.
Second, million-scale lightweight Topics. RocksDB-based unified storage; all Topics share one KV instance, keys distinguished by prefix, bypassing file-system limits. Each Agent, experiment, or tenant can have its own channels; full isolation and cost attribution. Kafka’s file-per-Partition model caps Topics at tens of thousands.
Third, shared subscription. Using "$shared/{group}" group_id, multiple consumers can consume one Partition, breaking Kafka’s "concurrency = Partition count." Storage sharding and compute concurrency decouple. AI training can scale consumers with GPU count, 4 to 90, without changing Topics. Still standard Kafka Consumer API; concurrency no longer locked to Partition count.
Fourth, multi-mode storage and tiering. AI scenarios differ: gradient sync needs memory latency but can lose data; training data needs repeated access but can be temporary; checkpoints must persist; logs need cheap archival. RobustMQ’s per-Topic storage modes let one system span from real-time to long-term archival, from 512 MB edge gateways to PB-scale cloud clusters. Kafka has one mode—append-only log and tiering to S3—and can’t match different storage strategies per Topic.
These four designs form a systemic architectural shift. They address one question: how can communication infrastructure evolve from "passive data carrier" to "active data flow scheduler"? In that role, RobustMQ not only transports data but understands sources (object storage), access patterns (training order and repetition), and compute needs (GPU concurrency), and optimizes storage, cache, and distribution.
All of this is transparent to users. Kafka compatibility means Flink, Spark, PyTorch DataLoader—everything built on Kafka—can connect with zero migration. Users see familiar Kafka API and get beyond-Kafka capabilities: object storage integration, million Topics, elastic concurrency, multi-mode storage. RobustMQ's positioning: Kafka protocol enhanced—Kafka and beyond Kafka—exploring a viable path for AI-era communication infrastructure.
