RobustMQ Storage Layer Design (Part 2): From Philosophy to Solution

In the previous article "RobustMQ: Some Thoughts on Message Queue Storage Layers," I shared initial thinking about the storage layer: why compute-storage separation, the diversity of message queue scenarios, and the design of pluggable storage.

That article was more about ideas. As we've gone deeper into the Storage Engine design, many concrete questions have clarified. This article shares the thinking in designing the core storage engine: what problems arose, how we found solutions step by step, and why we made these choices.

The Essence of Pluggable Storage

Before diving into design, a quick recap of pluggable storage. Given varied scenario needs, RobustMQ's answer is pluggable storage—not for technical flair.

The core value of pluggability is scenario fit. Storage Engine is the main engine, built for production cluster deployment and solving elasticity, cost, and consistency in a systematic way. Memory and RocksDB serve standalone and edge scenarios with no external dependencies and small binaries. MySQL and Redis are less common but useful when enterprises have specific needs, e.g., compliance requiring data in a given database or reuse of existing infrastructure.

Pluggability isn't about "supporting many storages"—it's about leaving enough room in the architecture. When users hit scenarios we didn't anticipate, they can implement a Storage Adapter instead of giving up or forking. That's architectural openness, not feature bloat.

This article focuses on Storage Engine design. This is RobustMQ's core and where we spend most effort.

Problem 1: How to Guarantee Order Under High Concurrency

The first challenge for Storage Engine: offsets within the same partition must be strictly contiguous. How to achieve that under high-concurrency writes?

The obvious idea is locks. One lock per partition; on write, acquire lock, allocate offset, write file, release. Simple to implement, about twenty lines. But each thread usually writes a single message before releasing, which leads to many small I/Os. Frequent small writes are bad for disks—both SSD and HDD—which prefer batch writes. Plus lock hold time includes disk I/O, so lock contention gets severe under high concurrency.

Our solution is the I/O Pool architecture. A fixed number of I/O Workers (e.g., 16) manage all partitions. partition_id % worker_count routes each partition's requests to the same Worker. Workers batch requests: wait for the first, then non-blockingly collect more—possibly hundreds or thousands. After grouping by partition, one fsync can persist hundreds or thousands of messages.

We turn "one small write per message" into "one large write per batch." Memory use scales only with Worker count. The idea: use queues instead of lock waits; use batch processing to improve disk efficiency.

We also use zero-copy. With Rust's Bytes, data flows from network to disk in a single copy, shared via reference counts, avoiding multiple copies and reducing CPU and memory overhead.

Problem 2: How to Build Indexes

A message queue needs several indexes: offset lookup, time range query, key/tag filtering. We use RocksDB for indexes because it's a high-performance KV store suited to indexing and supports batch writes, which matches our batch processing.

The next question: synchronous or asynchronous index building? Async might improve write throughput, but introduces many issues: knowing each record's file position, races between write and index threads, index durability, recovery after crash. Implementation gets complex, and users might query data that isn't indexed yet, which is poor UX.

We chose synchronous index building. The insight: under batch writes, index cost is small. When a Worker batch-processes 1,000 records, it builds all indexes (offset, time, key, tag) for those 1,000, then writes them once via RocksDB WriteBatch. Data file: one I/O. RocksDB index: one I/O. Two I/Os total. Extra latency ~1ms, but indexes are immediately usable, data and index stay strongly consistent, and crash recovery is simple.

For the offset index we use a sparse strategy: one index entry per 1,000 records. To query, we use RocksDB to find the nearest index point and file position, then scan at most 1,000 records. Sparse indexing keeps size tiny (10M records ≈ 240KB) and lookup fast (~2ms). Time, key, and tag indexes can be built selectively per business needs and written via RocksDB batches; overhead is controlled.

Problem 3: Choosing a Consistency Protocol

RobustMQ uses compute-storage separation; Segments live on multiple Storage Nodes. How to keep replicas consistent? We compared ISR and Quorum.

Quorum looks attractive: no single point, fast failover, closer to compute-storage separation. Broker writes to 3 Storage Nodes in parallel and returns on 2 acks. But we hit a core issue: any replica can be incomplete.

Example: write 1,000 messages. Storage A might miss 100, B 120, C 80. Every message satisfies Quorum, but no replica has all of them. A read from the wrong replica fails and needs retries. For sequential consumption it's worse: a consumer can hit a gap, think it's done, and miss later data.

Quorum needs complex background repair: periodically scan all Segments, check completeness, fill gaps. That logic is hard—repair speed vs write speed, failures during repair, delay at scale. If repair isn't solid (tough early in a new project), gaps accumulate, read failure rate climbs, and you get a vicious cycle. Real engineering risk.

We chose ISR. There is a Leader, but the Leader ensures complete data and ISR Followers are complete too; reads succeed 100% with no background repair. Followers pull from Leader; in high QPS, small-message workloads this performs better because they can batch pulls and cut network requests from millions to hundreds. Pragmatic choice: simple and reliable over theoretically perfect.

We still have Leaders, but with sensible Segment placement they spread across Storage Nodes and don't bottleneck. Brokers stay stateless; compute-storage separation benefits remain. We also support flexible acks: strong consistency (acks=all), balanced (acks=quorum), high throughput (acks=1).

Problem 4: How to Reduce Leader Management Cost

With ISR, a new question: if every Segment has a Leader, the count explodes. 1,000 Shards × 100 Segments each ⇒ 100,000 Leaders. Election, ISR maintenance, heartbeats become huge overhead.

Our solution: distinguish Active and Sealed Segments. Only Active Segments (currently written) have Leader and ISR. When a Segment fills (e.g., 1GB) or hits a time threshold, it's sealed. On seal, the Leader waits until all ISR Followers have fully caught up and consistency is verified. Only then it's marked Sealed and the Leader role is dropped.

Sealed Segments have no Leader; all replicas are equal; reads can come from any. Leader count becomes Shard count, not Segment count. 1,000 Shards ⇒ 1,000 Leaders; management overhead drops by 100x. Sealed read load can be balanced across all replicas; 99% of historical reads spread across Storage Nodes, better utilizing storage.

Problem 5: How to Avoid Data Migration

Kafka scaling requires copying data between Brokers—hours or days, and production impact. Can we avoid migration entirely?

Answer: the Segment design naturally solves this. When adding Storage Nodes, we don't migrate historical data. We wait for the current Active Segment to fill; new Segments are then created on new nodes and take traffic. With high traffic, a 1GB Segment can fill in minutes to tens of minutes, so new nodes pick up load quickly. Fully transparent to applications, no performance hit.

Historical data remains on original nodes as Sealed Segments for reads. To reclaim space, Sealed Segments can move to S3 via tiered storage, asynchronously without impacting applications. Shrinking works similarly: stop creating new Segments on the node being removed, let Active Segments fill and seal, migrate or keep Sealed on S3, then decommission.

Compared to Kafka's migration: scaling takes effect immediately (new Segments go to new nodes), no historical migration (no large copies), no app impact (reads and writes continue), and it's mostly automated. Segment-based storage gives this naturally: no historical migration, only new data flowing to new nodes.

Problem 6: Cost Optimization and Tiered Storage

Sealed Segments being immutable makes tiered storage simple. We read the full file from any replica, upload to S3, update metadata. No complex coordination; on failure, retry.

Our tiered strategy: hot (Active Segments) on local SSD, 3 replicas + ISR, lowest latency and highest throughput. Warm (recent Sealed) on local storage (SSD or HDD), any replica readable. Cold (old Sealed) on S3, latency ~50ms but very low cost and effectively unlimited capacity. Local keeps only recent hot data (e.g., 7 days); older data in S3. For 1TB/day write and 1-year retention, storage cost can drop from about $210K to about $10K—roughly 95% savings.

Replica policies add flexibility. Multi-replica + local storage is default for core business data. For log-archive workloads, single replica direct to S3 reduces local cost and avoids intra-cluster replication; S3 provides durability. For low-latency feeds (real-time push, market data), single- or multi-replica memory can hit ~100µs latency, at the cost of possible data loss. Memory + async persistence trades off performance and durability for things like session state.

Different scenarios use different policies; flexibility between performance, cost, and reliability. That comes from understanding real requirements, not from stacking features.

Unexpected Bonus: Data Lake Integration

While thinking about tiered storage, we saw another benefit: Sealed Segments naturally fit data lakes. Segments are immutable and complete; when moving to S3 we can convert to Parquet. Spark, Hive, and similar tools can query directly. RobustMQ becomes part of data infrastructure—messages feed the data lake without ETL, supporting both real-time and offline analytics. Historical data in S3 supports audit and compliance.

This wasn't originally planned; it emerged naturally from tiered storage. Good design tends to yield bonuses.

Design Philosophy: Pragmatism

Looking back at the design, the mindset was scenario-driven, problem-oriented, pragmatic choices.

Each decision addressed a concrete problem: I/O Pool for high concurrency, RocksDB for index performance, synchronous build to avoid consistency bugs, ISR to avoid repair complexity, Active/Sealed to cut Leader cost, Segment design for migration-free scaling, tiered storage for cost. These aren't technical stunts; they're direct answers to real issues.

We believe the storage layer, as a core component, should use proven schemes and optimize for our case, rather than chasing theoretically perfect but risky ideas. We chose ISR over Quorum for the simple data-completeness guarantee. We chose synchronous index building because with batch writes the cost is small and the benefit is clear. We chose Active/Sealed separation because this simple change drastically cuts management cost. We chose Segments because they naturally solve migration.

Storage Engine serves mainstream scenarios with mature tech; Memory, RocksDB, MySQL, Redis, S3 serve long-tail cases; pluggable design leaves room for growth. When we talk about flexibility in storage, we mean scenario fit, not technical flexibility for its own sake. With different configs, RobustMQ can be a high-performance cluster message queue, a low-cost log aggregation system, an edge deployment, or an ultra-low-latency in-memory push system.

Next Steps

This article shared the thinking behind Storage Engine design. But implementation details are still evolving; we're validating and refining in practice.

The plan: first implement storage for MQTT and validate the design; then gradually complete the full storage model. MQTT will be our first deep validation, Kafka the second; each scenario will bring new challenges and insights.

Storage engine design is one of RobustMQ's biggest technical challenges. We'll keep exploring and improving, validating ideas in practice. From the kernel up, step by step. That's our choice and our commitment.

About RobustMQ

RobustMQ is a high-performance message queue written in Rust with a compute-storage separation architecture, aiming to become next-generation message middleware infrastructure. Follow us on GitHub and join the technical discussion.

RobustMQ Storage Layer Design (Part 2): From Philosophy to Solution ​

The Essence of Pluggable Storage ​

Problem 1: How to Guarantee Order Under High Concurrency ​

Problem 2: How to Build Indexes ​

Problem 3: Choosing a Consistency Protocol ​

Problem 4: How to Reduce Leader Management Cost ​

Problem 5: How to Avoid Data Migration ​

Problem 6: Cost Optimization and Tiered Storage ​

Unexpected Bonus: Data Lake Integration ​

Design Philosophy: Pragmatism ​

Next Steps ​