RobustMQ Storage Layer: Design and Implementation of File Segment

The storage engine is the heart of a message queue—it determines the system's performance ceiling, cost floor, and the range of scenarios it can serve. RobustMQ's storage layer plans three engines (Memory, RocksDB, File Segment) to cover different scenarios, but File Segment is the foundation and the one fully implemented today.

File Segment targets Kafka's typical scenarios: log collection, user behavior tracking, CDC, and real-time data pipelines. These scenarios share a profile: high volume per topic (tens of thousands to hundreds of thousands of messages per second) but relatively few topics (dozens to hundreds). This article shares the design rationale and implementation details for File Segment.

Core Challenge: Sequentiality Under High Concurrency

The storage layer of a message queue has a basic requirement: message offsets within the same partition must be strictly contiguous. With high-concurrency writes, how do we guarantee both order and high performance?

Traditional approaches use locks for offset allocation. Each partition gets a lock; threads acquire the lock, allocate an offset, write data, then release. The problem is that each thread usually writes only one message before releasing, leading to frequent small I/O operations. Worse, lock hold time includes disk operations, so lock contention is severe under high concurrency.

I/O Pool: The Key to Batch Processing

Our solution is the I/O Pool architecture. A fixed number of I/O Workers (e.g., 16) manage all partitions. Via partition_id % worker_count, each partition's requests always route to the same Worker.
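The routing rule can be sketched in a few lines. `worker_for` is a hypothetical helper name, not RobustMQ's actual API; the point is only that the mapping is deterministic, so one partition is always handled by one Worker and its offsets can be assigned without a lock.

```rust
/// Map a partition to the index of the I/O Worker that owns it.
/// `worker_for` is an illustrative name, not taken from the RobustMQ code base.
fn worker_for(partition_id: u64, worker_count: usize) -> usize {
    assert!(worker_count > 0, "need at least one worker");
    (partition_id % worker_count as u64) as usize
}
```

With 16 Workers, partitions 1, 17, and 33 all land on Worker 1, while partition 35 lands on Worker 3.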

Batch processing inside each Worker is the key. The Worker blocks until the first request arrives, then drains whatever else is already queued without blocking (possibly hundreds or thousands of requests). Requests are grouped by partition_id and processed in batches: multiple requests for the same partition get offsets assigned in order, all records are added to a buffer, and when the buffer reaches a threshold (e.g., 100 records), they are serialized and written to disk in one go.
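A minimal sketch of the collect-then-group step, using only the Rust standard library. `WriteRequest`, `collect_batch`, and `assign_offsets` are illustrative names and shapes, not RobustMQ's actual types; serialization, the buffer threshold, and the disk write are left out.

```rust
use std::collections::BTreeMap;
use std::sync::mpsc::Receiver;

/// One write request routed to this Worker (hypothetical shape).
struct WriteRequest {
    partition_id: u64,
    payload: Vec<u8>,
}

/// Block for the first request, then drain whatever is already queued,
/// up to `max_batch`. Returns `None` once all senders are gone and the
/// queue is empty.
fn collect_batch(rx: &Receiver<WriteRequest>, max_batch: usize) -> Option<Vec<WriteRequest>> {
    let first = rx.recv().ok()?; // blocking wait
    let mut batch = vec![first];
    while batch.len() < max_batch {
        match rx.try_recv() {
            Ok(req) => batch.push(req), // non-blocking drain
            Err(_) => break,            // queue empty or disconnected
        }
    }
    Some(batch)
}

/// Group a batch by partition and assign contiguous offsets per partition.
/// `next_offset` holds the next free offset of each partition.
fn assign_offsets(
    batch: Vec<WriteRequest>,
    next_offset: &mut BTreeMap<u64, u64>,
) -> BTreeMap<u64, Vec<(u64, Vec<u8>)>> {
    let mut grouped: BTreeMap<u64, Vec<(u64, Vec<u8>)>> = BTreeMap::new();
    for req in batch {
        let next = next_offset.entry(req.partition_id).or_insert(0);
        grouped
            .entry(req.partition_id)
            .or_default()
            .push((*next, req.payload));
        *next += 1;
    }
    grouped
}
```

Because only one Worker ever touches a given partition, `next_offset` needs no lock; each per-partition vector can then be serialized and written as one sequential append.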

One fsync then persists hundreds or thousands of messages instead of a single message. Compared with the lock-based approach's "one small write per message," the difference in disk throughput can be orders of magnitude.

Memory usage scales only with Worker count. 16 Workers means about 16MB of channel memory; even with 10,000 partitions, total memory stays around 100MB. Worker count is usually set to CPU core count, which is enough to fully utilize I/O on SSD.

Zero-Copy: Reducing CPU Overhead

For the path from network to disk, we use Rust's Bytes type for zero-copy. Bytes uses Arc internally; clone() increments a reference count and does not copy data.

The data flow is: network receives BytesMut → freeze() to Bytes (zero-copy) → clone() when sending to channel (reference count only) → Worker receives and batch processes → direct access to raw data when writing. End to end, there is only one copy of the data; we just hold references in different places. This saves CPU and reduces memory bandwidth pressure.
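The reference-counting idea can be shown with the standard library alone. The sketch below uses `Arc<Vec<u8>>` as a stand-in for `bytes::Bytes` (an assumption made so the example has no external dependencies; the real type also supports cheap slicing). `SharedBuf`, `freeze`, and `fan_out` are illustrative names.

```rust
use std::sync::Arc;

/// Std-only stand-in for `bytes::Bytes`: a shared, immutable buffer.
type SharedBuf = Arc<Vec<u8>>;

/// "Freeze" a mutable buffer. Moving the Vec into the Arc keeps the same
/// heap allocation, so the payload bytes are not copied.
fn freeze(buf: Vec<u8>) -> SharedBuf {
    Arc::new(buf)
}

/// Hand one reference to each consumer; only the reference count changes.
fn fan_out(buf: &SharedBuf, consumers: usize) -> Vec<SharedBuf> {
    (0..consumers).map(|_| Arc::clone(buf)).collect()
}
```

Every clone points at the same payload address, which is what lets the channel, the Worker, and the writer share one copy of the message.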

Index Design: Sparse Index + Synchronous Build

RobustMQ uses RocksDB for indexes: offset index, time index, key index, and tag index. We chose synchronous index building.

When a Worker batch processes 1,000 records, it builds all indexes for those 1,000 records and then writes them in one RocksDB WriteBatch. Data file write: one I/O. Index write: one I/O. Total: two. With batch writes, index building adds roughly 1ms, and in return we get indexes that are immediately usable, strong consistency between data and indexes, and simple crash recovery.

The offset index uses a sparse strategy. One index entry per 1,000 records, recording the file position for that offset. To look up, we use RocksDB to find the nearest index point, get the file position, then sequentially scan at most 1,000 records from there. Sparse indexing keeps size tiny (10M records ≈ 240KB) while lookup latency remains low (≈2ms).
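A sketch of the sparse lookup, with an in-memory `BTreeMap` standing in for RocksDB (an assumption for the sake of a runnable example; RocksDB would serve the same "nearest key at or before" query with a seek). `SparseIndex`, `maybe_index`, and `floor` are illustrative names.

```rust
use std::collections::BTreeMap;

/// One index entry per this many records, as described in the text.
const INDEX_INTERVAL: u64 = 1_000;

/// Sparse offset index: offset -> byte position in the segment file.
struct SparseIndex {
    entries: BTreeMap<u64, u64>,
}

impl SparseIndex {
    fn new() -> Self {
        Self { entries: BTreeMap::new() }
    }

    /// Called for every appended record; only offsets that land on the
    /// interval boundary are actually indexed.
    fn maybe_index(&mut self, offset: u64, file_pos: u64) {
        if offset % INDEX_INTERVAL == 0 {
            self.entries.insert(offset, file_pos);
        }
    }

    /// Nearest index point at or before `offset`, returned as
    /// (indexed offset, file position). The reader scans forward from
    /// that position, skipping at most INDEX_INTERVAL - 1 records.
    fn floor(&self, offset: u64) -> Option<(u64, u64)> {
        self.entries
            .range(..=offset)
            .next_back()
            .map(|(o, p)| (*o, *p))
    }
}
```

The size estimate follows from the interval: 10M records / 1,000 = 10,000 entries, so 240KB corresponds to roughly 24 bytes per entry. That per-entry figure is inferred from the numbers quoted above, not stated by the source.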

ISR Consistency Mechanism

RobustMQ uses a compute-storage separation architecture: Storage Nodes manage Segments, Brokers are stateless. We chose ISR over Quorum as the consistency protocol.

ISR is core to Kafka's design. Each Segment has a Leader and maintains an ISR (In-Sync Replicas) set. A write succeeds once the data has been replicated to all replicas in the ISR. Followers pull data from the Leader; in high-QPS, small-message workloads this performs better because Followers can batch pulls, cutting network requests from millions to hundreds.

The main advantage is data completeness. The Storage Leader has full data; ISR Followers have full data. Reads can come from the Leader or any ISR Follower and always succeed with no gaps. The implementation is simple and reliable; no complex background repair is needed.

Quorum avoids a single point of failure but has a critical issue: any replica can be incomplete. A Broker writes concurrently to 3 Storage Nodes and returns success on 2 acknowledgments—some replica may be missing data. Although each message satisfies Quorum, no single replica is complete. Reads can hit gaps and need to retry other replicas. For sequential consumption this is worse: a consumer may hit a gap, think it has consumed everything, and miss later data. Quorum relies on complex background repair logic; engineering complexity is high.
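The gap problem is easy to reproduce in a toy simulation. In the sketch below, each message is acknowledged by two of three replicas, and the replica that misses it rotates; this rotation is an artificial assumption chosen to make the effect visible. `quorum_write` and `gaps` are illustrative names.

```rust
use std::collections::BTreeSet;

/// Simulate quorum writes to 3 replicas: message `offset` is stored on
/// every replica except replica `offset % 3` (the "slow" one that round).
fn quorum_write(n_messages: u64) -> [BTreeSet<u64>; 3] {
    let mut replicas = [BTreeSet::new(), BTreeSet::new(), BTreeSet::new()];
    for offset in 0..n_messages {
        let slow = (offset % 3) as usize;
        for (i, replica) in replicas.iter_mut().enumerate() {
            if i != slow {
                replica.insert(offset);
            }
        }
    }
    replicas
}

/// Offsets in `0..end_offset` that a replica does not hold.
fn gaps(replica: &BTreeSet<u64>, end_offset: u64) -> Vec<u64> {
    (0..end_offset).filter(|o| !replica.contains(o)).collect()
}
```

After six writes every message has its two copies, yet replica 0 is missing offsets 0 and 3, replica 1 is missing 1 and 4, and replica 2 is missing 2 and 5: no single replica can serve a sequential read without help.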

Given engineering complexity and risk, we chose ISR. Although Storage has a Leader, with sensible Segment placement Leaders spread across Storage Nodes and do not become a bottleneck. The Broker layer stays stateless; the benefits of compute-storage separation remain.

We support flexible acks configuration: acks=all (strong consistency, wait for all ISR replicas), acks=quorum (balanced), acks=1 (fastest but may lose data).
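One way to model the three settings is as an enum that maps to a required acknowledgment count. This is a sketch under assumptions: the source does not define `acks=quorum` precisely, so a simple majority of the ISR (Leader included) is assumed here, and `Acks`, `parse`, and `required` are hypothetical names.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Acks {
    All,
    Quorum,
    One,
}

impl Acks {
    /// Parse the configuration value; unknown values yield `None`.
    fn parse(s: &str) -> Option<Acks> {
        match s {
            "all" => Some(Acks::All),
            "quorum" => Some(Acks::Quorum),
            "1" => Some(Acks::One),
            _ => None,
        }
    }

    /// Acknowledgments needed (Leader included) before a write is
    /// reported as successful. Majority-of-ISR for `Quorum` is an
    /// assumption, not taken from the source.
    fn required(self, isr_size: usize) -> usize {
        let isr_size = isr_size.max(1); // the Leader always counts
        match self {
            Acks::All => isr_size,
            Acks::Quorum => isr_size / 2 + 1,
            Acks::One => 1,
        }
    }
}
```

With an ISR of three, this yields 3, 2, and 1 required acknowledgments for `all`, `quorum`, and `1` respectively.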

Active vs Sealed Segment

On top of ISR we made an important optimization: distinguish Active Segment and Sealed Segment.

The Active Segment is the segment currently being written; it has a Leader and an ISR, with Followers continuously pulling to replicate. When a Segment is full (e.g., 1GB) or hits a time threshold, it is sealed. A Segment being sealed stops accepting writes; the Leader then waits until all ISR Followers have fully caught up and consistency has been verified. Once all replicas are complete and consistent, the Segment is marked Sealed and the Leader role is released. After that, there is no Leader; all replicas are equal and reads can come from any of them.
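The lifecycle reads naturally as a small state machine. The sketch below is one possible encoding, not RobustMQ's implementation: the intermediate `Sealing` state, the function names, and the assumption that the ISR list includes the Leader are all illustrative. Consistency is checked here only by comparing end offsets; a checksum comparison would be a natural addition.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq, Eq)]
enum SegmentState {
    /// Accepting writes; has a Leader and an ISR (node ids).
    Active { leader: u32, isr: Vec<u32> },
    /// Writes stopped; waiting for every ISR member to reach `end_offset`.
    Sealing { leader: u32, isr: Vec<u32>, end_offset: u64 },
    /// Immutable; no Leader, any replica can serve reads.
    Sealed { replicas: Vec<u32> },
}

/// Stop accepting writes at `end_offset`. No-op unless the segment is Active.
fn begin_seal(state: SegmentState, end_offset: u64) -> SegmentState {
    match state {
        SegmentState::Active { leader, isr } => {
            SegmentState::Sealing { leader, isr, end_offset }
        }
        other => other,
    }
}

/// Release the Leader once every ISR member holds data up to `end_offset`.
/// `replicated` maps node id -> end offset that node currently holds.
fn try_finish_seal(state: SegmentState, replicated: &BTreeMap<u32, u64>) -> SegmentState {
    match state {
        SegmentState::Sealing { leader, isr, end_offset } => {
            let caught_up = isr
                .iter()
                .all(|node| replicated.get(node).copied() == Some(end_offset));
            if caught_up {
                SegmentState::Sealed { replicas: isr }
            } else {
                SegmentState::Sealing { leader, isr, end_offset }
            }
        }
        other => other,
    }
}
```

A segment stays in `Sealing` while any Follower lags, and only drops its Leader once all reported offsets match.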

The benefit is clear. Leader count equals Shard count, not Segment count. With 1,000 Shards and 100 Segments each (99 sealed, 1 active), we only have 1,000 Leaders, not 100,000. Leader management overhead (election, ISR maintenance, heartbeats) drops by 100x. And 99% of historical reads are spread across all Storage Nodes, better utilizing storage.

Migration-Free Scaling

The Segment design naturally solves Kafka's scaling pain. Kafka scaling requires repartitioning and copying large amounts of data between Brokers—slow, bandwidth-heavy, and potentially disruptive to production.

RobustMQ scaling is different. When adding Storage Nodes, we do not migrate historical data. We only wait for the current Active Segment to fill. New Segments are automatically placed on new nodes and start taking traffic. With high traffic, a 1GB Segment may fill in minutes to tens of minutes; new nodes pick up load quickly. The whole process is transparent to applications, with no performance penalty.
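The source does not spell out the placement policy, so the following is only one plausible sketch: pick the nodes that currently hold the fewest Segments, which makes a freshly added, empty node the first choice. `place_segment` and the `(node_id, segment_count)` input shape are assumptions.

```rust
/// Choose `replicas` nodes for a new Segment, preferring the nodes that
/// hold the fewest segments (ties broken by node id for determinism).
/// `load` is a list of (node_id, segment_count) pairs.
fn place_segment(load: &[(u32, usize)], replicas: usize) -> Vec<u32> {
    let mut nodes = load.to_vec();
    nodes.sort_by_key(|&(id, segments)| (segments, id));
    nodes.into_iter().take(replicas).map(|(id, _)| id).collect()
}
```

With nodes 1 to 3 holding about 100 Segments each and a new node 4 holding none, node 4 is selected for every new Segment until its load catches up, with no historical data moved.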

Historical data stays on the original nodes as Sealed Segments for reads. To reclaim space, Sealed Segments can be migrated to S3 via tiered storage, asynchronously in the background without affecting applications. Shrinking works similarly: stop creating new Segments on the node being removed, let its Active Segments fill and seal, then make sure its Sealed Segments have been migrated elsewhere or are kept on S3 before decommissioning.

Tiered Storage

The immutability of Sealed Segments makes tiered storage straightforward.

Hot data (Active Segments) lives on local SSD with 3 replicas and ISR, for lowest latency and highest throughput. Warm data (recent Sealed Segments) stays on local storage (SSD or HDD); any replica can serve reads, with slightly higher but still good latency. Cold data (old Sealed Segments) moves to S3; latency increases to about 50ms but cost is very low and capacity is effectively unlimited.
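The three tiers reduce to a small classification rule. The sketch assumes the boundary between warm and cold is a single local-retention window measured from the time the Segment was sealed; `Tier` and `tier_for` are hypothetical names.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq, Eq)]
enum Tier {
    Hot,  // Active Segment on local SSD
    Warm, // recent Sealed Segment on local storage
    Cold, // old Sealed Segment in object storage
}

/// Classify a segment. `local_retention` is how long Sealed Segments are
/// kept locally before being offloaded (an assumed policy knob).
fn tier_for(sealed: bool, age_since_seal: Duration, local_retention: Duration) -> Tier {
    if !sealed {
        Tier::Hot
    } else if age_since_seal < local_retention {
        Tier::Warm
    } else {
        Tier::Cold
    }
}
```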

Migration is simple. Sealed Segments are immutable and all replicas are complete; read the full file from any replica, upload to S3, update metadata. No complex coordination; on failure, retry. Locally we only keep recent hot data (e.g., 7 days); older data is in S3. For workloads writing 1TB/day with 1-year retention, storage cost can drop from $210K to $10K—about 95% savings.
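The upload-then-update-metadata flow with retry can be sketched as follows. `ObjectStore` is a hypothetical trait standing in for an S3 client, `InMemoryStore` is a test double that fails a configurable number of times, and the `segments/<id>` key layout is made up for the example.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq, Eq)]
enum Location {
    Local,
    Remote(String), // object key
}

/// Hypothetical abstraction over an S3-like store.
trait ObjectStore {
    fn put(&mut self, key: &str, data: &[u8]) -> Result<(), String>;
}

/// Test double: fails the first `fail_first` uploads, then succeeds.
struct InMemoryStore {
    fail_first: u32,
    objects: BTreeMap<String, Vec<u8>>,
}

impl ObjectStore for InMemoryStore {
    fn put(&mut self, key: &str, data: &[u8]) -> Result<(), String> {
        if self.fail_first > 0 {
            self.fail_first -= 1;
            return Err("transient upload failure".to_string());
        }
        self.objects.insert(key.to_string(), data.to_vec());
        Ok(())
    }
}

/// Upload a Sealed Segment and flip its metadata to Remote only after the
/// upload succeeds. Retrying is safe because the segment is immutable.
fn offload<S: ObjectStore>(
    store: &mut S,
    metadata: &mut BTreeMap<u64, Location>,
    segment_id: u64,
    data: &[u8],
    max_attempts: u32,
) -> Result<(), String> {
    let key = format!("segments/{}", segment_id);
    let mut last_err = String::from("no attempts made");
    for _ in 0..max_attempts {
        match store.put(&key, data) {
            Ok(()) => {
                metadata.insert(segment_id, Location::Remote(key.clone()));
                return Ok(());
            }
            Err(e) => last_err = e,
        }
    }
    Err(last_err)
}
```

If every attempt fails, the metadata still points at the local copy, so readers are never directed to an object that does not exist.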

Design Philosophy

For File Segment we kept one principle: simple and reliable beats complex and perfect.

We chose ISR over Quorum for the simple guarantee of data completeness. We chose synchronous index building—small overhead with batch writes, but indexes are immediately usable. We chose Active/Sealed separation—a simple optimization that greatly reduces management cost. We chose the Segment design—it naturally solves data migration.

We pick proven, mature schemes and optimize for our use case, rather than chasing theoretically perfect schemes with high engineering risk. File Segment draws on the experience of Kafka, Pulsar, and RocketMQ, seeking the best balance of high performance, high reliability, and low cost.

Summary

File Segment achieves a balance of high performance and reliability through I/O Pool, zero-copy, batch writes, and sparse indexing. Active vs Sealed Segments sharply cut Leader management overhead. The Segment mechanism enables migration-free scaling. Tiered storage enables cost optimization.

This isn't theoretical innovation—it's engineering composition. We combine existing techniques (sequential files, batch processing, ISR) in a clear architecture with sensible tradeoffs. As the foundation of RobustMQ's storage layer, File Segment sets the stage for future RocksDB and Memory engine extensions.

A solid foundation lets us go further.