
RobustMQ Raft State Machine Performance Investigation: The Problem Isn't RocksDB — It's Queuing

Where It Started

RobustMQ's Meta Service is built on OpenRaft and is responsible for storing MQTT sessions, subscription relationships, cluster metadata, and other data requiring strong consistency. When we were evaluating options, we looked at OpenRaft's official performance reports — the numbers were impressive: a single node could reach hundreds of thousands of QPS with microsecond-level latency.

But in our actual load testing, the reality was completely different.

When stress testing with a large number of MQTT client connections, throughput could reach 9,000 ops/s, but the latency was completely wrong — p95 was consistently above 360ms, p99 exceeded 450ms, and both were slowly creeping upward with no sign of convergence. The official benchmark claimed microsecond-level latency and hundreds of thousands of QPS — two orders of magnitude better than what we were seeing.

In this scenario, each client connection triggers a client_write, writing session data into the Raft state machine for persistence. As the connection count rises, the pressure on Raft writes is immediately exposed.

We used RobustMQ's built-in cli-bench tool to stress test, hitting the Meta Service's placement-create-session interface directly:

```bash
cargo run --package cmd --bin cli-bench meta placement-create-session \
  --host 127.0.0.1 \
  --port 1228 \
  --count 1000000 \
  --concurrency 3000 \
  --timeout-ms 3000 \
  --output table
```

1 million writes, 3,000 concurrent, 3-second timeout. Output (excerpted):

sec=56 | ops/s=8957  | p95=359ms p99=452ms
sec=60 | ops/s=8828  | p95=361ms p99=453ms
sec=70 | ops/s=9092  | p95=367ms p99=457ms
sec=75 | ops/s=12148 | p95=370ms p99=456ms
sec=82 | ops/s=2943  | p95=372ms p99=475ms  ← sudden drop
sec=83 | ops/s=7684  | p95=378ms p99=551ms  ← p99 spike
sec=84 | ops/s=8633  | p95=380ms p99=552ms

A few direct observations:

  • Stable throughput around 9,000 ops/s; under 3,000 concurrency this matches Little's Law: 3000 / 0.36s ≈ 8333
  • p95 slowly climbs from 359ms to 372ms, p99 from 452ms to 457ms — latency is continuously rising with no sign of convergence
  • A noticeable spike at second 82: ops/s drops to 2,943, and p99 subsequently stays above 550ms — likely a brief write stall triggered by RocksDB compaction or a background flush

This gap puzzled us for a while. We were using the same library, and there were no obvious anomalies in our implementation — so why was the performance so much worse? Was our code flawed, or is OpenRaft simply like this in production? Or was there some other explanation for the latency?

This article documents our investigation.

The Investigation: Adding Instrumentation Step by Step, Ruling Out Causes One by One

When facing performance problems, intuition is often unreliable. We went through several rounds of "thought we found it, but it wasn't that."

Step 1: Suspect gRPC Server Concurrency Handling

With 3,000 concurrent requests coming in, the first suspect was the gRPC server itself having a concurrency bottleneck — perhaps a full processing queue, a connection count limit, or one request holding things up and causing backlog.

We checked tonic's documentation and code: the tonic gRPC server has no hard limit on concurrent connections. The underlying HTTP/2 multiplexing can theoretically handle a large number of concurrent requests on a single TCP connection.

But "no limit" doesn't mean "no problem" — the real issue was lack of observability. Once a request arrived at the server, how long it took to process and its success rate were completely invisible.

So we added request-level latency tracking to the gRPC server's BaseMiddleware (Tower middleware layer): after each request completes, record its duration in the grpc_request_duration_ms Histogram, and also track request counts (grpc_requests) by service/method/status dimensions. We also added slow-request alerting — any request exceeding 2,000ms emits a warn log, making it easy to spot anomalies directly in the logs. The tonic Server also got HTTP/2 keepalive and a 30-second timeout configuration.

After this, gRPC Server QPS, error rates, and latency distributions were all visible in Grafana. The data showed that gRPC server processing latency itself was low, requests were responding normally, and there was no blocking or backlog.

gRPC server concurrency ruled out.

Step 2: Suspect the gRPC Connection Pool

cli-bench calls the Meta Service via gRPC. With 3,000 concurrent requests coming in, the next suspect was a client-side connection pool that was too small, causing requests to queue up before they were even sent.

Checking the logs revealed instances of transport error and connection counts stuck at 5 with no growth. Analysis showed the cause: the existing implementation was based on the mobc library, where Pool::builder() only configured max_open without max_lifetime or max_idle_lifetime, causing dead connections to never be recycled and the pool unable to grow.

This prompted a significant refactor: we removed mobc entirely and implemented a lightweight connection pool ourselves (ChannelPool) using Vec<Channel> + AtomicUsize with round-robin dispatch. tonic's Channel uses HTTP/2 multiplexing underneath, so unlike database connections, maintaining a large number of TCP connections isn't necessary — a few Channels per address with round-robin distribution is sufficient. We also added monitoring metrics for the connection pool to observe channel counts and acquisition latency.
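A minimal sketch of that round-robin dispatch, assuming any cloneable handle in place of tonic's `Channel` (which is itself `Clone`); the `RoundRobinPool` name and structure here are illustrative, not RobustMQ's actual `ChannelPool` code.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Round-robin pool: a fixed set of handles plus an atomic cursor.
// tonic's Channel multiplexes over HTTP/2, so a handful of channels
// per address is enough; we just rotate through them.
struct RoundRobinPool<T: Clone> {
    handles: Vec<T>,
    next: AtomicUsize,
}

impl<T: Clone> RoundRobinPool<T> {
    fn new(handles: Vec<T>) -> Self {
        assert!(!handles.is_empty());
        Self { handles, next: AtomicUsize::new(0) }
    }

    // Each call returns the next handle; fetch_add wraps on overflow
    // and the modulo keeps the index in range.
    fn get(&self) -> T {
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.handles.len();
        self.handles[i].clone()
    }
}

fn main() {
    let pool = RoundRobinPool::new(vec!["ch-0", "ch-1", "ch-2"]);
    let picks: Vec<_> = (0..6).map(|_| pool.get()).collect();
    assert_eq!(picks, vec!["ch-0", "ch-1", "ch-2", "ch-0", "ch-1", "ch-2"]);
}
```

Unlike mobc's checkout/return model, handles are never "borrowed": each `get` just clones a cheap multiplexed channel, so there is nothing to leak when a request dies.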

After the refactor, the connection pool behavior was normal, and transport errors disappeared from the stress test logs. But latency didn't change at all.

Connection pool ruled out.

Step 3: Suspect RocksDB Performance, Add RocksDB Monitoring

Nothing wrong with the connection pool, so we looked at the server side. Instinctively, RocksDB is the most likely culprit — write amplification, WAL flushing, and compaction can all cause jitter.

RobustMQ already had Histogram instrumentation on RocksDB save/get/list/delete operations (rocksdb_operation_ms), so we pulled up the RocksDB Write Latency and RocksDB Read Latency panels in Grafana.

During the stress test, RocksDB write latency was stable under 1ms, reads were normal, and there was no sign of compaction blocking or WAL delays.

RocksDB ruled out.

Step 4: Discover High Raft Write Latency

Since the storage layer was fine, we looked higher up. src/meta-service/src/raft/manager.rs already had latency tracking on client_write(), recorded into a Prometheus Histogram.

Pulling up the Raft Write Latency panel in Grafana: P50 was consistently at 89ms, P99 reached 243ms, P999 reached 460ms. The problem wasn't in network transmission — it was in the Raft write itself.

But Raft internals were still a black box — was consensus slow? Was state machine apply slow? Was something blocking somewhere? Unknown.

Step 5: Add System CPU and Memory Monitoring

We suspected that maxed-out machine resources were causing overall scheduling slowdowns. We added monitoring panels for process CPU usage, system CPU usage, process memory, and system memory.

The data during stress testing: system CPU hit 100%, while the Broker process accounted for only about 40% of it.

This gap was significant. The Broker itself was using only 40% of the CPU; about 55% was being consumed by the benchmarking tool, which ran on the same machine. The two processes were competing for CPU, meaning some of the latency we measured was environmental interference rather than the Broker's true bottleneck. It also meant future stress tests would need a separate load-generation machine to see the Broker's real limits. Memory showed no anomalies, ruling out memory pressure.

Step 6: Add Tokio Runtime Monitoring

CPU resources were being shared between two processes, but how was the Broker's 40% CPU distributed internally? Were the various subsystems interfering with each other?

We added monitoring panels for each Tokio Runtime's busy ratio, global queue depth, and alive task count.

The data revealed three curves with very different behavior: server-runtime busy ratio fluctuated between 40–80%, meta-runtime showed pronounced spikes when write load surged, and broker-runtime was near 0%.

This data confirmed two things: first, the isolation between the three runtimes was working — Raft's load hadn't spilled over into server-runtime; second, while server-runtime was busy, it wasn't saturated — the network processing layer was not the bottleneck.

Step 7: Observe the Correlation Between Apply Lag and Write Latency

We added three Gauge metrics to the monitoring: raft_apply_lag (the difference last_log_index - last_applied), raft_last_log_index, and raft_last_applied, to observe whether the state machine was falling behind.

Comparing the two graphs side by side was very revealing: Apply Lag spikes always preceded Write Latency increases. When the MQTT write rate surged to 6K req/s, the Lag first accumulated to 71, and only then did Write Latency climb from 300ms to over 1s.

This indicated that the state machine apply was unable to keep up, creating back-pressure that then slowed down the overall client_write latency.

Step 8: Attempt to Drill Deeper with CPU Flame Graphs

The project has pprof integration (src/common/pprof-monitor), which when enabled allows pulling flame graphs in real time via an HTTP endpoint to see CPU usage per function.

We didn't pursue this avenue further because a more direct approach was available — breaking down the call chain into timed segments.

Step 9: Time the client_write Chain in Segments — Find Those 199ms

client_write takes hundreds of milliseconds overall, but it was still unclear which segment was slowest. Breaking the chain down:

client_write()
  → RaftCore handles propose
  → Write Raft log (log append)    ← instrument here
  → commit advances
  → notify state machine to apply
  → state machine writes RocksDB (apply batch) ← instrument here
  → return

We added batch-level timing to both log.rs::append() and state.rs::apply(): just a single Instant::now() at the function entry and exit points, with results reported to Prometheus Histograms. Two new panels were added in Grafana: raft_log_append_batch_duration_ms and raft_apply_batch_duration_ms. No fine-grained per-entry statistics — only two clock reads on the hot path, with negligible overhead.
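The instrumentation pattern is small enough to sketch in full. Here `observe_ms` is a hypothetical stand-in for the Prometheus histogram call; the point is that only the two `Instant` reads around the batch touch the hot path.

```rust
use std::time::Instant;

// Stand-in for reporting to a histogram such as
// raft_apply_batch_duration_ms (hypothetical hook, not RobustMQ's API).
fn observe_ms(metric: &str, ms: f64) {
    println!("{metric}: {ms:.3} ms");
}

// Wrap a batch operation with exactly two clock reads.
fn timed_apply<F: FnOnce()>(apply_batch: F) -> f64 {
    let start = Instant::now();
    apply_batch();
    let ms = start.elapsed().as_secs_f64() * 1000.0;
    observe_ms("raft_apply_batch_duration_ms", ms);
    ms
}

fn main() {
    let ms = timed_apply(|| {
        // Simulate a small RocksDB WriteBatch.
        std::thread::sleep(std::time::Duration::from_millis(2));
    });
    assert!(ms >= 2.0);
}
```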

What the Data Reveals

With the two new panels' data in hand, the conclusion was crystal clear:

  • raft_log_append_batch_duration_ms P50: 0.5ms
  • raft_apply_batch_duration_ms P50: 0.5ms
  • raft_write_duration_ms (overall client_write) P50: 200ms, P99: 431ms, P999: 615ms

The two ends together add up to under 1ms, yet the end-to-end takes 200ms — a gap of 199ms.

This directly exonerates RocksDB. RocksDB writes were fine. The problem was in Raft's internal scheduling and waiting.

Where Did Those 199ms Go?

Once we understood OpenRaft's internals, the answer became clear.

OpenRaft's core is a single-threaded Actor (RaftCore), where all proposes, commits, and apply notifications are passed through async channels and processed serially by this single thread. This is a requirement for Raft algorithm correctness — the Leader must maintain a globally ordered view of all entries, and parallel processing would break this constraint.

When a client_write request comes in, its lifecycle roughly goes:

  1. Enter RaftCore's mailbox, wait to be scheduled
  2. RaftCore handles propose, writes log (0.5ms)
  3. Wait for commit confirmation
  4. Notify apply task via channel
  5. Apply task executes state machine write (0.5ms)
  6. Return result to caller via channel

Steps 2 and 5 combined take 1ms. The wait time in step 1 is negligible under low load, but at high QPS, as requests accumulate in the RaftCore mailbox, step 1's wait time becomes the dominant component of latency.

Using queueing theory to understand this: under the M/M/1 queue model,

Average response time = service time / (1 - utilization)

With apply as the serial service stage at 0.5ms per round, RaftCore's single-entry throughput tops out around 2,000 QPS — closer to 1,000 QPS if the 0.5ms log append sits on the same serial path. Under 3,000 concurrency the system actually runs at 9,000 ops/s, achieved by leveraging queue buildup to inflate batch sizes. At this point utilization (ρ) approaches 1, the denominator approaches 0, and latency grows without bound.

Even at ρ = 0.9 (not yet overloaded), the average response time is 0.5ms / 0.1 = 5ms. At ρ = 0.99 it's 50ms. Queue effects are extremely sensitive to high utilization — this has nothing to do with OpenRaft itself; it's the universal behavior of single-threaded processors facing high concurrency.
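Evaluating the M/M/1 formula at a few utilization levels makes that sensitivity concrete:

```rust
// M/M/1: average response time W = S / (1 - rho),
// where S is the service time and rho is utilization.
fn mm1_response_ms(service_ms: f64, rho: f64) -> f64 {
    assert!(rho < 1.0, "at rho >= 1 the queue grows without bound");
    service_ms / (1.0 - rho)
}

fn main() {
    // A 0.5ms service time stays 0.5ms only while the queue is empty.
    for rho in [0.5, 0.9, 0.99, 0.999] {
        println!("rho = {rho:>5}: W = {:.1} ms", mm1_response_ms(0.5, rho));
    }
    assert!((mm1_response_ms(0.5, 0.9) - 5.0).abs() < 1e-9);
    assert!((mm1_response_ms(0.5, 0.99) - 50.0).abs() < 1e-9);
}
```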

Is the Raft State Machine Really Limited to Just 2,000 QPS?

At this point, a natural reaction is: with a single Raft group plus RocksDB, is performance really limited to 2,000 QPS? That seems extremely low.

This conclusion has an implicit premise that deserves unpacking.

2,000 QPS is the limit of "processing one entry per round," not Raft's upper limit

RaftCore is a single-threaded Actor with one processing cycle taking about 1ms (0.5ms log append + 0.5ms apply). But OpenRaft natively supports batching — when multiple requests are backed up in the mailbox simultaneously, RaftCore merges them into a single batch, writing to RocksDB with one WriteBatch, and processing all entries with one apply() call.

The actual throughput formula is:

Throughput = (rounds per second) × (average batch size per round)
           = 1000 rounds/sec × N entries/round

N=1 gives 1,000 QPS (the heading's 2,000 figure counts only the 0.5ms apply, assuming log append is pipelined); N=50 gives 50,000 QPS. The batch size per round is the key variable determining throughput.

This also explains where our measured 9,000 ops/s came from: once 3,000 concurrent requests filled the mailbox, each round could process multiple entries, pushing throughput well beyond the single-round limit — at the cost of 360ms of high latency. The system was "borrowing" throughput by piling up concurrency to inflate batch sizes, but the cost of borrowing was queuing latency.
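Plugging numbers into the throughput formula also lets us invert it: the measured 9,000 ops/s at roughly 1,000 rounds/sec implies an average batch of about 9 entries per round.

```rust
// Throughput = rounds/sec * entries per round.
fn throughput(rounds_per_sec: f64, batch_size: f64) -> f64 {
    rounds_per_sec * batch_size
}

// Inverting it: what batch size does a measured throughput imply?
fn implied_batch(ops_per_sec: f64, rounds_per_sec: f64) -> f64 {
    ops_per_sec / rounds_per_sec
}

fn main() {
    assert_eq!(throughput(1000.0, 1.0), 1_000.0);   // N = 1
    assert_eq!(throughput(1000.0, 50.0), 50_000.0); // N = 50
    // The measured 9,000 ops/s implies ~9 entries merged per round.
    assert_eq!(implied_batch(9000.0, 1000.0), 9.0);
}
```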

RocksDB itself is not the bottleneck

On NVMe SSDs, RocksDB WriteBatch throughput easily exceeds 100k writes/sec. The true slow path is WAL fdatasync — Raft's persistence semantics require syncing each batch to disk after every write. On NVMe, fdatasync takes about 0.1–0.5ms; on HDD it's about 5–10ms. The larger the batch, the lower the amortized cost of fdatasync, and the higher the throughput.
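The amortization effect is easy to quantify: with one fdatasync per batch, the per-entry sync cost falls linearly with batch size. The figures below are the rough NVMe/HDD numbers from the text, not measurements.

```rust
// One fdatasync per batch: the sync cost is shared by every entry.
fn amortized_sync_us(fsync_ms: f64, batch_size: u32) -> f64 {
    fsync_ms * 1000.0 / batch_size as f64
}

fn main() {
    // NVMe, taking the upper end of the 0.1-0.5ms range:
    assert_eq!(amortized_sync_us(0.5, 1), 500.0); // 500 us/entry
    assert_eq!(amortized_sync_us(0.5, 50), 10.0); // 10 us/entry
    // HDD (~10ms per fdatasync): batching matters even more.
    assert_eq!(amortized_sync_us(10.0, 50), 200.0);
}
```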

etcd's recommendation of 1,000 QPS is a conservative stability guideline

etcd's documentation recommends production write rates not exceed 1,000 QPS — not a hard limit, but to leave plenty of headroom for compaction, snapshots, and follower synchronization to avoid long-tail jitter. In lab environments, etcd on a single node can reach 10k–30k ops/s.

So where does our 2,000 QPS limit come from?

It's a constraint of the current implementation, not a universal constraint of Raft or RocksDB. Currently, each RobustMQ client_write writes only one entry, so at low concurrency the batch size approaches 1, naturally hitting the single-round limit. The fix is straightforward: application-layer batch merging — pack multiple write operations within the same client connection lifecycle into a single client_write, increasing batch size from 1 to 10–50. This offers an order-of-magnitude improvement in throughput potential while also reducing latency, since each request waits for fewer rounds.
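One way to sketch the merge, assuming a std mpsc queue feeding a collector task: drain whatever is pending (up to a cap) and submit it as one logical client_write. `drain_batch` and the operation strings are illustrative, not RobustMQ code.

```rust
use std::sync::mpsc;

// Block for the first pending operation, then greedily drain up to
// `max_batch` more without blocking. The resulting Vec is what would
// be packed into a single client_write.
fn drain_batch(rx: &mpsc::Receiver<String>, max_batch: usize) -> Vec<String> {
    let mut batch = Vec::new();
    if let Ok(first) = rx.recv() {
        batch.push(first);
        while batch.len() < max_batch {
            match rx.try_recv() {
                Ok(op) => batch.push(op),
                Err(_) => break, // nothing else queued right now
            }
        }
    }
    batch
}

fn main() {
    let (tx, rx) = mpsc::channel();
    for i in 0..7 {
        tx.send(format!("set-session-{i}")).unwrap();
    }
    // One Raft round now carries 5 entries instead of 1.
    let batch = drain_batch(&rx, 5);
    assert_eq!(batch.len(), 5);
    let rest = drain_batch(&rx, 5);
    assert_eq!(rest.len(), 2);
}
```

Under light load the batch degenerates to size 1 and adds no latency; under heavy load it grows naturally, which is exactly the regime where merging helps.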

The Gap vs. Official Benchmarks

With this analysis, the original confusion can now be fully explained.

The OpenRaft official benchmark uses an in-memory state machine, with apply operations taking near-zero time. Under that premise, the theoretical maximum throughput approaches a million requests per second, utilization stays very low, queuing effects are nearly zero, and latency is naturally at the microsecond level.

Our implementation uses a RocksDB state machine, where each batch apply takes about 0.5ms — but that 0.5ms is a batch cost shared across multiple entries. The real difference is this: an in-memory apply completes in about a microsecond, while a disk fdatasync takes about 0.5ms — roughly a 500x difference, and it determines the ceiling for per-round processing capacity.

The official benchmark measures OpenRaft's protocol layer overhead in isolation, not performance after integrating real storage. The two are not testing the same dimension, and direct comparison is meaningless.

Supplement: Tokio Runtime Isolation and Monitoring

Tokio Runtime monitoring data was an important piece of the puzzle in this investigation. It's worth discussing separately, as it has general value for performance analysis in large Rust services.

Why Runtime Isolation Matters

RobustMQ is a large service process containing multiple subsystems: the gRPC Server, the OpenRaft state machine, and MQTT Broker business logic all run in the same process. When OpenRaft starts, it creates about 9 internal tasks via tokio::spawn (core loop, log IO, state machine worker, etc.), and these tasks are bound to the runtime context in which Raft::new() is called. If Raft is initialized on server_runtime, those 9 tasks will be mixed with gRPC request handling tasks, competing for worker threads.

RobustMQ's approach is to split into three independent runtimes:

server-runtime  → gRPC server, network I/O, request dispatch
meta-runtime    → all OpenRaft internal tasks, RocksDB metadata writes
broker-runtime  → MQTT business logic, message processing

During initialization, Raft::new() is explicitly called within the context of meta_runtime, ensuring all Raft internal tasks are bound to meta_runtime and won't pollute server_runtime:

```rust
let meta_runtime = create_runtime("meta-runtime", meta_worker_threads);
let meta_params = meta_runtime.block_on(BrokerServer::build_meta_service(...));
```

What to Monitor

Tokio provides the tokio::runtime::Handle::metrics() API. Based on this, we implemented start_runtime_monitor, which samples every 15 seconds and reports three metrics:

  • Busy Ratio: By comparing the difference in worker_total_busy_duration between adjacent samples, calculates the actual busy proportion of worker threads. 0% is completely idle, 100% means all threads are always running
  • Global Queue Depth: How many tasks are backed up in the global queue awaiting execution — an early signal of runtime overload
  • Alive Tasks: Current number of live tasks; abnormal growth indicates uncontrolled spawning somewhere
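The busy-ratio arithmetic is a delta between two samples. A sketch with plain `Duration` values standing in for tokio's cumulative `worker_total_busy_duration` readings:

```rust
use std::time::Duration;

// Busy ratio between two samples: the share of available worker-thread
// time actually spent running tasks.
// capacity = sample interval * number of worker threads.
fn busy_ratio(prev_busy: Duration, curr_busy: Duration,
              interval: Duration, workers: u32) -> f64 {
    let busy = (curr_busy - prev_busy).as_secs_f64();
    let capacity = interval.as_secs_f64() * workers as f64;
    busy / capacity
}

fn main() {
    // 4 workers, 15s sample window, 36s of cumulative busy time added:
    // 36 / (15 * 4) = 60% busy.
    let r = busy_ratio(
        Duration::from_secs(100),
        Duration::from_secs(136),
        Duration::from_secs(15),
        4,
    );
    assert!((r - 0.6).abs() < 1e-9);
}
```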

What the Data Showed

During stress testing, the three curves behaved very differently: server-runtime busy ratio was 40–80%, meta-runtime had pronounced spikes when write load surged, and broker-runtime was near 0%. This confirmed two things: isolation was working and Raft's load hadn't spilled into server-runtime; server-runtime was busy but not saturated, so the network processing layer was not the bottleneck.

Without runtime isolation, the three curves would merge into one, making it impossible to tell which subsystem was consuming resources.

Conclusion

What the Problem Was

Stress testing Meta Service session writes with 3,000 concurrency: throughput could reach 9,000 ops/s, but p95 latency was 360ms and continued rising over time — two orders of magnitude worse than OpenRaft's official benchmark claim of microsecond-level latency.

What Was Investigated, What Was Ruled Out

We successively added gRPC Server request metrics, gRPC connection pool metrics, RocksDB operation latency metrics, process/system CPU and memory metrics, Tokio Runtime busy ratio metrics, and Raft apply lag metrics, and finally added batch-level timing at log.rs::append() and state.rs::apply().

Elimination chain: gRPC Server normal → connection pool rewritten without mobc and normal → RocksDB writes < 1ms normal → CPU/memory not the main cause (benchmark tool used 55% CPU) → Tokio Runtime not saturated → Apply Lag confirmed backlog exists → Segment timing found the 199ms in queuing.

Root Cause

Log append measured at 0.5ms, apply batch measured at 0.5ms, but client_write end-to-end is 200ms — a gap of 199ms. Those 199ms are spent queuing in RaftCore's mailbox. RaftCore is a single-threaded Actor; with about 0.5ms of apply per round, single-entry throughput is capped at roughly 2,000 QPS. Under 3,000 concurrency the system runs near saturation, and queuing effects amplify a 0.5ms service time into hundreds of milliseconds.

Performance Boundaries and Next Steps

2,000 QPS is the single-round limit when "each client_write writes one entry, batch size = 1" — not Raft's upper limit. The measured 9,000 ops/s is the result of RaftCore batch-processing multiple entries, at the cost of 360ms+ queuing latency. There are two improvement directions with different effects:

Application-layer batch writes are a symptomatic fix: merging multiple MQTT operations into one client_write, increasing batch size from 1 to 10–50. Throughput improves and latency will improve as well, but a single RaftCore remains a serial bottleneck with a ceiling.

Multi-Raft sharding is a fundamental fix: distributing session writes across multiple independent Raft groups by hashing ClientID, where each group has its own RaftCore and they operate fully in parallel. 10 groups means 10x throughput, and latency can stay low because the queue pressure on each group is reduced. This is true horizontal scaling.
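The routing step itself is a plain hash-mod. A sketch using std's `DefaultHasher` (a real implementation would want a hash that is stable across versions and nodes):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Route a session write to one of `num_groups` Raft groups by hashing
// the MQTT ClientID. Writes for the same client always land on the
// same group, preserving per-client ordering while letting groups
// run their RaftCores in parallel.
fn raft_group_for(client_id: &str, num_groups: u64) -> u64 {
    let mut h = DefaultHasher::new();
    client_id.hash(&mut h);
    h.finish() % num_groups
}

fn main() {
    let g1 = raft_group_for("client-42", 10);
    let g2 = raft_group_for("client-42", 10);
    assert_eq!(g1, g2); // deterministic routing
    assert!(g1 < 10);
}
```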

RobustMQ's architecture already includes a Multi-Raft design — metadata, MQTT state, and offsets each have their own independent Raft groups, and different data types are already isolated. But currently all MQTT session writes still hit the same group, which remains a serial bottleneck with a single RaftCore.

The next step is to put this design to real use: for high-frequency write data types like sessions, shard by ClientID hash across multiple groups so that multiple RaftCores can process in parallel. This is more thorough than batch writes, and it's also the standard path taken by production-grade distributed systems like TiKV and CockroachDB to break through single-Raft bottlenecks.

Short term: application-layer merge writes, reducing the number of processing rounds per group. Medium term: identify which operations don't require strong consistency and move them off the Raft path. Long term: shard high-frequency writes like sessions across multiple groups by ClientID, and truly put the Multi-Raft design into action.

Project URL: https://github.com/robustmq/robustmq

🎉 Since you're already signed in to GitHub, why not give us a Star while you're at it! ⭐ Your support is our greatest motivation 🚀