
RobustMQ Raft State Machine Performance Breakthrough: Batch Semantics + Concurrency Governance, from 20k to 140k ops/s

The conclusion from the previous article, Raft State Machine Continued Optimization: Re-Validating Groups, Heartbeat Clocks, and Runtime Threads, was: in the current single-machine environment, data_raft_group_num=4 is a negative optimization; =1 combined with conservative clock parameters can stabilize around 20k ops/s. The open question at the end was: why do more Raft nodes not improve performance — they actually regress? What's next?

This article documents the analysis process that followed, and the eventual breakthrough. The conclusion first: the bottleneck is not in the number of Raft groups, but in request granularity and concurrency depth. By changing CreateSession to batch semantics and reducing client concurrency from thousands to 20, single Raft group throughput jumped directly from 20k ops/s to 140k ops/s, with 100% success rate and P95 latency < 24ms.


Back to the Core Question: Where Is the Bottleneck?

The previous round's conclusion was that raft_write_latency reached the level of seconds while log append and apply stayed at the millisecond level, and we attributed the cause to "entry queuing and system-wide waiting." That judgment was directionally correct but not precise enough: it answered "time is spent waiting," but not "why the wait is so long."

This round we focused our attention on OpenRaft's internal implementation. After reading openraft-0.9.21/src/core/raft_core.rs, we discovered a key fact:

RaftCore::append_to_log does not simply enqueue the log entry and return: internally it awaits (rx.await) the log-flush callback. This means every single proposal blocks the Raft core event loop until the disk write completes.

This means the Raft core's proposal processing capacity has a hard ceiling: approximately 33–50 proposals per second (depending on the latency of a single log flush). When external concurrency far exceeds this rate, requests all pile up in the rx_api channel queue. The deeper the queue, the longer tail requests wait, ultimately manifesting as second-level end-to-end latency and widespread client timeouts.
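The ceiling and the queuing effect can be sketched with a toy model. This is illustrative arithmetic, not OpenRaft's actual code; `flush_ms` is a stand-in for the measured log-flush round-trip:

```rust
/// Toy model of a serial Raft core event loop: each proposal parks the
/// loop for `flush_ms` (the log-flush round-trip) before the next one
/// can start. Virtual time only; this is NOT OpenRaft's actual code.
/// Returns the hard throughput ceiling in proposals per second.
fn proposal_ceiling(flush_ms: u64) -> u64 {
    1000 / flush_ms
}

/// Wait time in ms for the request at the tail of a queue of `depth`
/// proposals, since every proposal ahead of it must be flushed serially.
fn tail_wait_ms(depth: u64, flush_ms: u64) -> u64 {
    depth * flush_ms
}

fn main() {
    // 20-30ms per flush round-trip => roughly 33-50 proposals/s.
    assert_eq!(proposal_ceiling(25), 40);
    // 2000 queued requests at 25ms each: the tail waits 50 seconds,
    // far past any reasonable client timeout.
    assert_eq!(tail_wait_ms(2000, 25), 50_000);
    println!("ceiling ~{} proposals/s", proposal_ceiling(25));
}
```

The second function is exactly why second-level end-to-end latency appears even though each individual flush takes only milliseconds.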

This discovery explains all the unexplained phenomena from the previous article:

  • Why were log append and apply fast, but raft_write_latency was high? Because latency is not execution time — it's queuing time plus execution time. Execution itself only takes milliseconds, but a request at the end of the queue might wait one or two seconds before its turn.
  • Why did multiple groups make things worse? 4 groups means 4 sets of Raft background tasks competing for scheduling time slices in the same runtime; tokio's scheduling latency gets stretched, and each group's effective proposal processing rate is actually lower.
  • Why didn't CPU saturate, yet performance couldn't improve? Because the bottleneck isn't CPU compute — it's serial waiting in the Raft core event loop. The core thread is idle on rx.await, and no number of extra cores helps.

The conclusion is clear: the real lever is not "add more Raft instances," but "reduce the number of proposals each instance needs to handle" and "control queue depth."


Optimization 1: Full-Pipeline Batch Semantics for CreateSession

Since a single Raft group can only process about 40 proposals per second, letting each proposal carry more business data is the most direct amplifier.

Proto Layer: From Single to Batch

The original CreateSessionRequest had single-item semantics — one request creates one Session. After the refactor, we introduced CreateSessionRaw, allowing one request to carry a batch of Sessions:

```protobuf
message CreateSessionRaw {
  string client_id = 1;
  bytes session = 2;
}

message CreateSessionRequest {
  repeated CreateSessionRaw sessions = 1;
}
```

This way, one gRPC call, one Raft proposal, and one state machine apply can handle N Sessions. With batch_size=100, 40 proposals/s equals 4,000 Sessions/s — and that's just the theoretical lower bound for a single group.
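The amplifier is plain arithmetic. A minimal sketch, using the article's own estimates (40 proposals/s, batch_size=100); the function names are illustrative, not project APIs:

```rust
/// How many Raft proposals a workload of `total` sessions costs at a
/// given batch size. Plain arithmetic; numbers in main() are the
/// article's estimates, not measurements from this code.
fn proposals_needed(total: u64, batch_size: u64) -> u64 {
    total.div_ceil(batch_size)
}

/// Session throughput yielded by a fixed proposal rate.
fn sessions_per_s(proposal_rate: u64, batch_size: u64) -> u64 {
    proposal_rate * batch_size
}

fn main() {
    // Single-item semantics: 10,000 sessions cost 10,000 proposals.
    assert_eq!(proposals_needed(10_000, 1), 10_000);
    // batch_size=100: the same workload is only 100 proposals.
    assert_eq!(proposals_needed(10_000, 100), 100);
    // 40 proposals/s x 100 sessions each = 4,000 sessions/s floor.
    assert_eq!(sessions_per_s(40, 100), 4_000);
}
```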

State Machine Apply Layer: RocksDB WriteBatch

Previously, the apply phase performed a separate rocksdb.put_cf() for each Session. N Sessions meant N WAL fsyncs, each going through the complete "serialize → get CF handle → write disk" path.

After the refactor, we added an engine_batch_save interface at the rocksdb-engine layer, using RocksDB's native WriteBatch:

```rust
pub fn engine_batch_save<T: Serialize>(
    engine: &Arc<RocksDBEngine>,
    column_family: &str,
    source: &str,
    entries: &[(String, &T)],
) -> Result<(), CommonError> {
    let cf = get_cf_handle(engine, column_family)?;
    let mut batch = rocksdb::WriteBatch::default();
    for (key, value) in entries {
        let wrap = StorageDataWrap::new(value);
        let serialized = serialize(&wrap)?;
        batch.put_cf(&cf, key.as_bytes(), &serialized);
    }
    engine.write_batch(batch)
}
```

All put operations within a single WriteBatch trigger only one WAL fsync, with atomicity guaranteed by RocksDB. 100 Sessions go from 100 fsyncs to 1 — reducing the I/O overhead in the apply phase to 1/100 of the original.
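The fsync accounting can be demonstrated with a hypothetical in-memory engine that only counts commits. This mock is illustrative and is not the real rocksdb-engine API:

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory stand-in for the storage engine that only
/// counts how many "fsyncs" (atomic commits) a workload triggers.
#[derive(Default)]
struct MockEngine {
    data: BTreeMap<String, Vec<u8>>,
    fsyncs: u32,
}

impl MockEngine {
    /// One put == one WAL fsync, as in the pre-refactor apply path.
    fn put(&mut self, key: String, value: Vec<u8>) {
        self.data.insert(key, value);
        self.fsyncs += 1;
    }

    /// A whole batch commits atomically with a single fsync, mirroring
    /// RocksDB WriteBatch semantics.
    fn write_batch(&mut self, entries: Vec<(String, Vec<u8>)>) {
        for (k, v) in entries {
            self.data.insert(k, v);
        }
        self.fsyncs += 1;
    }
}

/// Returns (fsyncs with per-item puts, fsyncs with one WriteBatch).
fn fsync_cost(n: usize) -> (u32, u32) {
    let entries: Vec<(String, Vec<u8>)> =
        (0..n).map(|i| (format!("session/{i}"), vec![0u8])).collect();

    let mut per_item = MockEngine::default();
    for (k, v) in entries.clone() {
        per_item.put(k, v);
    }

    let mut batched = MockEngine::default();
    batched.write_batch(entries);

    (per_item.fsyncs, batched.fsyncs)
}

fn main() {
    // 100 sessions: 100 fsyncs per-item vs 1 fsync batched.
    assert_eq!(fsync_cost(100), (100, 1));
}
```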

Effect Pipeline

The write pipeline after the refactor:

```text
cli-bench (100 sessions) → 1 gRPC call → 1 Raft proposal → 1 log append
→ 1 apply → 1 RocksDB WriteBatch (100 puts, 1 fsync) → return
```

Compared to before the refactor (100 independent requests → 100 proposals → 100 fsyncs), the Raft core's burden is reduced by two orders of magnitude.


Optimization 2: Concurrency Governance — Less Is More

Batch semantics solved the problem of "each proposal doing too little work," but there's another equally important dimension: queue depth.

Why Excessive Concurrency Is Harmful

Previous rounds of benchmarking consistently used --concurrency 2000 or even 5000. Intuitively, high concurrency means high throughput. But for Raft write pipelines, this intuition is wrong.

Raft core is a strictly serial processor: one proposal in → log flush → wait for callback → process the next one. No matter how many concurrent requests are waiting outside, its processing pace doesn't speed up because "there are more people in line." On the contrary, the deeper the queue, the larger the side effects:

  1. Scheduling contention intensifies. 2,000 tokio tasks are simultaneously in an await state. Each time Raft core completes a proposal, the runtime must select the next one to wake from a large number of wakers. This scheduling overhead grows with concurrency.
  2. Tail latency worsens. Latency for a request at the tail of the queue ≈ queue depth × per-proposal processing time. At concurrency 2000 with ~25ms per proposal, the last request may wait 2000 × 25 ms = 50 seconds.
  3. Timeouts trigger chain reactions. The client has a 3-second timeout, but the request spent 3 seconds waiting in queue and still hasn't been processed — so the client cancels the request and closes the connection. When the Raft server eventually gets to processing it, it finds the oneshot receiver has already been dropped, triggering OneshotConsumer.tx.send: is_ok: false warnings. These "ghost proposals" consume Raft processing capacity without producing any useful results.

What Happened When We Dropped to 20 Concurrency

After reducing --concurrency from thousands to 20:

  • Queue depth dropped from thousands to no more than 20. Each proposal that comes in is processed almost immediately without queuing.
  • Scheduling overhead dropped dramatically. The tokio runtime only needs to manage 20 active tasks; context switches are greatly reduced.
  • Zero timeouts. All requests complete in milliseconds, with no "ghost proposals" wasting Raft processing capacity.
  • Raft core utilization is actually higher. Because there's no scheduling jitter and timeout interference, the core can advance proposals at a more stable pace.

This is the classic queueing theory phenomenon (Little's Law): at a fixed service rate, reducing queue depth does not reduce throughput, but dramatically reduces latency. And reduced latency eliminates timeouts and retries, further improving effective throughput.
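A small arithmetic sketch of this trade-off, using the article's ~3ms post-batching per-proposal round-trip and its 3-second client timeout; the function names are illustrative:

```rust
/// At a fixed service rate, throughput is set by the server, not by how
/// many clients are waiting; extra queue depth only buys latency
/// (Little's Law: L = lambda * W). Arithmetic sketch, not a simulation.
fn throughput_per_s(service_ms: u64) -> u64 {
    1000 / service_ms
}

fn tail_latency_ms(queue_depth: u64, service_ms: u64) -> u64 {
    queue_depth * service_ms
}

fn main() {
    let service_ms = 3; // ~3ms per-proposal round-trip after batching
    // Same throughput at depth 20 and depth 2000 ...
    assert_eq!(throughput_per_s(service_ms), 333);
    // ... but very different tail latency: 60ms vs 6 seconds.
    assert_eq!(tail_latency_ms(20, service_ms), 60);
    assert_eq!(tail_latency_ms(2000, service_ms), 6_000);
    // With a 3s client timeout, depth 2000 guarantees ghost proposals;
    // depth 20 never comes close.
    assert!(tail_latency_ms(2000, service_ms) > 3_000);
    assert!(tail_latency_ms(20, service_ms) < 3_000);
}
```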


Benchmark Results

Test Environment

  • Apple Silicon, 14 logical CPUs (10P + 4E)
  • Local service + local benchmark (single-machine all-in-one)
  • data_raft_group_num = 1, Raft clock 100/1000/2000, runtime threads 16
  • Debug build (non-release)

Benchmark Command

```bash
cargo run --package cmd --bin cli-bench meta placement-create-session \
  --host 127.0.0.1 \
  --port 1228 \
  --count 10000000 \
  --concurrency 20 \
  --batch-size 100 \
  --timeout-ms 60000 \
  --output table
```

Key Metrics

| Metric | Value |
|---|---|
| Total Sessions | 10,000,000 |
| Total Duration | 69 seconds |
| Average Throughput | 144,927 sessions/s |
| Peak Throughput | 179,000 sessions/s |
| Success Rate | 100% |
| Timeouts | 0 |
| Latency P50 | 11.7 ms |
| Latency P95 | 23.9 ms |
| Latency P99 | 57.3 ms |
| Latency min | 1.6 ms |

Comparison with Previous Round

| Dimension | Previous Round (Single + High Concurrency) | This Round (Batch + Low Concurrency) |
|---|---|---|
| batch_size | 1 | 100 |
| concurrency | 2000~5000 | 20 |
| Throughput (ops/s) | ~20,000 | ~145,000 |
| P95 Latency | hundreds of ms to seconds | 24 ms |
| Timeout Rate | continuously accumulating | 0 |
| Success Rate | < 100% | 100% |

Throughput improved approximately 7x, latency reduced by approximately two orders of magnitude, and timeouts went from "continuously accumulating" to zero.


Grafana Monitoring Validation

From the Raft metrics panels, we can see that the state machine after optimization is running very healthily:

  • Raft Write QPS: data_0 stable at ~340 req/s (i.e., ~340 proposals/s × 100 sessions = ~34,000 sessions/s Raft-layer throughput)
  • Raft Write Success Rate: perfectly matching the request volume, no failures
  • Raft Write Failure Rate: 0 failures across the entire pipeline
  • Log Append Latency P50: ~0.5ms
  • Apply Batch Latency P50: ~0.5ms
  • Raft Write Latency: millisecond-level, no more second-level spikes

This data shows that all components within Raft are operating in comfortable ranges: log flush is fast, apply is fast, no queuing backlog. The latency of the entire pipeline is now the true overhead of "network + serialization + log flush + apply" — with no additional waiting amplification.


Looking Back at the Previous Article's Unresolved Issue: Why Does Multi-Group Always Regress?

The core confusion in the previous article was "why do multiple Raft nodes regress?" At the time, we attributed it to "single-machine shared resources where coordination costs exceed parallel gains." This attribution direction was correct, but not precise enough.

In this round, with batch semantics already in place, we did another controlled experiment: changing data_raft_group_num from 1 to 4 with all other parameters unchanged (concurrency=20, batch_size=100). The result was still dramatically degraded performance. This rules out the possibility that "the previous round's multi-group regression was caused by single-item semantics + high concurrency" — even with Batch + low concurrency, multiple groups still perform worse than a single group.

After tracing through the code, we ultimately pinpointed the regression cause to OpenRaft's per-proposal async scheduling model, not RocksDB itself.

First, let's rule out an easy-to-misjudge direction: all Raft groups share the same Arc<DB>, but Grafana shows that Log Append Latency is only ~0.5ms, meaning a single db.write() in RocksDB is very fast — theoretically handling ~2,000 writes per second. Even if 4 groups write simultaneously (4 × 340 ≈ 1,360 times/second), RocksDB can handle it comfortably. RocksDB is not the bottleneck.

The true bottleneck is: each proposal in OpenRaft is not just a db.write() — it's a tokio async round-trip:

```text
RaftCore (task A) → sends entries to log store → yields, waits for callback (rx.await)
    ↓ tokio scheduler switches
LogStore worker (task B) → db.write(batch), 0.5ms → callback.log_io_completed()
    ↓ tokio scheduler switches
RaftCore (task A) → woken up → continues run_engine_commands → processes next proposal
```

With a single group, this round-trip involves 2 tokio task switches, with a measured total latency of ~3ms (0.5ms disk + the rest is all scheduling overhead). So 1 group can stably run at ~340 proposals/sec.

With 4 groups, the situation is completely different:

  • 4 RaftCore tasks + 4 log worker tasks + 4 SM worker tasks + 4 sets of heartbeat timers + 4 sets of election timers ≈ 16+ active tasks being scheduled on the same tokio runtime
  • Each task switch takes longer — not because there isn't enough CPU, but because the tokio scheduler has to select the next waker from a larger set, making the scheduling queue deeper
  • Each group's per-proposal round-trip grows from ~3ms to ~8–10ms; the effective proposal rate for each single group noticeably drops
  • 4 sets of background heartbeats and Balancer continuously insert high-frequency small tasks, further disrupting the scheduling rhythm of foreground proposals

The final effect: the total throughput of 4 groups not only fails to reach 4x, but is actually lower than a single group. Because the increase in per-proposal latency (caused by scheduling contention) exceeds the theoretical gain from group parallelism.

```text
data_raft_group_num = 1:
  1 RaftCore, ~3ms per proposal → ~340 proposals/s → 34,000 sessions/s (batch=100)

data_raft_group_num = 4:
  4 RaftCores, ~8-10ms per proposal → ~100-125 proposals/s per group
  Total ~400-500 proposals/s? Not necessarily — 4 sets of background overhead further reduce this
  Actual observation: total throughput noticeably below single group
```

Core conclusion: every OpenRaft proposal must go through a "core → log worker → core" async round-trip. The latency of this round-trip is primarily determined by tokio scheduling latency, not disk I/O. When multiple groups run on the same tokio runtime, scheduling contention amplifies each proposal's round-trip latency, causing total throughput to decrease rather than increase.
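The single-group vs multi-group numbers above reduce to simple arithmetic. A sketch using the article's measured/estimated latencies; note this model captures only the round-trip stretch, not the background heartbeat and Balancer overhead that pushes the observed 4-group total even lower:

```rust
/// Per-group proposal rate for a given per-proposal async round-trip.
fn group_proposals_per_s(round_trip_ms: u64) -> u64 {
    1000 / round_trip_ms
}

/// Aggregate session throughput for `groups` Raft groups. Uses the
/// article's measured/estimated latencies; background-task overhead is
/// deliberately not modeled here.
fn total_sessions_per_s(groups: u64, round_trip_ms: u64, batch: u64) -> u64 {
    groups * group_proposals_per_s(round_trip_ms) * batch
}

fn main() {
    // 1 group, ~3ms round-trip, batch=100.
    let single = total_sessions_per_s(1, 3, 100);
    // 4 groups, scheduling contention stretches the round-trip to ~10ms.
    let quad = total_sessions_per_s(4, 10, 100);
    assert_eq!(single, 33_300);
    assert_eq!(quad, 40_000);
    // Nowhere near the naive 4x; per the article, the unmodeled
    // heartbeat/Balancer overhead drops the observed total below `single`.
    assert!(quad < 4 * single);
}
```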

The fundamental issue is not RocksDB, not a bug in OpenRaft, but the deployment topology of "multiple serial state machines sharing the same async runtime" having an inherent scheduling ceiling. To break through this ceiling, the direction is to run each group on an independent runtime (independent thread pool), or to improve OpenRaft's log flush path to reduce the number of async round-trips. Under the current architecture, data_raft_group_num=1 combined with batch semantics is the optimal solution.


Summary of This Round's Improvements

This round did two things — neither complex, but with direct impact:

First, batch semantics. Changed CreateSession from single-item to batch: one Raft proposal handles N Sessions. No changes needed to the Raft layer; only the gRPC request and state machine apply logic needed to become batch-aware. The storage layer replaced per-item puts with a RocksDB WriteBatch — N fsyncs became 1. This is vertical optimization — each proposal does more useful work.

Second, concurrency governance. Reduced client concurrency from thousands to 20. Not "reducing pressure," but "eliminating useless queuing." The Raft core's processing rate is fixed; excess concurrency only piles up queuing, producing timeouts and scheduling overhead that actually reduce effective throughput. Controlling concurrency = controlling queue depth = eliminating tail latency = eliminating timeouts = increasing effective throughput. This is horizontal optimization — operating the system at its optimal working point.

The combination of both brought a single Raft group on a single machine in debug mode to 140k sessions/s. This number still has room to grow: release build, multi-machine deployment to eliminate same-machine contention, and moderately increasing the number of groups. But the current baseline is already sufficient to demonstrate that the direction is correct.


Next Steps

The combination of Batch + concurrency governance has brought the placement-create-session pipeline to a healthy baseline. Follow-up work proceeds along three lines:

First, extend batch semantics to other high-frequency write paths (Subscribe, Offset, etc.), using the same approach to reduce Raft proposal pressure.

Second, evaluate the feasibility of multi-Raft-group independent runtimes. Currently all groups share the same tokio runtime, and scheduling contention is the root cause of multi-group regression. If each group runs on an independent runtime (independent thread pool), per-proposal async round-trip latency would not be amplified by increasing group count, and the parallel gains from multiple groups could truly be realized. This change requires evaluating the trade-off between thread resource overhead and architectural complexity.
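The topology being evaluated can be sketched with plain threads standing in for independent tokio runtimes: each group gets a dedicated worker that drains its own proposal channel, so one group's round-trips never enter another group's scheduler. A hypothetical structure, not RobustMQ's actual code:

```rust
use std::sync::mpsc;
use std::thread;

/// One dedicated worker per Raft group (a plain thread stands in for an
/// independent tokio runtime). Each group applies its proposals serially
/// on its own thread, isolated from the other groups' scheduling.
/// Returns the total number of proposals applied across all groups.
fn run_isolated_groups(groups: u64, proposals_per_group: u64) -> u64 {
    let mut handles = Vec::new();
    for _group in 0..groups {
        let (tx, rx) = mpsc::channel::<u64>();
        // Dedicated worker: drains this group's channel until it closes.
        let handle = thread::spawn(move || rx.iter().count() as u64);
        for p in 0..proposals_per_group {
            tx.send(p).unwrap();
        }
        drop(tx); // close the channel so the worker finishes
        handles.push(handle);
    }
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    // 4 isolated groups each drain their own 1000 proposals.
    assert_eq!(run_isolated_groups(4, 1000), 4000);
}
```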

Third, continue optimizing per-proposal processing efficiency on the single-group baseline. The current debug mode already hits 140k sessions/s; release build, multi-machine deployment, and fine-tuning OpenRaft configuration (such as max_payload_entries) still have room for further improvement.

Project URL: https://github.com/robustmq/robustmq

🎉 Since you're already logged in to GitHub, why not give us a Star while you're at it! ⭐ Your support is our greatest motivation 🚀