RobustMQ Raft State Machine Performance Breakthrough: Batch Semantics + Concurrency Governance, from 20k to 140k ops/s
The conclusion from the previous article, Raft State Machine Continued Optimization: Re-Validating Groups, Heartbeat Clocks, and Runtime Threads, was: in the current single-machine environment, data_raft_group_num=4 is actively counterproductive, while =1 combined with conservative clock parameters holds steady around 20k ops/s. The open question at the end was: why do more Raft groups not improve performance but actually regress it? And what comes next?
This article documents the analysis process that followed, and the eventual breakthrough. The conclusion first: the bottleneck is not in the number of Raft groups, but in request granularity and concurrency depth. By changing CreateSession to batch semantics and reducing client concurrency from thousands to 20, single Raft group throughput jumped directly from 20k ops/s to 140k ops/s, with 100% success rate and P95 latency < 24ms.
Back to the Core Question: Where Is the Bottleneck?
The previous round's conclusion was that raft_write_latency reached the order of seconds while log append and apply stayed at the millisecond level, and we attributed the cause to "entry queuing and system-wide waiting." This judgment was directionally correct, but not precise enough: it answered "time is spent waiting," but not "why the wait is so long."
This round we focused our attention on OpenRaft's internal implementation. After reading openraft-0.9.21/src/core/raft_core.rs, we discovered a key fact:
RaftCore::append_to_log does not simply queue the log entry and return: internally it blocks on rx.await until the log flush callback fires. In other words, every single proposal stalls the Raft core event loop until the disk write completes.
This means the Raft core's proposal processing capacity has a hard ceiling: approximately 33–50 proposals per second (depending on the latency of a single log flush). When external concurrency far exceeds this rate, requests all pile up in the rx_api channel queue. The deeper the queue, the longer tail requests wait, ultimately manifesting as second-level end-to-end latency and widespread client timeouts.
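The ceiling above is simple arithmetic. A minimal sketch (illustrative numbers only; the 20–30ms flush latency is an assumption implied by the 33–50/s range, not a measured figure from the article):

```rust
// Back-of-the-envelope model: a Raft core that blocks on every log flush
// can process at most 1000 / flush_latency_ms proposals per second.
// The flush latencies below are assumptions chosen to match the 33-50/s range.
fn proposal_ceiling(flush_latency_ms: f64) -> f64 {
    1000.0 / flush_latency_ms
}

fn main() {
    println!("{:.0} proposals/s at 30ms flush", proposal_ceiling(30.0)); // ~33
    println!("{:.0} proposals/s at 20ms flush", proposal_ceiling(20.0)); // ~50
}
```

Everything arriving faster than this ceiling can only accumulate in the rx_api queue.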
This discovery explains all the unexplained phenomena from the previous article:
- Why were log append and apply fast, but raft_write_latency was high? Because latency is not execution time — it's queuing time plus execution time. Execution itself only takes milliseconds, but a request at the end of the queue might wait one or two seconds before its turn.
- Why did multiple groups make things worse? 4 groups means 4 sets of Raft background tasks competing for scheduling time slices in the same runtime; tokio's scheduling latency gets stretched, and each group's effective proposal processing rate is actually lower.
- Why didn't CPU saturate, yet performance couldn't improve? Because the bottleneck isn't CPU compute — it's serial waiting in the Raft core event loop. The core thread is idle on rx.await, and no number of extra cores helps.
The conclusion is clear: the real lever is not "add more Raft instances," but "reduce the number of proposals each instance needs to handle" and "control queue depth."
Optimization 1: Full-Pipeline Batch Semantics for CreateSession
Since a single Raft group can only process about 40 proposals per second, letting each proposal carry more business data is the most direct amplifier.
Proto Layer: From Single to Batch
The original CreateSessionRequest had single-item semantics — one request creates one Session. After the refactor, we introduced CreateSessionRaw, allowing one request to carry a batch of Sessions:
```protobuf
message CreateSessionRaw {
  string client_id = 1;
  bytes session = 2;
}

message CreateSessionRequest {
  repeated CreateSessionRaw sessions = 1;
}
```

This way, one gRPC call, one Raft proposal, and one state machine apply can handle N Sessions. With batch_size=100, 40 proposals/s equals 4,000 Sessions/s — and that's just the theoretical lower bound for a single group.
State Machine Apply Layer: RocksDB WriteBatch
Previously, the apply phase performed a separate rocksdb.put_cf() for each Session. N Sessions meant N WAL fsyncs, each going through the complete "serialize → get CF handle → write disk" path.
After the refactor, we added an engine_batch_save interface at the rocksdb-engine layer, using RocksDB's native WriteBatch:
```rust
pub fn engine_batch_save<T: Serialize>(
    engine: &Arc<RocksDBEngine>,
    column_family: &str,
    source: &str,
    entries: &[(String, &T)],
) -> Result<(), CommonError> {
    let cf = get_cf_handle(engine, column_family)?;
    let mut batch = rocksdb::WriteBatch::default();
    for (key, value) in entries {
        let wrap = StorageDataWrap::new(value);
        let serialized = serialize(&wrap)?;
        batch.put_cf(&cf, key.as_bytes(), &serialized);
    }
    engine.write_batch(batch)
}
```

All put operations within a single WriteBatch trigger only one WAL fsync, with atomicity guaranteed by RocksDB. 100 Sessions go from 100 fsyncs to 1 — reducing the I/O overhead in the apply phase to 1/100 of the original.
Effect Pipeline
The write pipeline after the refactor:
```
cli-bench (100 sessions) → 1 gRPC call → 1 Raft proposal → 1 log append
→ 1 apply → 1 RocksDB WriteBatch (100 puts, 1 fsync) → return
```

Compared to before the refactor (100 independent requests → 100 proposals → 100 fsyncs), the Raft core's burden is reduced by two orders of magnitude.
Optimization 2: Concurrency Governance — Less Is More
Batch semantics solved the problem of "each proposal doing too little work," but there's another equally important dimension: queue depth.
Why Excessive Concurrency Is Harmful
Previous rounds of benchmarking consistently used --concurrency 2000 or even 5000. Intuitively, high concurrency means high throughput. But for Raft write pipelines, this intuition is wrong.
Raft core is a strictly serial processor: one proposal in → log flush → wait for callback → process the next one. No matter how many concurrent requests are waiting outside, its processing pace doesn't speed up because "there are more people in line." On the contrary, the deeper the queue, the larger the side effects:
- Scheduling contention intensifies. 2,000 tokio tasks are simultaneously in an await state. Each time Raft core completes a proposal, the runtime must select the next one to wake from a large number of wakers. This scheduling overhead grows with concurrency.
- Tail latency worsens. Latency for a request at the tail of the queue ≈ queue depth × single proposal processing time. Concurrency 2000 + single proposal 25ms = the last request might wait 50 seconds.
- Timeouts trigger chain reactions. The client has a 3-second timeout, but the request spent 3 seconds waiting in queue and still hasn't been processed — so the client cancels the request and closes the connection. When the Raft server eventually gets to processing it, it finds the oneshot receiver has already been dropped, triggering `OneshotConsumer.tx.send: is_ok: false` warnings. These "ghost proposals" consume Raft processing capacity without producing any useful results.
What Happened When We Dropped to 20 Concurrency
After reducing --concurrency from thousands to 20:
- Queue depth dropped from thousands to no more than 20. Each proposal that comes in is processed almost immediately without queuing.
- Scheduling overhead dropped dramatically. The tokio runtime only needs to manage 20 active tasks; context switches are greatly reduced.
- Zero timeouts. All requests complete in milliseconds, with no "ghost proposals" wasting Raft processing capacity.
- Raft core utilization is actually higher. Because there's no scheduling jitter and timeout interference, the core can advance proposals at a more stable pace.
This is the classic queueing theory phenomenon (Little's Law): at a fixed service rate, reducing queue depth does not reduce throughput, but dramatically reduces latency. And reduced latency eliminates timeouts and retries, further improving effective throughput.
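The queueing arithmetic above can be sketched directly. The 25ms service time, 3s timeout, and the two concurrency levels are the figures from the text; the model is the simplest possible serial queue (no jitter, no retries):

```rust
// For a strictly serial server, the tail request's wait is roughly
// queue_depth x per-proposal service time. Throughput stays pinned at the
// service rate either way; depth only buys latency.
fn tail_wait_ms(queue_depth: u64, service_ms: u64) -> u64 {
    queue_depth * service_ms
}

fn main() {
    let timeout_ms = 3_000; // client timeout from the text

    // concurrency 2000: the tail request waits ~50s, timing out long before.
    assert_eq!(tail_wait_ms(2_000, 25), 50_000);
    assert!(tail_wait_ms(2_000, 25) > timeout_ms);

    // concurrency 20: worst-case wait is 500ms, comfortably inside the timeout.
    assert_eq!(tail_wait_ms(20, 25), 500);
    assert!(tail_wait_ms(20, 25) < timeout_ms);
}
```

This is why dropping concurrency eliminated timeouts without costing any throughput.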
Benchmark Results
Test Environment
- Apple Silicon, 14 logical CPUs (10P + 4E)
- Local service + local benchmark (single-machine all-in-one)
- data_raft_group_num = 1, Raft clock 100/1000/2000, runtime threads 16
- Debug build (non-release)
Benchmark Command
```shell
cargo run --package cmd --bin cli-bench meta placement-create-session \
  --host 127.0.0.1 \
  --port 1228 \
  --count 10000000 \
  --concurrency 20 \
  --batch-size 100 \
  --timeout-ms 60000 \
  --output table
```

Key Metrics
| Metric | Value |
|---|---|
| Total Sessions | 10,000,000 |
| Total Duration | 69 seconds |
| Average Throughput | 144,927 sessions/s |
| Peak Throughput | 179,000 sessions/s |
| Success Rate | 100% |
| Timeouts | 0 |
| Latency P50 | 11.7ms |
| Latency P95 | 23.9ms |
| Latency P99 | 57.3ms |
| Latency min | 1.6ms |
Comparison with Previous Round
| Dimension | Previous Round (Single + High Concurrency) | This Round (Batch + Low Concurrency) |
|---|---|---|
| batch_size | 1 | 100 |
| concurrency | 2000~5000 | 20 |
| Throughput (ops/s) | ~20,000 | ~145,000 |
| P95 Latency | hundreds of ms to seconds | 24ms |
| Timeout Rate | continuously accumulating | 0 |
| Success Rate | < 100% | 100% |
Throughput improved approximately 7x, latency reduced by approximately two orders of magnitude, and timeouts went from "continuously accumulating" to zero.
Grafana Monitoring Validation
From the Raft metrics panels, we can see that the state machine after optimization is running very healthily:
- Raft Write QPS: data_0 stable at ~340 req/s (i.e., ~340 proposals/s × 100 sessions = ~34,000 sessions/s Raft-layer throughput)
- Raft Write Success Rate: perfectly matching the request volume, no failures
- Raft Write Failure Rate: 0 failures across the entire pipeline
- Log Append Latency P50: ~0.5ms
- Apply Batch Latency P50: ~0.5ms
- Raft Write Latency: millisecond-level, no more second-long spikes
This data shows that all components within Raft are operating in comfortable ranges: log flush is fast, apply is fast, no queuing backlog. The latency of the entire pipeline is now the true overhead of "network + serialization + log flush + apply" — with no additional waiting amplification.
Looking Back at the Previous Article's Unresolved Issue: Why Does Multi-Group Always Regress?
The core confusion in the previous article was "why do multiple Raft groups regress?" At the time, we attributed it to "single-machine shared resources where coordination costs exceed parallel gains." This attribution was directionally correct, but not precise enough.
In this round, with batch semantics already in place, we did another controlled experiment: changing data_raft_group_num from 1 to 4 with all other parameters unchanged (concurrency=20, batch_size=100). The result was still dramatically degraded performance. This rules out the possibility that "the previous round's multi-group regression was caused by single-item semantics + high concurrency" — even with Batch + low concurrency, multiple groups still perform worse than a single group.
After tracing through the code, we ultimately pinpointed the regression cause to OpenRaft's per-proposal async scheduling model, not RocksDB itself.
First, let's rule out an easy-to-misjudge direction: all Raft groups share the same Arc<DB>, but Grafana shows that Log Append Latency is only ~0.5ms, meaning a single db.write() in RocksDB is very fast — theoretically handling ~2,000 writes per second. Even if 4 groups write simultaneously (4 × 340 ≈ 1,360 times/second), RocksDB can handle it comfortably. RocksDB is not the bottleneck.
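The elimination argument is pure arithmetic, worth writing down. The figures (0.5ms per write, ~340 proposals/s per group, 4 groups) are the Grafana readings cited above:

```rust
// Capacity implied by the measured per-write latency vs. aggregate demand
// if all 4 groups wrote to the shared Arc<DB> at full rate.
fn writes_capacity_per_sec(single_write_ms: f64) -> f64 {
    1000.0 / single_write_ms
}

fn aggregate_demand_per_sec(groups: f64, proposals_per_group: f64) -> f64 {
    groups * proposals_per_group
}

fn main() {
    let capacity = writes_capacity_per_sec(0.5);       // ~2,000 writes/s
    let demand = aggregate_demand_per_sec(4.0, 340.0); // ~1,360 writes/s
    assert!(demand < capacity); // RocksDB has headroom; it is not the bottleneck
}
```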
The true bottleneck is: each proposal in OpenRaft is not just a db.write() — it's a tokio async round-trip:
```
RaftCore (task A) → sends entries to log store → yields, waits for callback (rx.await)
        ↓ tokio scheduler switches
LogStore worker (task B) → db.write(batch), 0.5ms → callback.log_io_completed()
        ↓ tokio scheduler switches
RaftCore (task A) → woken up → continues run_engine_commands → processes next proposal
```

With a single group, this round-trip involves 2 tokio task switches, with a measured total latency of ~3ms (0.5ms disk + the rest is all scheduling overhead). So 1 group can stably run at ~340 proposals/sec.
With 4 groups, the situation is completely different:
- 4 RaftCore tasks + 4 log worker tasks + 4 SM worker tasks + 4 sets of heartbeat timers + 4 sets of election timers ≈ 16+ active tasks being scheduled on the same tokio runtime
- Each task switch takes longer — not because there isn't enough CPU, but because the tokio scheduler has to select the next waker from a larger set, making the scheduling queue deeper
- Each group's per-proposal round-trip grows from ~3ms to ~8–10ms; the effective proposal rate for each single group noticeably drops
- 4 sets of background heartbeats and Balancer continuously insert high-frequency small tasks, further disrupting the scheduling rhythm of foreground proposals
The final effect: the total throughput of 4 groups not only fails to reach 4x, but is actually lower than a single group. Because the increase in per-proposal latency (caused by scheduling contention) exceeds the theoretical gain from group parallelism.
```
data_raft_group_num = 1:
  1 RaftCore, ~3ms per proposal → ~340 proposals/s → 34,000 sessions/s (batch=100)

data_raft_group_num = 4:
  4 RaftCores, ~8-10ms per proposal → ~100-125 proposals/s per group
  Total ~400-500 proposals/s? Not necessarily — 4 sets of background overhead further reduce this
  Actual observation: total throughput noticeably below single group
```

Core conclusion: every OpenRaft proposal must go through a "core → log worker → core" async round-trip. The latency of this round-trip is primarily determined by tokio scheduling latency, not disk I/O. When multiple groups run on the same tokio runtime, scheduling contention amplifies each proposal's round-trip latency, causing total throughput to decrease rather than increase.
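The regression model above can be sketched in a few lines. The ~3ms single-group round-trip is the measured figure; the ~8-10ms per-group round-trip under 4 groups is the estimate from the text:

```rust
// Aggregate proposal rate for g groups when each group's per-proposal
// async round-trip takes round_trip_ms (scheduling contention makes
// round_trip_ms grow with the group count).
fn aggregate_proposals_per_sec(groups: f64, round_trip_ms: f64) -> f64 {
    groups * 1000.0 / round_trip_ms
}

fn main() {
    let one_group = aggregate_proposals_per_sec(1.0, 3.0);          // ~333/s
    let four_groups_best = aggregate_proposals_per_sec(4.0, 8.0);   // ~500/s on paper
    let four_groups_worst = aggregate_proposals_per_sec(4.0, 10.0); // ~400/s on paper

    // Even the on-paper totals fall far short of 4x a single group,
    // before 4 sets of heartbeats and Balancer tasks eat further into them.
    assert!(four_groups_best < 4.0 * one_group);
    assert!(four_groups_worst < 2.0 * one_group);
}
```

Once the round-trip latency grows faster than the group count adds parallelism, adding groups is a net loss, which matches the observed regression.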
The fundamental issue is not RocksDB, not a bug in OpenRaft, but the deployment topology of "multiple serial state machines sharing the same async runtime" having an inherent scheduling ceiling. To break through this ceiling, the direction is to run each group on an independent runtime (independent thread pool), or to improve OpenRaft's log flush path to reduce the number of async round-trips. Under the current architecture, data_raft_group_num=1 combined with batch semantics is the optimal solution.
Summary of This Round's Improvements
This round did two things — neither complex, but with direct impact:
First, batch semantics. Changed CreateSession from single-item to batch: one Raft proposal handles N Sessions. No changes needed to the Raft layer; only the gRPC request and state machine apply logic needed to become batch-aware. The storage layer replaced per-item puts with a RocksDB WriteBatch — N fsyncs became 1. This is vertical optimization — each proposal does more useful work.
Second, concurrency governance. Reduced client concurrency from thousands to 20. Not "reducing pressure," but "eliminating useless queuing." The Raft core's processing rate is fixed; excess concurrency only piles up queuing, producing timeouts and scheduling overhead that actually reduce effective throughput. Controlling concurrency = controlling queue depth = eliminating tail latency = eliminating timeouts = increasing effective throughput. This is horizontal optimization — operating the system at its optimal working point.
The combination of both brought a single Raft group on a single machine in debug mode to 140k sessions/s. This number still has room to grow: release build, multi-machine deployment to eliminate same-machine contention, and moderately increasing the number of groups. But the current baseline is already sufficient to demonstrate that the direction is correct.
Next Steps
The combination of Batch + concurrency governance has brought the placement-create-session pipeline to a healthy baseline. Follow-up work proceeds along three lines:
First, extend batch semantics to other high-frequency write paths (Subscribe, Offset, etc.), using the same approach to reduce Raft proposal pressure.
Second, evaluate the feasibility of multi-Raft-group independent runtimes. Currently all groups share the same tokio runtime, and scheduling contention is the root cause of multi-group regression. If each group runs on an independent runtime (independent thread pool), per-proposal async round-trip latency would not be amplified by increasing group count, and the parallel gains from multiple groups could truly be realized. This change requires evaluating the trade-off between thread resource overhead and architectural complexity.
Third, continue optimizing per-proposal processing efficiency on the single-group baseline. The current debug mode already hits 140k sessions/s; release build, multi-machine deployment, and fine-tuning OpenRaft configuration (such as max_payload_entries) still have room for further improvement.
Project URL: https://github.com/robustmq/robustmq
