
RobustMQ Raft State Machine Continued Optimization: Re-Validating Groups, Heartbeat Clocks, and Runtime Threads

In the previous article, Raft State Machine Performance Investigation: The Problem Isn't RocksDB — It's Queuing, we confirmed a key finding: the main latency in client_write is not in RocksDB disk writes, but in queuing at the Raft entry point and waiting inside Raft.

This article continues to document the subsequent optimization process. The goal is not to produce a single impressive peak number, but to clearly explain the relationship between parameters and observed behavior, and ultimately converge on a stable, reproducible, and explainable configuration.


Three Questions This Article Aims to Answer

After the first round of conclusions, we narrowed the problem down to three directions:

  • Group sharding: if we increase data_raft_group_num from 1 to 4, raising the theoretical level of parallelism — will throughput actually improve?
  • Raft clock: to what extent do heartbeat_interval and election_timeout_min/max affect throughput and stability?
  • Tokio runtime threads: what is a more reasonable balance for server/meta/broker worker threads?

These three directions could theoretically each bring performance gains, but whether that holds in practice can only be determined by experimental data.


Experiment Setup and Environment Notes

To avoid false conclusions from "parameters changing while the environment also changes," we kept the benchmark methodology as consistent as possible throughout this round. The test machine is Apple Silicon (14 logical CPUs, 10P + 4E). The core test path is placement-create-session, using the project's built-in cli-bench tool, with primary concurrency levels of 2,000 and 5,000. Key metrics are ops/s, p95/p99, timeouts, and Raft-related metrics (write/log append/apply).

It's important to note that most of these experiments were in a "local benchmark + local service" setup, and results are susceptible to scheduling and thermal throttling effects. The conclusions presented here are reproducible results "under the current experimental conditions" and do not directly translate to absolute limits in a multi-machine production environment.


First, Let's Clarify: What Is a Raft Group, and What Did We Have Before?

Before getting into the experiments, let's align on terminology. Before the refactor, paths like data and offset were essentially a single-group model: one logical type corresponded to one Raft state machine, all requests entered through the same client_write entry point, and a single RaftCore serially advanced commits and applies. This model's advantages are: short pipeline, centralized state, and controllable overhead. Its disadvantage is that single-core serial capacity becomes the upper limit.

After the refactor, we introduced RaftGroup: a logical type no longer corresponds to just one node, but to a shard group (e.g., data_0..data_N). A request is first hashed by business key (e.g., client_id) to a particular shard, and then executes that shard's client_write. From an architectural standpoint, this step is meant to transform "a single RaftCore serial bottleneck" into "multiple RaftCores processing in parallel," theoretically improving total throughput.

A side-by-side comparison makes this clearer:

```text
Before refactor (single group):
request -> data(client_write) -> single RaftCore -> log append -> apply -> return

After refactor (RaftGroup):
request -> hash(key) -> data_i(client_write) -> RaftCore_i -> log append -> apply -> return
```

There's a point very easy to overlook here: the refactor adds not just parallelism, but also the background maintenance costs for routing, shard management, and each shard's operation (heartbeats, election timers, internal tasks, monitoring dimensions, etc.). When the deployment setup shares resources on a single machine, these "fixed costs" may manifest before the "parallel gains" do. In other words, the prerequisite for RaftGroup's benefits is not "more shards always means faster" — it's "the parallel gains from sharding must outweigh the system costs that sharding introduces."
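The hash-then-write routing step described above can be sketched in a few lines of Rust. This is an illustrative model only: the function name `route_to_shard` and the use of `DefaultHasher` are assumptions for the sketch, not RobustMQ's actual routing code.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Map a business key (e.g. a client_id) to a shard index in 0..group_num.
/// Illustrative sketch; the project's real routing may differ in detail.
fn route_to_shard(key: &str, group_num: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish() % group_num
}

fn main() {
    let groups = 4;
    // The same key always routes to the same shard, so per-key ordering
    // is preserved even though different keys fan out across groups.
    let shard = route_to_shard("client-42", groups);
    assert_eq!(shard, route_to_shard("client-42", groups));
    assert!(shard < groups);
    println!("client-42 -> data_{}", shard);
}
```

The important property is stability: a key must always land on the same shard, otherwise per-key write ordering breaks.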


Experiment 1: Runtime Threads from Auto to Explicit 16

We started with runtime threads. The test machine has 14 logical CPUs (10P + 4E). Under the auto strategy, throughput was usually around 11k ops/s. After explicitly setting threads to the following configuration, throughput noticeably improved to 20k+:

```toml
[runtime]
server_worker_threads = 16
meta_worker_threads = 16
broker_worker_threads = 16
```

Continuing to increase threads to 32 produced essentially the same result as 16; no linear improvement appeared. This pipeline is not a case of "more threads is always better": there is an effective threshold, beyond which you hit a new bottleneck. For this machine, 14 to 16 is the critical step, and from 16 to 32 the marginal gain is minimal. In other words, thread tuning here is more like replenishing the scheduling budget than continuously scaling up by stacking threads.
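The "auto versus explicit" distinction can be sketched with standard-library Rust. The helper name `worker_threads` is illustrative; the point is that "auto" roughly resolves to the detected logical CPU count, while an explicit config value overrides it:

```rust
use std::num::NonZeroUsize;
use std::thread;

/// Pick a worker-thread count: an explicit config value wins; otherwise
/// fall back to the detected logical CPU count (roughly what "auto" does).
fn worker_threads(configured: Option<usize>) -> usize {
    configured.unwrap_or_else(|| {
        thread::available_parallelism()
            .map(NonZeroUsize::get)
            .unwrap_or(1)
    })
}

fn main() {
    // On the article's test machine, auto resolves to 14; explicit 16 won.
    assert_eq!(worker_threads(Some(16)), 16);
    println!("auto would use {} threads here", worker_threads(None));
}
```

The measured sweet spot of 16 is specific to this 14-logical-CPU machine; on other hardware the effective threshold will sit elsewhere.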


Experiment 2: Group Sharding — data_raft_group_num=4 Shows Severe Regression

Intuitively, if a single Raft is serial and multiple groups run in parallel, throughput should increase. But measurements showed the exact opposite: increasing data_raft_group_num from 1 to 4 caused a cliff-edge performance drop. QPS fell from 20k+ to a few thousand or less, raft_write_latency climbed into the seconds range, and timeouts accumulated continuously. A representative observation: at the start of the benchmark the system could still sustain a few thousand ops/s, but it quickly settled into a 1k–3k plateau with growing timeouts, and p95/p99 stretched into the seconds.

This result shows that in the current experimental setup, multiple groups bring not just parallelism but also significant management overhead. Especially in a "local service + local benchmark" single-machine shared-resource scenario, 4 groups sharing the same CPU, runtime, scheduler, and storage path causes extra overhead to be amplified, ultimately exceeding the parallel gains. In this pipeline, data_raft_group_num=4 is a negative optimization; =1 is more stable.


Experiment 3: Raft Clock Parameters — From Aggressive to Conservative

One of the most critical variables in this round was the Raft clock. We first validated aggressive parameters:

```rust
heartbeat_interval: 10,
election_timeout_min: 15~20,
election_timeout_max: 50,
```

With this parameter set, throughput easily dropped to around 11k, and the regression was more pronounced in multi-group scenarios. We then switched to conservative parameters:

```rust
heartbeat_interval: 100,
election_timeout_min: 1000,
election_timeout_max: 2000,
```

Under the following configuration:

```toml
server_worker_threads = 16
meta_worker_threads = 16
broker_worker_threads = 16
data_raft_group_num = 1
```

placement-create-session recovered to 20k+ ops/s, with more stable error rates and latency. This result corroborates the earlier thread experiment: the threads merely replenished the scheduling budget, while the clock parameters determine Raft's background cost and the stability boundary.

This also explains a common misconception: heartbeat_interval and election_timeout are not "only active during failures." Under high-concurrency writes, they continuously affect background task frequency and coordination overhead, which ultimately feeds back into foreground throughput.


Why Do Clock Parameters Affect Throughput?

Many people understand these three parameters as "only affecting failover speed." In practice, on high-concurrency write paths, they directly affect throughput as well. The reason is that these parameters change the frequency and stability of Raft background tasks, and those background tasks share the same scheduling resources with foreground client_write calls.

The smaller heartbeat_interval is, the more frequently the Leader heartbeat and related state advancement run. With a single group, this overhead may still be acceptable, but as the number of groups increases, background ticks stack per group, causing the scheduler to spend more time on high-frequency maintenance tasks. Foreground requests don't necessarily see errors, but they're more likely to queue at the entry point, ultimately manifesting as declining throughput and rising tail latency. This process doesn't always show up as maxed-out CPU, so it's easy to misjudge as "the system isn't under pressure."
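The stacking effect is easy to quantify with back-of-the-envelope arithmetic. This is a simplified model (real Raft implementations batch and coalesce some of this work), but it makes the scaling visible:

```rust
/// Rough heartbeat ticks per second a node must drive across all groups,
/// ignoring batching: each group ticks once per `heartbeat_interval_ms`.
/// Simplified illustrative model, not measured RobustMQ internals.
fn heartbeat_ticks_per_sec(groups: u64, heartbeat_interval_ms: u64) -> u64 {
    groups * (1000 / heartbeat_interval_ms)
}

fn main() {
    // Aggressive clock, 4 groups: 4 * 100 = 400 background ticks/s.
    assert_eq!(heartbeat_ticks_per_sec(4, 10), 400);
    // Conservative clock, 1 group: just 10 ticks/s.
    assert_eq!(heartbeat_ticks_per_sec(1, 100), 10);
}
```

A 40x difference in background tick frequency, all of it sharing scheduler time with foreground client_write calls, is exactly the kind of fixed cost that shows up as entry-point queuing rather than CPU saturation.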

election_timeout_min/max are also not purely failure parameters. When the timeout window is too short, network jitter, scheduling jitter, and brief pauses are more likely to be interpreted as "heartbeat anomalies," potentially triggering election-related processes. Even without a genuinely frequent leader change, this kind of jitter introduces additional coordination costs that interfere with the stable progression of the write path. In terms of observable symptoms, this means a lower throughput plateau and greater variance, especially pronounced in high-concurrency and multi-group scenarios.

So clock parameters are not a binary choice between "availability priority" or "performance priority" — they are two ends of the same dial. For the current high-concurrency write pipeline, conservative parameters (100/1000/2000) are more appropriate for throughput stability.
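One practical guardrail follows from this: keep the election window comfortably wider than the heartbeat, so ordinary scheduling jitter cannot masquerade as a dead leader. A hedged sketch of such a check (the 10x multiple is a common rule of thumb, not a RobustMQ requirement, and `RaftClock`/`is_sane` are names invented for this example):

```rust
/// Raft clock parameters, all in milliseconds.
struct RaftClock {
    heartbeat_interval: u64,
    election_timeout_min: u64,
    election_timeout_max: u64,
}

/// Sanity-check a clock config: the election window must be a valid range,
/// and its lower edge should be a healthy multiple of the heartbeat so
/// that a few missed ticks do not trigger election machinery.
fn is_sane(c: &RaftClock) -> bool {
    c.election_timeout_min < c.election_timeout_max
        && c.election_timeout_min >= c.heartbeat_interval * 10
}

fn main() {
    let conservative = RaftClock {
        heartbeat_interval: 100,
        election_timeout_min: 1000,
        election_timeout_max: 2000,
    };
    let aggressive = RaftClock {
        heartbeat_interval: 10,
        election_timeout_min: 20,
        election_timeout_max: 50,
    };
    assert!(is_sane(&conservative));
    // A 20ms floor is only 2x a 10ms heartbeat: jitter-prone.
    assert!(!is_sane(&aggressive));
}
```

By this heuristic the conservative 100/1000/2000 set passes while the aggressive set fails, which lines up with the measured throughput behavior.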


Stability Note: Local Benchmark Environments Amplify Variance

Even with the same parameters, repeated benchmark runs still show variance — sometimes 20k+, sometimes falling back to around 15k. This phenomenon is strongly correlated with the local benchmark environment. The key is not "whether there are errors" but "whether noise is continuously interfering with the sampling window."

First, there is same-machine contention. A local benchmark running against a local service means the benchmark client, gRPC stack, and service-side runtime are all competing for CPU time on the same machine. When benchmark concurrency is high, the client itself consumes significant compute and scheduling resources, and this overhead is directly reflected in what looks like throughput variance from the service side.

Second, macOS scheduling and thermal management. Apple Silicon has performance cores and efficiency cores, and the system dynamically allocates tasks based on load and temperature. It's common to see a high plateau at the start of a benchmark run, followed by a drop-off after running for a while, then a recovery on the next run. This "high then low then high again" rhythm is very consistent with dynamic frequency scaling and thermal state transitions.

Third, the execution method itself. If you always use cargo run, compilation and execution are chained together in the same process. The prewarming caused by the compilation phase, changes in system cache state, and instantaneous load spikes all contaminate subsequent samples. Even though you're using the same parameters, the machine state at the start of each run is not actually the same.

Therefore, this kind of experiment should minimize environmental noise: standardize on --release, build first and then run the binary directly, warmup before sampling a fixed window each round, and take the median across multiple consecutive runs. Results obtained this way are closer to the system's true behavior rather than the machine's instantaneous state.
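The "warmup, then median of consecutive runs" step can be sketched as follows. The warmup count and sample values are illustrative:

```rust
/// Take per-run throughput samples, drop the first `warmup` runs, and
/// report the median of the rest. The median is robust against one
/// thermally throttled or cache-cold outlier run.
fn median_after_warmup(mut samples: Vec<f64>, warmup: usize) -> Option<f64> {
    if samples.len() <= warmup {
        return None;
    }
    let mut rest = samples.split_off(warmup);
    rest.sort_by(|a, b| a.partial_cmp(b).unwrap());
    Some(rest[rest.len() / 2])
}

fn main() {
    // One cold first run, then a throttled outlier among stable runs.
    let runs = vec![9_000.0, 20_500.0, 15_000.0, 20_900.0, 21_200.0, 20_700.0];
    assert_eq!(median_after_warmup(runs, 1), Some(20_700.0));
}
```

Averaging the same samples would be dragged down by the 15k outlier; the median reports the plateau the system actually sustains.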


The Core Counter-Intuitive Question (Before the Conclusion): Why Did More Raft Groups Not Raise Throughput, but Drastically Lower It?

Let's return to the most fundamental and counter-intuitive question raised here: a single Raft group having a bottleneck is understandable, but when CPU, memory, and the Tokio runtime all appear far from saturated, why did multiple Raft groups (group sharding) not improve performance, but instead cause a severe degradation?

The key to this question is that "resources not saturated" only means we haven't hit a hardware ceiling, not that the write pipeline has no soft bottlenecks. The current data looks more like waiting within the pipeline being amplified, rather than insufficient hardware compute. We observed raft_write_latency in the seconds range, while log append and apply showed only millisecond-level increases, indicating that most time is spent not in execution itself, but in waiting before and between executions.

Breaking this phenomenon down makes it easier to understand why it seems "counter-intuitive":

First, when groups increase from 1 to 4, the system adds not only parallel workers but also 4 sets of Raft background activity. Each shard has its own heartbeats, election timers, state advancement, and internal tasks — individually lightweight, but they continuously consume scheduling slices and share the same runtime resource pool with foreground requests.

Second, on a single machine no new physical resources are introduced. We're not distributing 4 groups to 4 machines or 4 independent disks; we're splitting into 4 logical paths on the same machine. The result is "logical parallelism increases while physical resources don't scale in tandem," so parallel gains and coordination costs don't change in the same direction.

Third, the entry gate of the current pipeline still exists. Even if each shard has an independent RaftCore internally, as long as requests queue up at any point — before or after routing, during commit advancement, or at internal notification — overall latency rapidly amplifies and throughput first plateaus, then noticeably declines.

This is also why we see what seems like a contradictory combination: no obvious CPU saturation, no RocksDB single-point explosion, yet end-to-end throughput degrades dramatically. It's not that "some component broke," but that system-wide waiting costs are amplified across multiple concurrent sets.

Based on current evidence, our interim judgment is: this is not evidence that the Multi-Raft direction is wrong, but rather that in the current single-machine experimental setup, the extra costs of data_raft_group_num=4 exceed the parallel gains. To validate the true value of multiple groups, we need to continue the controlled experiment under conditions closer to production topology — for example, multi-machine, multi-replica, and more complete segment-level pipeline observation.

In other words, this round has answered "what the phenomenon is" and "where the phenomenon most likely comes from," but has not yet answered "whether this would still be the case in an online topology." This boundary must be clear in any technical discussion: the conclusion is stage-dependent and scenario-bound, not a final verdict on Multi-Raft.


Current Recommended Baseline

This round of tuning essentially locked in the three most critical variables. The current recommended baseline is:

  • data_raft_group_num = 1. In this single-machine scenario, setting it to 4 causes obvious regression.
  • Raft clock using 100/1000/2000. This parameter set clearly outperforms aggressive parameters.
  • Runtime threads using 16. On this machine, 16 is the effective threshold; 32 provides almost no additional benefit.

These parameters are not necessarily theoretically optimal, but they are already a stable baseline that is reproducible, explainable, and can continue to evolve.


Next Steps

The next step is not to do broad optimizations, but to focus on solving one core problem: why throughput decreases rather than increases when data_raft_group_num goes from 1 to 4. Around this problem, subsequent work will follow a single main thread:

  • First, break apart the complete client_write path in the multi-group scenario to identify whether the seconds-level latency is primarily concentrated in entry queuing, commit waiting, or shard scheduling contention.
  • Then, under the same benchmark model, compare group=1 vs. group=4 one-to-one, putting per-segment latency and each shard's load distribution on the same graph to confirm exactly where the regression comes from.
  • Finally, apply targeted optimizations only to confirmed bottlenecks, rather than continuing to blindly tune threads or change other parameters.

The goal of this round is not to find another configuration that merely "looks faster," but to thoroughly explain why multiple Raft groups cause regression and to form conclusions that can be stably reproduced.

Project URL: https://github.com/robustmq/robustmq

🎉 Since you're already signed in to GitHub, why not give us a Star while you're here! ⭐ Your support is our greatest motivation 🚀