
Tokio Scheduling Jitter and io_uring: From Iggy's Exploration to RobustMQ's Thinking

Background

While load-testing the RobustMQ MQTT Broker, we noticed a strange phenomenon: in the meta-service's raft apply path, the set_last_applied step had wildly inconsistent latency — sometimes under 1ms, sometimes spiking to 5–11ms. Yet this step does something very simple: write one record to RocksDB.

RocksDB's monitoring metrics showed actual write latency consistently under 1ms.

The gap between those two numbers points not to the storage itself, but to scheduling jitter in the tokio async runtime.

While investigating this, we discovered that Apache Iggy — an open-source message streaming system written in Rust — went through a system-level architecture migration from tokio to io_uring between 2024 and 2026, leaving behind highly detailed technical analysis and performance data. The problems they encountered closely matched what we were observing. This article summarizes Iggy's exploration, RobustMQ's current observations, and our thinking.

What Is Iggy

Apache Iggy is a high-performance message streaming platform written in Rust, aiming to provide Kafka-like persistent message streaming with lower latency and higher throughput. The core path is sequential local file writes: producer writes → persisted to segment files → consumer reads. This path is extremely sensitive to I/O latency — any jitter shows up directly in end-to-end latency.


The Problems Iggy Encountered

Problem 1: The Inherent Cost of tokio's Work-Stealing Scheduler

tokio's multi-threaded executor uses a work-stealing scheduling strategy. Each worker thread maintains a local task queue; when its local queue is empty, it "steals" tasks from other threads' queues. This design works extremely well for general web services — it automatically balances load and keeps CPU utilization high.

But for high-throughput storage systems, work-stealing has two inherent costs:

CPU cache invalidation: When a task suspends at await on worker A and is then stolen by worker B, the data the task was accessing (stack frames, local variables, associated data structures) is already in worker A's L1/L2 cache. Once moved to worker B, that data is cold and must be reloaded from slower L3 cache or main memory. For a system executing storage operations at high frequency, the accumulated cache misses are significant.

Uncontrollable scheduling timing: When a task is scheduled after await — and on which thread — is determined by the runtime's internal state; the application layer has no visibility. When many tasks compete for worker threads simultaneously (e.g., 1000 concurrent connections all hitting the storage path), the wait from a task's await suspension to its next scheduled execution can be several milliseconds. That wait is added to the measured latency of whatever operation appears after the await.

Iggy's team described this as: "high-performance systems quickly exhaust the capacity of this thread pool" and "we lack control over scheduling behavior."

Problem 2: tokio Cannot Do True Async I/O on Block Devices

This is a more fundamental architectural problem.

tokio's underlying mechanism is epoll (on Linux). epoll watches file descriptors for "readiness" and notifies the application when an fd is readable or writable. For network sockets this is a perfect fit — TCP connections genuinely have readiness states, data arrival makes the fd readable, available send buffer makes it writable.

But Linux handles regular files (on block devices) completely differently: regular file fds are always considered "ready." This means epoll cannot sense the actual completion of file I/O — when you call write() to a file, epoll cannot tell you "this write has been flushed to disk."

tokio's solution is to hand all file I/O operations (tokio::fs::write, or synchronous I/O inside spawn_blocking) to an internal auxiliary thread pool. This pool is blocking — operations complete synchronously inside it, then the results are passed back to the async context. By default the pool can scale up to 512 threads.

For a message broker whose core path is continuous sequential disk writes, using a 512-thread auxiliary thread pool for all disk I/O not only introduces thread-switching and context-switching overhead, but sets an explicit scaling ceiling. When concurrent writes ramp up, the thread pool itself becomes a bottleneck, producing queuing latency.

Iggy measured a peak sequential read throughput of 10–12 GB/s on top of tokio — already approaching this architecture's limits.

Problem 3: Fundamental Incompatibility Between Poll-Based and Completion-Based Models

io_uring, introduced in Linux 5.1, is a high-performance I/O interface with a fundamentally different design philosophy from epoll.

epoll is readiness-based:

  1. Register interest in an fd
  2. epoll tells you "this fd is ready to operate"
  3. You issue a syscall to complete the operation

io_uring is completion-based:

  1. You submit an operation description (what to do, which buffer) to the submission queue
  2. The kernel executes the operation asynchronously
  3. On completion, it places the result in the completion queue to notify you

This difference is most critical in buffer ownership. io_uring requires that from the time an operation is submitted to the kernel until the kernel returns a completion notification, the buffer is under kernel control — the application cannot touch it. tokio's Future model is poll-based; buffer lifetimes are managed by Rust's ownership system, which inherently conflicts with io_uring's kernel-ownership semantics.

Integrating io_uring on top of tokio requires extensive compromises and adaptations at the Rust type system level (e.g., using Arc<[u8]> for shared ownership, or manually managing lifetimes in unsafe code), and the available io_uring features are restricted. This isn't something engineering can optimize away — it's a fundamental clash of two design philosophies.


Iggy's Exploration Process

Iggy spent nearly two years going through four phases to complete this migration.

Phase 1: Hitting the Ceiling on tokio (H1 2024)

After exhausting conventional engineering optimizations (reducing lock contention, batching writes, optimizing serialization, etc.), Iggy's sequential read throughput stabilized at 10–12 GB/s. The team concluded this was near the architectural limit of tokio + epoll on current hardware — further optimization had near-zero marginal returns — and began systematically evaluating alternatives.

Phase 2: monoio Proof-of-Concept — Proving the Direction (H2 2024)

The first candidate was ByteDance's open-source monoio.

monoio's design: abandon work-stealing, adopt thread-per-core — each thread is pinned to a CPU core, runs an independent io_uring-based event loop, and tasks never migrate between threads. This completely eliminates cache invalidation and scheduling uncertainty. It also uses io_uring directly rather than epoll, so file I/O is true async completion notification — no auxiliary thread pool needed.

After porting the server's core path to monoio, Iggy's sequential read throughput jumped from 10–12 GB/s to 15+ GB/s, an improvement of more than 25%. This data validated the direction.

But monoio had clear limitations: incomplete support for the io_uring feature set (many advanced operations unavailable), and limited community maintenance outside ByteDance. Iggy's team wasn't willing to bet their core architecture on a dependency that might become unmaintained.

Phase 3: Glommio — Technically Mature, but Maintenance Had Stalled (H2 2024)

The second candidate was DataDog's open-source Glommio.

Glommio's technical background is strong: the original author, Glauber Costa, previously led Linux kernel cgroup work. Glommio's design is directly inspired by Seastar (a high-performance C++ async framework, the runtime underlying ScyllaDB), also using thread-per-core + io_uring, and includes production-grade features like task priorities and shared memory pools.

From a purely technical standpoint, Glommio was the most mature of the options.

But after Glauber Costa left DataDog to join Turso (a database company), Glommio's maintenance essentially stalled — PRs piling up, issues going unanswered. By late 2024 the project was effectively unmaintained. Iggy rejected this option.

Phase 4: Final Choice — compio (Active Maintenance + Architectural Decoupling, 2025–2026)

The final choice was compio, a relatively new (started 2023) Rust async runtime.

compio's core architecture:

  • Completion-based I/O: based on io_uring (Linux) and IOCP (Windows), not an epoll wrapper
  • Executor decoupled from I/O driver: executor and underlying I/O mechanism are separate and independently replaceable — upgrading to newer io_uring features doesn't require changing application-layer code
  • Cross-platform: io_uring on Linux, IOCP on Windows, kqueue on macOS (fallback), same interface everywhere
  • Actively maintained: continuous contributors and version iteration

The final system architecture: compio + io_uring + thread-per-core shared-nothing.

Each thread is pinned to a physical CPU core (via CPU affinity). Network connections and storage partitions are assigned to fixed threads by consistent hash. Threads share no data structures and require no locks. This design philosophy is the same as ScyllaDB and Redpanda — use shared-nothing to eliminate concurrency primitive overhead, use thread affinity to maintain cache warmth.

Post-Migration Performance Data

The following data is from the February 2026 test report. Test environment: AWS. Target throughput: 1,000 MB/s. fsync enabled to guarantee per-message durability.

| Scenario | Metric | tokio | compio/io_uring | Improvement |
| --- | --- | --- | --- | --- |
| 8 producers × 8 streams | P999 latency | 2.36 ms | 1.81 ms | 23% |
| 8 producers × 8 streams | P9999 latency | 34.00 ms | 6.51 ms | 81% |
| 16 producers × 16 streams | P95 latency | 2.52 ms | 1.82 ms | 28% |
| 16 producers × 16 streams | P99 latency | 3.01 ms | 2.05 ms | 32% |
| 16 producers × 16 streams | P9999 latency | 86.30 ms | 7.17 ms | 92% |
| 32 producers × 32 streams | P95 latency | 3.77 ms | 1.62 ms | 57% |
| 32 producers × 32 streams | P99 latency | 4.52 ms | 1.82 ms | 60% |
| 32 producers × 32 streams | P9999 latency | 27.52 ms | 11.83 ms | 57% |
| 16 partitions + fsync | Throughput | 843 MB/s | 992 MB/s | 18% |
| 16 partitions + fsync | P95 latency | 18.00 ms | 9.98 ms | 45% |

Key pattern: throughput improvement is modest (10–25%), but tail latency improves dramatically — and the more load, the greater the improvement.

P95/P99 improvements range from 23% to 60%; P9999 improvements reach 57%–92%. This pattern is exactly what you'd expect: tokio work-stealing scheduling jitter is mild at low concurrency but causes tail latency to deteriorate sharply at high concurrency (32 producers × 32 streams). Thread-per-core latency distribution is narrower and unaffected by increasing concurrency.

The earlier monoio proof-of-concept (2024) data is more direct: sequential read throughput jumped from 10–12 GB/s to 15+ GB/s, showing io_uring's stronger advantage in pure-throughput scenarios. The reason compio's final numbers show modest throughput improvement is that the test included fsync, shifting the bottleneck to disk persistence latency rather than I/O throughput itself.


What RobustMQ Observes

RobustMQ's meta-service implements distributed consensus based on openraft, using RocksDB to persist raft logs and state machine data. During MQTT Broker connection benchmarking (mqttx bench conn -c 1000), we added fine-grained step-by-step timing in the raft apply path and observed the following.

Observation 1: Extreme Jitter in set_last_applied on the Apply Path

```log
WARN meta_service::raft::store::state: [data_0] apply batch slow: total=11.07ms route=0.07ms membership=0.00ms set_last_applied=11.00ms
WARN meta_service::raft::store::state: [data_0] apply batch slow: total=6.29ms  route=0.05ms membership=0.00ms set_last_applied=6.24ms
WARN meta_service::raft::store::state: [data_0] apply batch slow: total=8.76ms  route=0.05ms membership=0.00ms set_last_applied=8.71ms
WARN meta_service::raft::store::state: [data_0] apply batch slow: total=5.29ms  route=0.86ms membership=0.00ms set_last_applied=4.43ms
WARN meta_service::raft::store::state: [data_0] apply batch slow: total=10.10ms route=0.07ms membership=0.00ms set_last_applied=10.02ms
WARN meta_service::raft::store::state: [data_0] apply batch slow: total=5.56ms  route=0.03ms membership=0.00ms set_last_applied=5.54ms
WARN meta_service::raft::store::state: [data_0] apply batch slow: total=5.08ms  route=0.05ms membership=0.00ms set_last_applied=5.03ms
```

The three fields:

  • route: actual business processing (decode request, update cache, write RocksDB storage data)
  • membership: raft membership change processing
  • set_last_applied: write the current log index to RocksDB, recording how far the state machine has applied

From the numbers: route takes 0.03–0.86ms; set_last_applied takes 4–11ms. The former is the actual business logic; the latter is just one put_cf write.

Observation 2: Consistently Slow session_process on the Connect Path

Slow-request logs from MQTT Broker at the same time:

```log
WARN mqtt_broker::mqtt::connect: [connect] slow connect_id=5922  total=30.61ms get_cluster=0.00ms check_limit=0.04ms build_conn=0.01ms session_process=30.44ms add_cache=0.04ms st_report=0.02ms
WARN mqtt_broker::mqtt::connect: [connect] slow connect_id=18258 total=31.47ms get_cluster=0.00ms check_limit=0.04ms build_conn=0.01ms session_process=31.32ms add_cache=0.02ms st_report=0.02ms
WARN mqtt_broker::mqtt::connect: [connect] slow connect_id=19237 total=51.11ms get_cluster=0.00ms check_limit=0.05ms build_conn=0.02ms session_process=50.91ms add_cache=0.03ms st_report=0.03ms
WARN mqtt_broker::mqtt::connect: [connect] slow connect_id=19238 total=51.22ms get_cluster=0.00ms check_limit=0.04ms build_conn=0.01ms session_process=51.08ms add_cache=0.02ms st_report=0.03ms
```

session_process is consistently 30–51ms; all other steps combined are under 1ms. session_process sends a gRPC request to meta-service and waits for raft consensus to complete and return.

Root Cause Analysis

Put the two logs together and the chain is clear:

```text
MQTT connect → session_process (gRPC) → meta-service → raft apply → set_last_applied (RocksDB)
```

session_process's 30–51ms = gRPC network round trip + raft consensus latency + apply() execution time.

The 5–11ms of set_last_applied inside apply() is the time spent waiting for the task to be scheduled onto a tokio worker thread — not RocksDB being slow, but the task waiting in the queue to be executed.

Why set_last_applied and not route?

apply() is an async fn with multiple await points — it awaits the route function for each log entry. After each await, the task is suspended and re-enters the scheduling queue. With 1000 concurrent CONNECTs saturating tokio worker threads, the wait from a task's await suspension to being rescheduled can be several milliseconds.

route is the first await in each log entry's processing — the task was just scheduled and can usually execute quickly. set_last_applied is the very last step after all entries have been processed. By then the task has been through multiple await/reschedule cycles, and the accumulated scheduling wait shows up most prominently here.

This matches Iggy's description of work-stealing scheduling uncertainty exactly: what's being measured isn't the operation's own latency — it's the wait time in the tokio scheduling queue.


Our Thinking

Iggy's exploration gave us a clear problem framework: in high-concurrency, storage-intensive paths, tokio work-stealing's scheduling jitter is a structural problem — it cannot be solved by tuning parameters inside tokio.

RobustMQ's current architecture: the MQTT Broker is network-intensive, handling many concurrent client connections; meta-service is storage-intensive, with every state change going through raft consensus to RocksDB. Both run in the same tokio runtime, and resource contention amplifies each other at high concurrency.

The jitter we currently observe (5–11ms scheduling wait) occurs at a relatively low load of 1000 concurrent connections. As connection count grows, worker thread competition will intensify, tail latency distribution will widen, and worst-case will get worse. Iggy's data already proved this trend — P9999 deteriorates most severely under high load.

Whether thread-per-core + io_uring is the right direction for RobustMQ — we don't have a conclusion yet. This is a system-level architecture migration involving runtime replacement, network layer rewrite, and storage layer adaptation; Iggy spent two years. We're currently in the observation and problem-understanding phase, and need more benchmark data and deeper analysis before deciding whether and how to take this path.

🎉 Since you're already signed in to GitHub, why not give us a Star while you're at it! ⭐ Your support is our greatest motivation 🚀