
Where Does a Messaging System's Boundary Lie?

A few industry trends have caught my attention recently, and together they're worth thinking through carefully.

First, more and more systems are starting to use Lakehouse formats as the underlying storage for messages — writing data into formats like Iceberg and Delta Lake, where message consumption and SQL queries share the same data, hoping to break down the boundary between messaging systems and data analytics. Second, S3 object storage is becoming the primary storage choice for many systems rather than just cold archival storage — low cost, inherently highly available, with no need to pay extra for cross-availability-zone replication. Both Kafka and Pulsar are moving in this direction. Third, messaging systems themselves are beginning to extend toward stream semantics, where messages are no longer "fire and forget" but can be persisted, replayed, and consumed by offset.

The latest version of EMQX (6.x) is a concrete embodiment of this trend. It introduces message stream semantics on top of MQTT — consumers can control replay starting points via offset, it supports ordered-by-key delivery similar to Kafka's consumption model, yet is completely transparent to existing devices. It also natively supports message queues, decoupling producers from consumers, with support for offline message storage and flexible dispatch strategies — a capability set that rivals the queue semantics of RabbitMQ-style systems.
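The offset-replay semantics described above can be sketched in a few lines. This is an illustrative model only, not EMQX's actual implementation or API; all names here are hypothetical. The essential shift is that a message, once delivered, is no longer gone: it stays in an addressable log that any consumer can rewind.

```python
# Minimal sketch of offset-based replay semantics (illustrative only;
# not EMQX's real API — class and method names are hypothetical).

class Stream:
    """An append-only message stream addressable by offset."""

    def __init__(self):
        self._log = []

    def append(self, message):
        """Persist a message and return its offset in the stream."""
        self._log.append(message)
        return len(self._log) - 1

    def read_from(self, offset):
        """Replay all messages starting at a given offset."""
        return self._log[offset:]


stream = Stream()
for reading in ["t=21.5", "t=21.7", "t=22.0"]:
    stream.append(reading)

# A consumer controls its own starting point: it can rewind to offset 1
# instead of receiving only messages published after it connected.
print(stream.read_from(1))  # → ['t=21.7', 't=22.0']
```

This is what distinguishes stream semantics from classic "fire and forget" pub/sub: the broker keeps the log, and the consumer owns its read position.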

Taken together, these changes indicate that the boundaries of messaging systems are blurring — or rather, every system is answering the same question in its own way: how much should a messaging system be responsible for?

This prompted me to reconsider RobustMQ's own answer.

Where Is the Industry Heading?

Why would an MQTT broker support stream semantics? Behind this question lies a more fundamental one: where should a messaging system's boundary be?

IoT data is inherently a stream. Devices continuously report sensor readings, state changes, and triggered events — it is essentially an endless time series. Traditional message brokers solve the problem of "how to deliver data," but as business complexity grows, users increasingly care about "what to do with the data after it's delivered." So messaging systems have started extending toward stream processing, storage, and analytics — a natural evolutionary result for the entire industry.

This evolution has its own logic. When a user builds an IoT platform, they're not facing an isolated broker but a complete data pipeline: device ingestion, real-time consumption, persistent storage, historical querying — each stage needs tooling. If every stage requires separate selection, deployment, and operations, it's a heavy burden for small and mid-sized teams. So there is a natural commercial driver for messaging systems to extend toward both ends of the pipeline: it reduces the number of systems users need to maintain and lowers the overall barrier to entry.

The problem is that different systems extend their boundaries in different ways, and this ultimately determines the system's long-term shape. One approach is to continuously stack capabilities on top of existing protocols and features — as streaming demand comes in, add streaming; as queue demand comes in, add queues. The system's features grow increasingly comprehensive, but internally storage may be fragmented, and protocols need bridges to synchronize data. Another approach is to first think clearly about the underlying data model, then build the protocol layer on top of it — protocols are ways of accessing data, not ways of storing it.

RobustMQ chose the latter. Rather than stacking capabilities at the protocol layer, we rethink the problem from the storage layer: if there is only one copy of data at the bottom, and multiple protocols are simply different read/write views of that data, then the extensibility of the messaging system is completely different.

RobustMQ's Approach

RobustMQ's answer is: use a unified storage layer to hold a single copy of data, and provide different read/write views through multiple protocols.

The storage layer solves the question of "where data lives" — memory, RocksDB, File Segment, or in the future S3, Lakehouse; different scenarios use different engines, configurable down to the topic level. The protocol layer solves the question of "how data is used" — MQTT writes, Kafka consumes, AMQP ingests; different protocols access the same underlying data without any need to move or synchronize it between protocols.
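The model above can be sketched concretely. Every name in this example is hypothetical and chosen for illustration — this is not RobustMQ's actual API — but it captures the core claim: each protocol is a read/write view over one shared log, so nothing ever needs to be moved between protocols.

```python
# Illustrative sketch of "one copy of data, multiple protocol views".
# All names are hypothetical, not RobustMQ's real interfaces.

class TopicLog:
    """The single copy of data for a topic: an append-only record list."""

    def __init__(self):
        self.records = []


class MqttView:
    """MQTT acts as a write view: publish appends to the shared log."""

    def __init__(self, log):
        self.log = log

    def publish(self, payload):
        self.log.records.append(payload)


class KafkaView:
    """Kafka acts as a read view: poll reads the same log by offset."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        batch = self.log.records[self.offset:]
        self.offset = len(self.log.records)
        return batch


log = TopicLog()
MqttView(log).publish(b"sensor-reading")

# A record written over MQTT is immediately visible to the Kafka view —
# no pipeline and no sync delay, because both views share the same log.
consumer = KafkaView(log)
print(consumer.poll())  # → [b'sensor-reading']
```

The design choice worth noticing: the views hold no data of their own, only a reference to the log plus per-consumer state (the offset). That is what makes adding a new protocol cheap.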

These two things together constitute the basic shape of RobustMQ: a communication conduit that can be used by different protocols simultaneously, with storage that can be flexibly chosen based on the scenario. What it does is always what a messaging system should do — the boundary does not extend toward data processing, analytics, or visualization. Restraint here is not a slogan; it is a deliberate architectural choice.

Unified Storage: One Copy of Data, Multiple Forms

RobustMQ currently supports three underlying storage engines: in-memory storage, RocksDB, and File Segment. These three storage options correspond to different scenarios — in-memory storage pursues ultra-low latency and is suited to lightweight messaging scenarios like NATS; RocksDB provides reliable persistence and is suited to RabbitMQ-style queue semantics; File Segment is oriented toward high-throughput sequential writes and is the foundation for Kafka-style log-streaming scenarios.

Why support three storage engines instead of just one? Because no single storage engine is universal. Memory is fast but expensive and lost on power failure; RocksDB has good persistence, but sequential read throughput is lower than File Segment; File Segment excels at sequential appends of large volumes of data but doesn't shine at random access. Different protocols and scenarios have different storage requirements. Forcing a single storage to cover all scenarios means either compromising performance or incurring excessive cost.

The key, however, is not how many storage types are supported, but that these three storage options are transparent to the upper protocol layer. Users choose the appropriate storage strategy for their scenario, and the protocol layer doesn't need to care which engine is used underneath. Once this abstraction is properly established, adding a new storage form is simply adding an implementation — no changes to the protocol layer are needed.
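The shape of this abstraction can be sketched as follows. The interface and engine names here are hypothetical, not RobustMQ's actual code; the point is only that the protocol layer programs against one interface, so a new engine is one more implementation with zero protocol-layer changes.

```python
# Sketch of a pluggable storage abstraction (hypothetical names, not
# RobustMQ's real trait/interface). The protocol layer depends only on
# StorageEngine, never on a concrete engine.

from abc import ABC, abstractmethod


class StorageEngine(ABC):
    @abstractmethod
    def append(self, topic, record):
        """Append a record to a topic; return its offset."""

    @abstractmethod
    def read(self, topic, offset):
        """Read all records for a topic starting at an offset."""


class MemoryEngine(StorageEngine):
    """Ultra-low-latency engine: records live only in process memory."""

    def __init__(self):
        self._topics = {}

    def append(self, topic, record):
        self._topics.setdefault(topic, []).append(record)
        return len(self._topics[topic]) - 1

    def read(self, topic, offset):
        return self._topics.get(topic, [])[offset:]


def protocol_layer_write(engine: StorageEngine, topic, record):
    """The protocol layer calls the interface, never a concrete engine."""
    return engine.append(topic, record)


engine = MemoryEngine()
protocol_layer_write(engine, "sensors", "t=21.5")
print(engine.read("sensors", 0))  # → ['t=21.5']
```

Swapping in a RocksDB-backed or File-Segment-backed engine would mean writing another `StorageEngine` subclass; `protocol_layer_write` stays untouched.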

This is the most critical judgment in storage layer design: storage engines are replaceable, but the upper protocols' dependency on storage should be unified. Rather than maintaining a separate storage stack for each protocol, all protocols share the same storage abstraction.

Multi-Protocol: Not Feature Stacking, but Different Views of the Same Data

Once the storage layer design is understood, the value of multi-protocol support becomes very direct: because the underlying storage is unified, messages written via MQTT can be directly consumed by the Kafka protocol, and vice versa — without any data movement.

This sounds simple, but in practice many systems cannot achieve it. Traditional multi-protocol support is essentially "maintaining a separate storage stack for each protocol, then building data pipelines between protocols." Pipelines mean latency, data replication, intermediate states where data has just been written on the MQTT side but hasn't yet synced to the Kafka side, and additional operational overhead. Every additional protocol means one more pipeline to maintain.

RobustMQ's approach is different. MQTT in, Kafka out — or Kafka in, MQTT out — access the same data with no movement in between. Data is consumable the moment it's written, with no sync delay and no consistency window. For users, this is not an abstract architectural advantage but a very tangible difference in experience: the data flow pipeline is shorter, there are fewer places where things can go wrong, and there are fewer stages to debug.

This model can continue to scale. When AMQP, NATS, and other protocols are added, they also access the same underlying storage. Write in via Protocol A, and Protocols B, C, and D can all read it out. Behind each protocol lies an ecosystem — the IoT device ecosystem behind MQTT, the stream processing and data engineering ecosystem behind Kafka, the enterprise messaging ecosystem behind AMQP. These ecosystems previously required users to build their own data pipelines to interconnect them, but in RobustMQ they are naturally connected without any glue layer. The more protocols are supported, the greater the value of this unified view.

This difference shows up concretely in a typical IoT data platform, which previously had to maintain four separate systems: an MQTT broker, a Kafka cluster, a time-series database, and a data lake, with data pipelines needed between each pair — every pipeline a maintenance burden and a potential failure point. With RobustMQ, devices write in via MQTT; stream processing systems consume the same data directly using the Kafka protocol; complex analytics flow naturally out via the Kafka protocol to specialized tools. The pipeline is shorter and the seams are fewer.

RobustMQ will not do what Flink does, nor what GreptimeDB does. Clear boundaries are what allow us to do our own work well.

Features can be copied, but once a structural design is well executed, the cost of catching up is very high. This is RobustMQ's most fundamental judgment on the multi-protocol front: the most valuable messaging infrastructure of the future will not be the one with the most features, but the one with the fewest seams.

Storage Engine Extensibility Is a Long-Term Bet

The extensibility of the storage layer is a long-term bet for RobustMQ.

Two industry trends are particularly worth watching. One is that more and more systems are sinking data down to object storage like S3, driven by a straightforward rationale: local disks are expensive, S3 is cheap, and you don't need to pay separately for cross-availability-zone data replication, since data stored in object storage is inherently highly available. Kafka's Tiered Storage, Pulsar's BookKeeper tiering, and various cloud-native messaging systems are all moving in this direction. The other is the rise of the Lakehouse: writing data into Lakehouse formats (Delta Lake, Iceberg, etc.) allows data to be both consumed by stream processors and directly queried by SQL engines, with one copy of data satisfying multiple consumption patterns. Both directions have their limitations — S3 has high random-read latency and is unsuitable for real-time consumption; a Lakehouse has higher write cost and query latency than a dedicated messaging system — but both are solving real problems, and real users are using them.

From this, a deeper judgment can be drawn: no single storage model can serve as a unified communication conduit. Memory is suited to low-latency scenarios, local disks to high-throughput streaming, S3 to low-cost cold data, and Lakehouse to scenarios that need both consumption and querying. Each storage type is reasonable in its corresponding scenario; forcing a single storage to cover all scenarios inevitably means compromising on some dimension.

This is precisely why RobustMQ chose a pluggable storage engine, with granularity down to per-topic configurability — different topics can choose different storage strategies. High-frequency real-time data uses memory or File Segment; data that needs to be retained long-term uses S3; data that needs analytical querying is written to Lakehouse. Users don't need to deploy different systems for different data types — the messaging system itself selects the appropriate storage engine based on configuration, and the upper protocol layer notices no difference.
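Per-topic routing of this kind can be sketched with a simple lookup. The configuration keys and engine names below are illustrative, not RobustMQ's actual configuration format; the idea is only that each topic maps to a storage strategy, with a default for everything else.

```python
# Sketch of per-topic storage selection (keys and engine names are
# hypothetical, not RobustMQ's real configuration format).

TOPIC_STORAGE_CONFIG = {
    "vehicle/telemetry": "memory",     # high-frequency real-time data
    "audit/events": "s3",              # long-term, low-cost retention
    "metrics/daily": "lakehouse",      # data that needs SQL analytics
}

DEFAULT_ENGINE = "file_segment"        # high-throughput sequential log


def engine_for(topic):
    """Route a topic to its configured storage engine."""
    return TOPIC_STORAGE_CONFIG.get(topic, DEFAULT_ENGINE)


print(engine_for("vehicle/telemetry"))   # → memory
print(engine_for("devices/heartbeat"))   # → file_segment
```

The protocol layer never consults this table directly; it writes through the storage abstraction, and the routing happens beneath it.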

Laying this storage abstraction now is preparation for these future directions. When hot-cold tiering becomes an essential requirement, the cost of evolution will be relatively low, because the architecture reserved space for it from the beginning. Each new storage engine added grows the capability by another increment — without ever needing to change the protocol layer. This kind of extensibility is not over-engineering; it is a realistic judgment about future trends.

What Do Users Get?

Putting all of this together, what users get is: building a messaging infrastructure no longer requires choosing between an MQTT broker, a Kafka cluster, and a queue system, nor maintaining the data synchronization pipelines between them. Devices write in via MQTT, stream processing systems consume via Kafka, enterprise systems ingest via AMQP — each uses the approach it knows, and all access the same underlying data. Where data is stored and on what medium is configured per topic; the system handles it.

Operational complexity drops because fewer systems mean fewer seams between them, and fewer seams mean lower debugging cost when something goes wrong. When specialized tools are needed for complex analytics, the Kafka protocol is already the natural exit — no extra adaptation needed.

This is not the most feature-rich solution — it is a structurally simpler solution. A unified messaging system that adapts to different scenarios and is genuinely easy to use. That's all.

🎉 Since you're already logged in to GitHub, why not give us a Star while you're at it! ⭐ Your support is our greatest motivation 🚀