In-Depth Technical Analysis and Reflections on FlowMQ
Disclaimer: This article uses technical reasoning based on an architecture speculated from FlowMQ's public documentation, for personal technical learning and reflection only. It does not represent FlowMQ's actual implementation details. All analysis is logical inference and does not guarantee consistency with FlowMQ's internal architecture, nor does it constitute any disparagement of EMQ or FlowMQ. Technical discussion is welcome.
I. Can AMQP Messages Be Stored in FoundationDB?
FlowMQ uses FoundationDB for metadata storage. Could Queue (AMQP semantics) message bodies also be stored in FoundationDB?
Technically yes, but there are hard limits. FoundationDB caps a single value at 100KB and a single transaction at 10MB with a 5-second timeout. If AMQP message bodies are ordinary business messages (a few hundred bytes to a few KB), they can certainly be stored there; most message bodies in enterprise AMQP scenarios fall within this range.
But Queue semantics put pressure on FoundationDB. The core operations for a Queue are: write → competing consume → ACK → delete. Each message requires at least two write transactions (one for writing, one for deletion), doubling the transaction volume in high-throughput scenarios. FoundationDB is optimized for OLTP small transactions; individual message operations are fine, but at message queue scales of millions of QPS, transaction overhead becomes a bottleneck.
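The two-transactions-per-message cost can be sketched with a toy in-memory stand-in for FoundationDB. All names here are hypothetical; real FDB transactions are optimistic, retried, and distributed, so this only illustrates the transaction-count arithmetic:

```python
# Illustrative sketch: a dict with "atomic" operations stands in for
# FoundationDB. Each queued message costs one write transaction and
# one delete transaction, doubling transaction volume.
import threading

class MiniKV:
    """Stand-in for a transactional KV store (not real FoundationDB)."""
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()
        self.txn_count = 0  # committed transactions

    def transact(self, fn):
        # Real FDB retries optimistic transactions; a lock suffices here.
        with self._lock:
            result = fn(self._data)
            self.txn_count += 1
            return result

def enqueue(kv, queue, msg_id, body):
    # Transaction 1: write the message body under a queue-scoped key.
    kv.transact(lambda d: d.__setitem__(f"{queue}/{msg_id}", body))

def ack_and_delete(kv, queue, msg_id):
    # Transaction 2: delete on ACK.
    kv.transact(lambda d: d.pop(f"{queue}/{msg_id}", None))

kv = MiniKV()
for i in range(1000):
    enqueue(kv, "orders", i, b"payload")
for i in range(1000):
    ack_and_delete(kv, "orders", i)

print(kv.txn_count)  # 2000 transactions for 1000 messages: 2x amplification
```

At millions of messages per second, that 2x factor lands directly on the OLTP-tuned transaction pipeline.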
There is also a hidden issue: FoundationDB keys are ordered. If a Queue uses incrementing keys for message ordering, all writes will concentrate at the tail of the same range, creating a write hotspot. Distributed KVs are most vulnerable precisely to sequential writes concentrating on a single shard. To solve this you would need to hash the keys, but hashing destroys queue ordering.
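A minimal sketch of the trade-off, with key layouts of my own invention rather than FlowMQ's actual schema:

```python
# Sketch of the ordering-vs-hotspot trade-off (key layouts are
# hypothetical, not FlowMQ's actual schema).
import hashlib

def sequential_key(queue: str, seq: int) -> bytes:
    # Ordered, zero-padded keys: a range scan returns messages in order,
    # but every write lands at the tail of the same key range (hotspot).
    return f"{queue}/{seq:016d}".encode()

def hashed_key(queue: str, seq: int) -> bytes:
    # A hash prefix spreads writes across shards, but lexicographic key
    # order no longer matches message order: queue ordering is lost.
    prefix = hashlib.sha256(f"{queue}/{seq}".encode()).hexdigest()[:8]
    return f"{prefix}/{queue}/{seq:016d}".encode()

seq_keys = [sequential_key("q", i) for i in range(5)]
print(seq_keys == sorted(seq_keys))  # True: key order == message order

hash_keys = [hashed_key("q", i) for i in range(5)]
# Sorting hash_keys interleaves messages from arbitrary positions; a
# range scan can no longer replay the queue in order.
```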
The conclusion is that it is technically possible to store there, but it is not the optimal solution. Small-scale, low-throughput is fine; at large-scale, high-throughput, transaction overhead and write hotspots become bottlenecks.
II. How Should Queue Messages Be Handled?
Within FlowMQ's architecture, the more sensible approach is most likely: Queue message bodies are stored in S3, with metadata stored in FoundationDB.
Message bodies are batch-appended in segments to S3 (reusing the Stream's KaS3 module), while FoundationDB only stores Queue control information — message index (which segment and offset in S3 a given message lives at), consumption state (unconsumed/delivered/ACKed), consumer assignment relationships, and retry counts.
On the write path, message bodies are batch-appended to S3, sharing the same storage path and write logic as Stream. FoundationDB only handles lightweight index writes, avoiding the 100KB value limit and transaction pressure.
On the consumption path, consumers fetch the index of the next unconsumed message from FoundationDB, then read the message body from S3 or page cache. After ACKing, the consumption state is marked in FoundationDB, and ACKed indexes and S3 segments are periodically batch-cleaned.
Competing consumption leverages FoundationDB's transaction capability — atomically changing a message's state from "unconsumed" to "delivered to consumer X" in a transaction, naturally preventing duplicate consumption.
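The atomic state transition can be sketched with a lock-protected dict standing in for one FoundationDB transaction (state names and the API are hypothetical):

```python
# Sketch of competing consumption via a transactional compare-and-set on
# a message's state. Real FoundationDB uses optimistic transactions with
# conflict detection; a lock plays that role here.
import threading

class MetaStore:
    def __init__(self):
        self._state = {}          # msg_id -> "unconsumed" | ("delivered", consumer)
        self._lock = threading.Lock()

    def add(self, msg_id):
        self._state[msg_id] = "unconsumed"

    def try_claim(self, msg_id, consumer):
        """Atomically flip unconsumed -> delivered; False if we lost the race."""
        with self._lock:  # stands in for one FDB transaction
            if self._state.get(msg_id) == "unconsumed":
                self._state[msg_id] = ("delivered", consumer)
                return True
            return False

store = MetaStore()
store.add("m1")
print(store.try_claim("m1", "consumer-A"))  # True: A wins the message
print(store.try_claim("m1", "consumer-B"))  # False: B loses, no duplicate delivery
```

Whichever transaction commits first wins the message; the loser's transaction sees the changed state and simply moves on to the next message.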
But the trade-off is that consuming a single message requires two I/Os: first read FoundationDB for the index, then read S3 for the message body. When the underlying storage is not unified, you are constantly coordinating across multiple storage systems.
III. How Are Stream (Kafka-Compatible) Messages Handled?
Write Path
When a producer sends a message, the Broker does not write to S3 for every single message — S3 writes have too coarse a granularity and too high a latency. Most likely, the Broker accumulates a batch in memory, and once the batch reaches a certain size or time threshold, writes the entire batch as a single segment file to S3. This is corroborated by the kas3_segment_size monitoring metric — segment is the basic storage unit on S3.
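Under those assumptions, the batching logic might look like this sketch (thresholds, class names, and the flush callback are all illustrative, not FlowMQ's):

```python
# Sketch of size/time-threshold batching before an S3 segment write.
import time

class SegmentBatcher:
    def __init__(self, flush, max_bytes=4 * 1024 * 1024, max_wait_s=0.2):
        self.flush = flush            # callback: writes one segment to S3
        self.max_bytes = max_bytes
        self.max_wait_s = max_wait_s
        self.buf, self.size, self.opened = [], 0, None

    def append(self, msg: bytes):
        if self.opened is None:
            self.opened = time.monotonic()
        self.buf.append(msg)
        self.size += len(msg)
        # Flush when the batch is big enough or has waited long enough.
        if self.size >= self.max_bytes or time.monotonic() - self.opened >= self.max_wait_s:
            self._flush()

    def _flush(self):
        if self.buf:
            self.flush(b"".join(self.buf))  # one segment = one S3 PUT
            self.buf, self.size, self.opened = [], 0, None

segments = []
b = SegmentBatcher(segments.append, max_bytes=10, max_wait_s=60)
for _ in range(4):
    b.append(b"abc")   # 3 bytes each; the flush triggers at >= 10 bytes
print(len(segments), segments[0])  # 1 b'abcabcabcabc'
```

The two thresholds are the latency/throughput dial: a larger `max_bytes` means fewer, cheaper S3 PUTs but a longer window in which unflushed messages live only in Broker memory.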
Index Management
After each segment is written to S3, index information is recorded in FoundationDB: the segment's path on S3, the offset range it contains, the partition it belongs to, and its creation time.
FoundationDB's ordered KV feature is very useful here. The key design is probably {topic}/{partition}/{start_offset} → value is the segment's S3 path and metadata. When a consumer wants to read a message at a certain offset, it first does a range query in FoundationDB to locate the corresponding segment, then goes to S3 to read that segment file.
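The speculated lookup can be sketched with a plain dict standing in for FoundationDB's ordered keyspace; the range query becomes a bisect over segment start offsets (paths and the index layout are assumptions):

```python
# Sketch of the speculated {topic}/{partition}/{start_offset} index and
# the offset -> segment lookup (layout and paths are hypothetical).
import bisect

# key: (topic, partition, start_offset) -> value: segment metadata
index = {
    ("clicks", 0, 0):    {"s3_path": "s3://bucket/clicks/0/seg-0",    "end": 999},
    ("clicks", 0, 1000): {"s3_path": "s3://bucket/clicks/0/seg-1000", "end": 1999},
    ("clicks", 0, 2000): {"s3_path": "s3://bucket/clicks/0/seg-2000", "end": 2999},
}

def locate_segment(index, topic, partition, offset):
    """Find the segment whose [start_offset, end] range covers `offset`."""
    starts = sorted(k[2] for k in index if k[:2] == (topic, partition))
    i = bisect.bisect_right(starts, offset) - 1  # last start_offset <= offset
    if i < 0:
        return None
    return index[(topic, partition, starts[i])]

seg = locate_segment(index, "clicks", 0, 1500)
print(seg["s3_path"])  # s3://bucket/clicks/0/seg-1000
```

In real FoundationDB this is a single reverse range read ending at the requested offset, which is exactly the access pattern an ordered keyspace is good at.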
Offset Management
Consumer group committed offsets are stored in FoundationDB, with keys probably structured as {group_id}/{topic}/{partition} → value is the current committed offset. Each commit is a single FoundationDB transaction write; the frequency is low (usually batch commits), well within transactional capability.
Read Path
The flow for a consumer to fetch messages:
1. Read the current committed offset of the consumer group from FoundationDB
2. Locate the corresponding segment in the FoundationDB index based on the offset
3. Check local page cache first; return immediately on a hit
4. On a cache miss, pull the segment from S3, put it in page cache, then return the messages
The kas3_page_cache_hit_count and kas3_page_cache_miss_count monitoring metrics count the cache hits and misses of steps 3 and 4 respectively.
Retention and Compaction
Cleaning up expired segments: the Broker periodically scans the indexes in FoundationDB to find segments that have exceeded the retention time, deletes the S3 files first, then deletes the FoundationDB indexes. Deleting S3 first, then the index: if a crash occurs between the two, at worst you're left with an index pointing to a nonexistent file, which gets cleaned up in the next pass. The reverse order — deleting the index first — would leave orphaned files on S3.
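The deletion ordering and its crash behavior can be shown in a small sketch (hypothetical names; the crash is simulated by an early return):

```python
# Sketch of the delete-S3-first ordering and why it is crash-safe:
# a crash between the two steps leaves only a dangling index entry,
# which the next cleanup pass removes.
def cleanup_expired(index, s3, now, retention_s, crash_after_s3=False):
    for seg_id, meta in list(index.items()):
        if now - meta["created"] < retention_s:
            continue
        s3.pop(seg_id, None)        # step 1: delete the S3 file first
        if crash_after_s3:
            return                  # simulate a crash between the steps
        del index[seg_id]           # step 2: then delete the index entry

index = {"seg-0": {"created": 0}}
s3 = {"seg-0": b"data"}

cleanup_expired(index, s3, now=100, retention_s=10, crash_after_s3=True)
print("seg-0" in s3, "seg-0" in index)   # False True: dangling index only

cleanup_expired(index, s3, now=100, retention_s=10)  # next pass cleans up
print("seg-0" in index)                  # False
```

Reversing the two steps would turn the same crash into an S3 file that no index entry points to, which no scan of the index can ever find again.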
Compaction (kas3_online_running_tasks) most likely involves merging small segments into larger ones, reducing the S3 file count and index entries, and improving read efficiency.
Problems with This Approach
- Every consumption requires at least two I/Os: one FoundationDB index query and one S3/page cache data read.
- Write latency is bound by the batch policy. While a batch is not yet full, messages live only in Broker memory, and a crash can lose them; closing that gap would mean also writing a WAL to FoundationDB during the batch window, adding yet another write.
- Index bloat: supporting advanced features such as consumption by timestamp or key-based lookup makes index volume grow rapidly.
IV. As a FlowMQ Architect, How Would You Move Forward in the Short and Long Term?
Short Term (6–12 months): Survive First
Don't change the architecture. The current FoundationDB + S3 + routing-engine stack runs and can be sold as-is; it needs no major surgery. The core goal is commercial validation:
- Push MQTT and Kafka compatibility to excellence so customers can directly replace their existing clients
- Land 3–5 reference production case studies with benchmark customers; IoT + Kafka data pipelines are the easiest story to tell
- Publish performance benchmarks publicly, benchmarking against EMQX, Kafka, and RabbitMQ
- Don't touch the storage layer; first prove the product has a market
Medium Term (1–2 years): Reduce External Dependencies
Once the product is established, start addressing architectural technical debt:
Reduce FoundationDB's role. Build a lightweight embedded metadata storage module in-house, first taking over the high-frequency write paths (consumer offsets, consumption state), then gradually demoting FoundationDB to storing only low-frequency cluster configuration and routing-table changes. This mirrors Kafka's gradual migration away from ZooKeeper.
Unify the write path. Currently there are three separate paths: Subscription uses in-memory, Stream uses S3, Queue uses FoundationDB. Design a unified local write layer — all messages first write to a local WAL (in-house), then are asynchronously flushed to S3 for persistence. First unify the Subscription and Stream write paths, then bring Queue in line later.
Reduce write amplification. On top of the unified write layer, explore the possibility of "one write, multiple consumption views." No rush to eliminate the routing engine, but start laying the groundwork for its eventual retirement.
Long Term (2–3 years): Move Toward Unified Storage
If the medium-term approach is validated as feasible, the long-term goal is to fully replace FoundationDB with a completely in-house unified storage engine. Message bodies, indexes, metadata, and consumption state all live in the same storage engine; Raft handles replica synchronization; an embedded local KV handles single-node storage. S3 is demoted to an optional cold data archival tier. The routing engine gradually weakens, eventually becoming a lightweight Topic alias mapper. Single binary deployment, no dependency on any external distributed system.
But This Long-Term Goal Most Likely Won't Happen
The reason is practical: EMQ is a commercial company, and FlowMQ's current architecture can generate revenue and serve customers. The ROI of rewriting the storage layer is hard to justify. It took Kafka several years to move from ZooKeeper to KRaft, driven by enormous market pressure and engineering resources. As a new product, FlowMQ won't face that kind of pressure in the short term.
The more likely outcome is: some optimizations happen in the medium term that reduce FoundationDB's pressure and let the product handle larger loads, then it stabilizes at that point. A fundamental storage layer refactor will always be on the roadmap but will perpetually be pushed to the next version.
This is also why a compositional architecture, once chosen, is very hard to walk back from. It's not that no one wants to change it — it's that the commercial pace doesn't allow it. Once you have customers and revenue, then want to change the underlying layer, the cost has become enormous. Meanwhile, projects that chose to build their own storage layer from day one start slower, but they won't have this historical baggage.
V. If You Were an Independent Architect, How Would You Design from Scratch?
Stepping outside the constraints of FlowMQ, as an independent infrastructure architect given the "unified messaging platform" challenge, the thinking process would be:
First Define the Essence of the Problem
The three consumption semantics (Pub/Sub, Stream, Queue) look very different on the surface, but when abstracted one level down, the common thread is: all are sequential append writes of messages + different consumption cursor management strategies.
- Stream's offset is a read position maintained by the consumer itself
- Queue's ACK is also fundamentally a consumption cursor, just with state (unconsumed/delivered/acknowledged) and supporting competing allocation
- Pub/Sub real-time push can be understood as a consumption cursor that always chases the latest
The unified storage model is: append-only log + pluggable consumption cursor strategy. Messages are written only once; different consumption semantics are just different cursor management approaches.
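A minimal sketch of this model, entirely my own illustration rather than any real system's API:

```python
# Sketch of "append-only log + pluggable consumption cursor": messages
# are written once; each semantic is just a cursor strategy over the log.
import threading

class Log:
    def __init__(self):
        self.entries = []
    def append(self, msg):
        self.entries.append(msg)
        return len(self.entries) - 1         # offset of the new entry

class StreamCursor:
    """Kafka-style: each consumer owns its own read position."""
    def __init__(self, log):
        self.log, self.pos = log, 0
    def poll(self):
        msgs = self.log.entries[self.pos:]
        self.pos = len(self.log.entries)
        return msgs

class QueueCursor:
    """AMQP-style: competing consumers claim offsets atomically."""
    def __init__(self, log):
        self.log, self.next, self.lock = log, 0, threading.Lock()
    def claim(self):
        with self.lock:                      # one message goes to one consumer
            if self.next < len(self.log.entries):
                msg = self.log.entries[self.next]
                self.next += 1
                return msg
            return None

log = Log()
stream, queue = StreamCursor(log), QueueCursor(log)
for m in (b"a", b"b", b"c"):
    log.append(m)                            # written exactly once

print(stream.poll())                         # [b'a', b'b', b'c'] -- replayable view
print(queue.claim(), queue.claim())          # b'a' b'b' -- competing view
```

A Pub/Sub cursor would be a third strategy that always starts at the log's tail. The log never knows which semantics sit above it.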
Build the Storage Engine Yourself
A general KV cannot be optimized to the extreme for sequential appends. The write pattern of a messaging system is extremely special — almost entirely sequential appends, with no random updates. You can use io_uring for asynchronous batch flushing to disk, zero-copy from disk to network, and segment size aligned to SSD characteristics. None of these optimizations are achievable on top of a general-purpose distributed KV.
The distributed protocol also needs to be built in-house. Use Raft for replica synchronization, optimized for messaging workloads. Use an embedded KV (like RocksDB) for the single-node storage engine; metadata, indexes, and message bodies are all in the same process, with a single write completing atomically.
A Thin Protocol Layer
With a unified storage model, the protocol layer is very thin. The MQTT adapter translates publish to append, and subscribe to registering a consumption cursor. The Kafka adapter translates produce to append, and fetch to reading by offset. The AMQP adapter translates basic.publish to append, and basic.consume to a competing consumption cursor.
No routing engine needed, no Destination abstraction needed. The protocol layer only does translation; the storage layer only does append + read; consumption semantics are determined by consumption cursor strategy.
S3 as an Optional Tier
S3 is not the primary storage but a cold data archival tier. Hot data lives on local disk; expired segments can optionally be archived to S3 or deleted directly. Edge scenarios, private deployments, and environments without S3 can still run.
Cost vs. Benefit
The cost is difficulty. Building your own storage engine, distributed protocol, and replica management is an order of magnitude more work.
But the ceiling is high. Full control over the storage layer makes extreme optimizations possible. Single binary deployment means minimal operations. No external dependencies means the smallest possible failure domain. A unified data model makes upper-layer extension natural.
VI. Summary
FlowMQ's architecture is reasonable from an engineering delivery standpoint; the FoundationDB + S3 combination enables fast product delivery. But the long-term cost of a compositional architecture is: high cross-system coordination complexity, write amplification, optimization ceilings constrained by external components, and uncontrollable storage layers.
From first principles, the core of a unified messaging platform should be a unified storage model — an append-only log + pluggable consumption cursor strategy. Data is written once; consumption semantics are differentiated at the read layer. The storage engine is built and controlled in-house; the protocol layer only does translation.
The two paths serve different constraints. With a team, customers, and delivery pressure, choosing a compositional architecture is pragmatic. With patience for the long game, building your own storage layer is the harder but higher-ceiling path.
