"It's just sending messages back and forth. How hard can it be?"
This is the deceptive promise that draws countless engineering teams into disaster. On the surface, every chat feature looks like the same three steps: a user sends a message, it reaches a server, and it gets delivered to another user. A prototype built on that model comes together quickly, and that early success is misleading.
The real challenge isn't sending one message; it's managing the state and context around billions of them. The architecture that works beautifully for a simple 1:1 chat will actively sabotage you when you need to support groups.
The State Explosion: 1:1 vs. N:N
This is the heart of the problem. The data models and state management are fundamentally different beasts.
| Feature | 1:1 Chat (Simple) | Group Chat (Complex Beast) |
|---|---|---|
| Membership | Two participants. Static. | N participants. Dynamic (joins, leaves, kicks). Requires a robust permissions system. |
| Read Status | A single boolean: is_read. Easy. | A nightmare. Each of N members has their own last_read_message_id. You are now tracking N read horizons per conversation. |
| Mentions | Not applicable. | Requires parsing message content and generating notifications for specific users. A whole new query path. |
| Permissions | Not applicable. | Who can change the group name? Who can invite new members? Who can mute others? This is a complex ACL system in itself. |
A 1:1 chat is a two-party agreement. A group chat is a chaotic, multi-party negotiation where state is always in flux.
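To make the "N read horizons" row concrete, here is a minimal in-memory sketch. The `GroupConversation` class, its method names, and the use of increasing integers as message IDs are all assumptions made for illustration; a real system would keep this state in a datastore.

```python
from dataclasses import dataclass, field


@dataclass
class GroupConversation:
    """Hypothetical model: message IDs are increasing integers."""

    message_ids: list[int] = field(default_factory=list)
    # One read horizon per member: user_id -> last_read_message_id.
    last_read: dict[str, int] = field(default_factory=dict)

    def join(self, user_id: str) -> None:
        # Assumption: new members start caught up to the newest message.
        self.last_read[user_id] = self.message_ids[-1] if self.message_ids else 0

    def leave(self, user_id: str) -> None:
        # Membership is dynamic, so horizons must be cleaned up too.
        self.last_read.pop(user_id, None)

    def post(self, message_id: int) -> None:
        self.message_ids.append(message_id)

    def mark_read(self, user_id: str, message_id: int) -> None:
        # Horizons only move forward, so a late or duplicate ack is harmless.
        self.last_read[user_id] = max(self.last_read[user_id], message_id)

    def unread_count(self, user_id: str) -> int:
        horizon = self.last_read[user_id]
        return sum(1 for m in self.message_ids if m > horizon)
```

A 1:1 chat needs none of this bookkeeping. Here every join, leave, and read acknowledgement touches per-member state, which is why the read path grows with group size.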
The Tyranny of the Fan-Out
Message delivery changes just as drastically. In a 1:1 chat, it's one write, one push. In a group with 10,000 members, a single message fans out into nearly 10,000 deliveries. That innocent-looking "is typing..." indicator? In a large group, every typing event fans out the same way, and it can trigger an event storm that melts your servers.
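The fan-out arithmetic is worth doing once by hand. The sketch below is a back-of-the-envelope model, not a measurement: the function names, the figure of 50 simultaneous typers, and the 3-second heartbeat are illustrative assumptions.

```python
def deliveries_per_message(members: int) -> int:
    # Every member except the sender needs a copy.
    return max(members - 1, 0)


def typing_events_per_second(members: int, typers: int, heartbeat_s: float) -> float:
    # Each active typer re-announces every heartbeat_s seconds, and every
    # announcement fans out to all other members of the group.
    return typers * deliveries_per_message(members) / heartbeat_s


one_to_one = deliveries_per_message(2)            # 1 delivery
big_group = deliveries_per_message(10_000)        # 9,999 deliveries
storm = typing_events_per_second(10_000, 50, 3.0) # 166,650 events/sec
```

Under these assumptions, 50 people typing in one 10,000-member group generate about 166,650 deliveries per second before a single real message is sent.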
Anatomy of a Production-Ready Solution
To solve these problems, a monolith won't cut it. You need a decoupled crew of specialists. Let's design for a target of 1 million concurrent users, a peak of 100,000 messages/sec, and a sub-100ms p99 delivery latency.¹
                              +-------------------+
                              |  [ API Gateway ]  |
                              +---------+---------+
                                        | (Session Auth)
                                        |
        +-------------------------------+-------------------------------+
        | (Real-time / Fast Path)       | (Durable / Safe Path)         |
        |                               |                               |
        v                               v                               v
+----------------+                 +---------+                +-------------------+
|  [ NATS Bus ]  |                 | [Redis] |                |   [ Kafka Log ]   |
+-------+--------+                 +---------+                +---------+---------+
        | (Live Messages, Fan-out)      ^ (Read State, Typing...)       | (Message Archive)
        |                               |                               |
        v                               |                               v
+--------------------+                  |                   +--------------------------+
|  [ Online Users ]  |------------------+                   |  [Message Persister Svc] |
+--------------------+                                      +-----------+--------------+
                                                                        | (Batch ingest)
                                                                        v
                                                              +------------------+
                                                              | [ScyllaDB/CASS]  |
                                                              +------------------+
1. The Hybrid Message Bus (NATS + Kafka)
Separate real-time delivery from durable, offline delivery.
- For Online Users (Fast Path): Use a lightweight pub/sub system like NATS. When a user sends a message, it's published to a NATS subject for that channel. Every connected client subscribed to that subject gets it immediately. This is fire-and-forget; its only job is speed.
- For Offline Users (Safe Path): In parallel, the same message is published to a durable log like Apache Kafka, often with a 7-day log retention policy. This is the system of record. For more advanced use cases, check out tools like NATS JetStream, the persistence layer built into NATS.
Trade-off: Why NATS over Redis Streams? NATS is a specialist scalpel—purpose-built for high-performance, at-most-once messaging. It's simpler and faster for this specific "fire-and-forget" job, leaving Redis free to handle its core responsibility: state.
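The split between the two paths can be sketched with in-memory stand-ins. Nothing below talks to NATS or Kafka: `FastPath`, `SafePath`, `send`, and the `chat.{channel_id}` subject naming are hypothetical, and the sketch writes to the two paths one after the other rather than truly in parallel.

```python
from collections import defaultdict
from typing import Callable


class FastPath:
    """Stand-in for a pub/sub bus: at-most-once delivery, no storage."""

    def __init__(self) -> None:
        self._subs: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, subject: str, handler: Callable[[dict], None]) -> None:
        self._subs[subject].append(handler)

    def publish(self, subject: str, msg: dict) -> None:
        for handler in self._subs[subject]:
            try:
                handler(msg)
            except Exception:
                pass  # Fire-and-forget: a failed delivery is never retried.


class SafePath:
    """Stand-in for a durable, append-only log."""

    def __init__(self) -> None:
        self.log: list[dict] = []

    def append(self, msg: dict) -> int:
        self.log.append(msg)
        return len(self.log) - 1  # Offset of the appended record.


def send(fast: FastPath, safe: SafePath, channel_id: str, msg: dict) -> int:
    # Sketch choice: append to the durable log first, so a message that was
    # shown to online users is never missing from the record.
    offset = safe.append({"channel_id": channel_id, **msg})
    fast.publish(f"chat.{channel_id}", msg)
    return offset
```

An offline user, or an online one whose handler failed, never sees the fast-path copy. They catch up later by reading the log from their last known offset.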
2. The Wide-Column Message Store (ScyllaDB)
Storing billions of messages in a relational database is a recipe for operational pain. A wide-column store like ScyllaDB or Apache Cassandra is the right tool.
-- A simplified Cassandra/ScyllaDB table schema
CREATE TABLE messages (
    channel_id uuid,
    message_id timeuuid,
    author_id  uuid,
    content    text,
    PRIMARY KEY ((channel_id), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
Trade-off: Why ScyllaDB over DynamoDB? While DynamoDB is a fantastic managed service, self-hosting ScyllaDB gives you extreme performance, no vendor lock-in, and can be significantly more cost-effective at massive scale. For those migrating, ScyllaDB also offers a DynamoDB-compatible API called Alternator to ease the transition.
3. The Decoupled State Tracker (Redis)
High-volume, low-value state changes will hammer your main database into submission. Offload them to an in-memory store like Redis.
- Read Receipts: Don't write a new row for every read. Instead, store a "read watermark" in Redis: a single SET operation to read_watermark:{user_id}:{channel_id} with the last message_id.
- Presence: For "is typing..." events, use Redis keys with a short, 3-second TTL. A client's "typing" event creates a key that expires automatically. It's the digital equivalent of just walking away when you're done; no "stopped typing" event needed.
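Both patterns reduce to "set a key, optionally with an expiry." The sketch below uses a tiny in-memory stand-in rather than a real Redis client, so it runs anywhere. `TtlStore`, the helper functions, and the `typing:{channel_id}:{user_id}` key layout are assumptions; only the `read_watermark:{user_id}:{channel_id}` key and the 3-second TTL come from the text above.

```python
import time
from typing import Callable, Optional


class TtlStore:
    """In-memory stand-in for SET, SET with an expiry, and GET."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: dict[str, tuple[object, Optional[float]]] = {}

    def set(self, key: str, value: object, ex: Optional[float] = None) -> None:
        expires_at = None if ex is None else self._clock() + ex
        self._data[key] = (value, expires_at)

    def get(self, key: str) -> object:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # Lazily drop the expired key.
            return None
        return value


def mark_read(store: TtlStore, user_id: str, channel_id: str, message_id: int) -> None:
    # One overwrite per read; no new row, no history.
    store.set(f"read_watermark:{user_id}:{channel_id}", message_id)


def typing(store: TtlStore, user_id: str, channel_id: str) -> None:
    # Hypothetical key layout; the key vanishes 3 seconds after the last event.
    store.set(f"typing:{channel_id}:{user_id}", 1, ex=3)


def is_typing(store: TtlStore, user_id: str, channel_id: str) -> bool:
    return store.get(f"typing:{channel_id}:{user_id}") is not None
```

One caveat of a plain overwrite: an out-of-order acknowledgement can move the watermark backward, so a production version would likely compare before writing.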
Don't Forget the Gremlins
Once basic delivery works, a new class of problems appears.
- Message Edits/Deletes: You can't just DELETE a row; that breaks the immutable log. Instead, you write a "tombstone" event (e.g., a message with a deleted: true flag) to mark it as gone.
- Large Files: Never pipe blobs through your real-time bus. Have your server generate a pre-signed URL, let the client upload directly to S3/Cloud Storage with it, and send only the file link as the message.
- Search: Your primary database is not a search engine. Ship your message data to a dedicated service like Elasticsearch for indexing and full-text search capabilities.
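The tombstone idea from the first bullet can be sketched as a fold over an append-only log. The event shape is an assumption: each event carries a `message_id`, an edit is modeled as a later event with the same ID, and a delete is an event with `deleted: True`.

```python
def materialize(log: list[dict]) -> list[tuple[int, str]]:
    """Fold an append-only event log into the messages a reader should see."""
    latest: dict[int, dict] = {}
    for event in log:
        mid = event["message_id"]
        if latest.get(mid, {}).get("deleted"):
            continue  # A tombstone is final; later events cannot resurrect it.
        latest[mid] = event  # Later events (edits, tombstones) supersede earlier ones.
    # Dicts keep first-insertion order, so messages stay in original send order.
    return [(mid, e["content"]) for mid, e in latest.items() if not e.get("deleted")]


log = [
    {"message_id": 1, "content": "hi"},
    {"message_id": 2, "content": "teh plan"},
    {"message_id": 2, "content": "the plan"},  # edit
    {"message_id": 1, "deleted": True},         # tombstone
]
visible = materialize(log)  # [(2, "the plan")]
```

Nothing is ever removed from `log`; the deleted message simply stops appearing in the materialized view.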
Locking the Doors
Production systems have rules.
- Encryption: End-to-End Encryption (E2EE) is the gold standard. At minimum, you need TLS for encryption-in-transit and transparent encryption-at-rest on your databases and file storage.
- Compliance: You need a strategy for GDPR's "right to be forgotten." This means having a process to scrub user data, which is much easier when you know exactly which services hold PII.
The Final Principle
Building a chat system that scales is a journey from blissful ignorance to painful enlightenment. If you remember nothing else, remember this:
Design for the state, not the message. Your success hinges on three rules:
- Model the complex state of a conversation first.
- Use specialist tools, not generalist ones.
- Decouple your fast path from your safe path.
¹ Latency measured edge-to-edge, from client publish to the last online client's receive.