
Why 1:1 and Group Chats Are Completely Different Engineering Problems

"It's just sending messages back and forth. How hard can it be?"

This is the deceptive assumption that draws countless engineering teams into disaster. On the surface, the components seem identical: a user sends a message, it reaches a server, and it gets delivered to another user. A 1:1 prototype built on that model comes together quickly, and that early success is misleading.

The real challenge isn't sending one message; it's managing the state and context around billions of them. The architecture that works beautifully for a simple 1:1 chat will actively sabotage you when you need to support groups.

The State Explosion: 1:1 vs. N:N

This is the heart of the problem. The data models and state management are fundamentally different beasts.

| Feature     | 1:1 Chat (Simple)                | Group Chat (Complex Beast) |
|-------------|----------------------------------|----------------------------|
| Membership  | Two participants. Static.        | N participants. Dynamic (joins, leaves, kicks). Requires a robust permissions system. |
| Read Status | A single boolean: is_read. Easy. | A nightmare. Each of N members has their own last_read_message_id. You are now tracking N read horizons per conversation. |
| Mentions    | Not applicable.                  | Requires parsing message content and generating notifications for specific users. A whole new query path. |
| Permissions | Not applicable.                  | Who can change the group name? Who can invite new members? Who can mute others? This is a complex ACL system in itself. |

A 1:1 chat is a two-party agreement. A group chat is a chaotic, multi-party negotiation where state is always in flux.
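
To make the "N read horizons" idea concrete, here is a minimal in-memory sketch of group state. The class name GroupConversation, its methods, and the use of plain integers as message IDs are all assumptions made for illustration; the rule that new members start caught up is likewise just one possible policy.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class GroupConversation:
    """Hypothetical model of one group chat: messages plus per-member read horizons."""

    message_ids: list[int] = field(default_factory=list)     # kept in ascending order
    last_read: dict[str, int] = field(default_factory=dict)  # user_id -> last_read_message_id

    def join(self, user_id: str) -> None:
        # Assumption: a new member starts caught up to the newest message.
        self.last_read[user_id] = self.message_ids[-1] if self.message_ids else 0

    def leave(self, user_id: str) -> None:
        # Membership is dynamic, so the horizon disappears with the member.
        self.last_read.pop(user_id, None)

    def post(self, message_id: int) -> None:
        # Assumes IDs arrive in increasing order.
        self.message_ids.append(message_id)

    def mark_read(self, user_id: str, message_id: int) -> None:
        # A horizon only moves forward; stale updates and non-members are ignored.
        if user_id in self.last_read:
            self.last_read[user_id] = max(self.last_read[user_id], message_id)

    def unread_count(self, user_id: str) -> int:
        # Count messages strictly newer than this member's horizon.
        horizon = self.last_read[user_id]
        return len(self.message_ids) - bisect_right(self.message_ids, horizon)
```

In a 1:1 chat the same information collapses to one flag per message; here every member carries an independent position that must survive joins, leaves, and out-of-order updates.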

The Tyranny of the Fan-Out

Delivering a message also transforms completely. In a 1:1 chat, it's one write, one push. In a group with 10,000 members, a single message fans out to as many as 9,999 recipients. That innocent-looking "is typing..." indicator? In a large group, every typing event fans out the same way, and it can trigger an event storm that melts your servers.
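
A rough back-of-the-envelope calculation shows how quickly this grows. The function names below are hypothetical, and the inputs in the usage note (50 people typing at once, each client re-announcing every 3 seconds) are illustrative assumptions, not measurements.

```python
def message_deliveries(members: int) -> int:
    """Pushes triggered by one message: every member except the sender."""
    return max(members - 1, 0)


def typing_events_per_second(members: int, typers: int, refresh_seconds: float) -> float:
    """Delivery rate caused by typing indicators alone.

    Each of `typers` clients re-announces "is typing" once per
    `refresh_seconds`, and each announcement fans out to everyone else.
    """
    return typers * message_deliveries(members) / refresh_seconds
```

Under those assumptions, typing_events_per_second(10_000, 50, 3.0) comes to 166,650 deliveries per second for a single group, before any real message has been sent.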

Anatomy of a Production-Ready Solution

To solve these problems, a monolith won't cut it. You need a decoupled crew of specialists. Let's design for a target of 1 million concurrent users, a peak of 100,000 messages/sec, and a sub-100ms p99 delivery latency.¹

                                  +-------------------+
                                  |   [ API Gateway ] |
                                  +---------+---------+
                                            | (Session Auth)
                                            |
+-------------------------------------------+----------------------------------------------+
| (Real-time / Fast Path)                   |                               (Durable / Safe Path) |
|                                           |                                              |
v                                           v                                              v
+----------------+                      +---------+                                +-------------------+
| [ NATS Bus ]   |                      | [Redis] |                                |  [ Kafka Log ]    |
+-------+--------+                      +---------+                                +---------+---------+
        | (Live Messages, Fan-out)            ^ (Read State, Typing...)                    | (Message Archive)
        |                                     |                                            |
        v                                     |                                            v
+--------------------+                        |                                +--------------------------+
|  [ Online Users ]  |------------------------+                                | [Message Persister Svc]  |
+--------------------+                                                         +-----------+--------------+
                                                                                            | (Batch ingest)
                                                                                            v
                                                                                    +------------------+
                                                                                    | [ScyllaDB/CASS]  |
                                                                                    +------------------+

1. The Hybrid Message Bus (NATS + Kafka)

Separate real-time delivery from durable, offline delivery.

  • For Online Users (Fast Path): Use a lightweight pub/sub system like NATS. When a user sends a message, it's published to a NATS subject for that conversation. Every connected client subscribed to that subject receives it immediately. This is fire-and-forget; its only job is speed.
  • For Offline Users (Safe Path): In parallel, the same message is published to a durable log like Apache Kafka, often with a 7-day log retention policy. This is the system of record. For more advanced use cases, check out tools like NATS JetStream.
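
The two paths can be sketched with in-memory stand-ins. FastBus and DurableLog below are hypothetical classes that only imitate the relevant behaviour of a pub/sub bus and an append-only log; they are not the NATS or Kafka client APIs, and the subject name chat.<channel_id> is an assumption. For simplicity the sketch writes to the two paths one after the other rather than in parallel.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class FastBus:
    """Stand-in for a pub/sub bus: no storage, so delivery is at-most-once."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, subject: str, handler: Handler) -> None:
        self._subscribers[subject].append(handler)

    def publish(self, subject: str, message: dict) -> int:
        # Only currently subscribed handlers see the message; nothing is retained.
        handlers = self._subscribers.get(subject, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


class DurableLog:
    """Stand-in for an append-only log that readers can replay by offset."""

    def __init__(self) -> None:
        self._entries: list[dict] = []

    def append(self, message: dict) -> int:
        self._entries.append(message)
        return len(self._entries) - 1  # offset of the new entry

    def read_from(self, offset: int) -> list[dict]:
        return self._entries[offset:]


def send_message(bus: FastBus, log: DurableLog, channel_id: str, message: dict) -> int:
    """Write to the safe path, then push on the fast path."""
    offset = log.append(message)
    bus.publish(f"chat.{channel_id}", message)
    return offset
```

A client that was offline during send_message gets nothing from the bus, but can catch up later by calling read_from with the last offset it saw.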

Trade-off: Why NATS over Redis Streams? NATS is a specialist scalpel—purpose-built for high-performance, at-most-once messaging. It's simpler and faster for this specific "fire-and-forget" job, leaving Redis free to handle its core responsibility: state.

2. The Wide-Column Message Store (ScyllaDB)

Storing billions of messages in a relational database is a recipe for operational pain. A wide-column store like ScyllaDB or Apache Cassandra is the right tool.

-- A simplified Cassandra/ScyllaDB table schema
CREATE TABLE messages (
    channel_id uuid,
    message_id timeuuid,
    author_id uuid,
    content text,
    PRIMARY KEY ((channel_id), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
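
Because the table is partitioned by channel_id and clustered by message_id in descending order, "the newest N messages before a cursor" becomes a slice of a single partition, roughly a SELECT with WHERE channel_id = ? AND message_id < ? LIMIT ?. The Python model below imitates that access pattern in memory; ChannelPartition is a hypothetical name, and integers stand in for timeuuid values.

```python
from bisect import bisect_left, insort
from typing import Optional


class ChannelPartition:
    """Models one channel_id partition with rows ordered by message_id."""

    def __init__(self) -> None:
        self._ids: list[int] = []        # stored ascending, read newest-first
        self._rows: dict[int, str] = {}  # message_id -> content

    def insert(self, message_id: int, content: str) -> None:
        if message_id not in self._rows:
            insort(self._ids, message_id)
        self._rows[message_id] = content

    def page(self, limit: int, before: Optional[int] = None) -> list[tuple[int, str]]:
        """Return up to `limit` rows with message_id < before, newest first."""
        end = len(self._ids) if before is None else bisect_left(self._ids, before)
        start = max(end - limit, 0)
        return [(m, self._rows[m]) for m in reversed(self._ids[start:end])]
```

A client scrolls back through history by passing the oldest message_id from the previous page as the next `before` cursor.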

Trade-off: Why ScyllaDB over DynamoDB? While DynamoDB is a fantastic managed service, self-hosting ScyllaDB gives you extreme performance, no vendor lock-in, and can be significantly more cost-effective at massive scale. For those migrating, ScyllaDB also offers a DynamoDB-compatible API called Alternator to ease the transition.

3. The Decoupled State Tracker (Redis)

High-volume, low-value state changes will hammer your main database into submission. Offload them to an in-memory store like Redis.

  • Read Receipts: Don't write a new row for every read. Instead, store a "read watermark" in Redis: a single SET operation to read_watermark:{user_id}:{channel_id} with the last message_id.
  • Presence: For "is typing..." events, use Redis keys with a short, 3-second TTL. A client's "typing" event creates a key that expires automatically. It's the digital equivalent of just walking away when you're done—no "stopped typing" event needed.
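
Both patterns reduce to SET, optionally with an expiry. The sketch below uses TinyKV, a hypothetical in-memory stand-in for the small subset of Redis behaviour involved, so it runs without a server. The read_watermark key format comes from the text above; the typing:{channel_id}:{user_id} key format and the helper names are assumptions.

```python
import time
from typing import Callable, Optional


class TinyKV:
    """In-memory stand-in for SET / SET with expiry / GET."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: dict[str, tuple[str, Optional[float]]] = {}

    def set(self, key: str, value: str, ex: Optional[float] = None) -> None:
        # `ex` is a time-to-live in seconds; None means the key never expires.
        expires_at = None if ex is None else self._clock() + ex
        self._data[key] = (value, expires_at)

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # lazily drop expired keys on read
            return None
        return value


def mark_read(kv: TinyKV, user_id: str, channel_id: str, message_id: str) -> None:
    # One overwrite per read event instead of one new row per read.
    kv.set(f"read_watermark:{user_id}:{channel_id}", message_id)


def mark_typing(kv: TinyKV, user_id: str, channel_id: str) -> None:
    # The key vanishes after 3 seconds unless the client refreshes it.
    kv.set(f"typing:{channel_id}:{user_id}", "1", ex=3)


def is_typing(kv: TinyKV, user_id: str, channel_id: str) -> bool:
    return kv.get(f"typing:{channel_id}:{user_id}") is not None
```

Against a real Redis server with the redis-py client, the same two writes would be calls to set(key, value) and set(key, value, ex=3).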

Don't Forget the Gremlins

Once basic delivery works, a new class of problems appears.

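  • Message Edits/Deletes: You can't just DELETE a row; that breaks the immutable log. Instead, you write a "tombstone" event (e.g., a message with a deleted: true flag) to mark it as gone. Edits follow the same pattern: append an edit event that references the original message_id rather than overwriting the row.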
  • Large Files: Never pipe blobs through your real-time bus. Have your backend generate a pre-signed URL, let the client upload directly to S3/Cloud Storage with it, and send only the file link as the message.
  • Search: Your primary database is not a search engine. Ship your message data to a dedicated service like Elasticsearch for indexing and full-text search capabilities.
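
The tombstone idea can be sketched as an append-only list of events that is folded into the view a client sees. The Event type and visible_messages function are hypothetical names; treating a later event with the same message_id as an edit is one possible convention, assumed here for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """One immutable log entry. `deleted=True` marks a tombstone."""

    message_id: int
    content: Optional[str] = None
    deleted: bool = False


def visible_messages(log: list[Event]) -> dict[int, str]:
    """Replay the log in order and return what a client should display."""
    view: dict[int, str] = {}
    for event in log:
        if event.deleted:
            view.pop(event.message_id, None)  # tombstone hides the message
        elif event.content is not None:
            view[event.message_id] = event.content  # new message, or an edit
    return view
```

Nothing is ever removed from the log itself: a delete adds an entry, and the message simply stops appearing in the derived view.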

Locking the Doors

Production systems have rules.

  • Encryption: End-to-End Encryption (E2EE) is the gold standard. At minimum, you need TLS for encryption-in-transit and transparent encryption-at-rest on your databases and file storage.
  • Compliance: You need a strategy for GDPR's "right to be forgotten." This means having a process to scrub user data, which is much easier when you know exactly which services hold PII.

The Final Principle

Building a chat system that scales is a journey from blissful ignorance to painful enlightenment. If you remember nothing else, remember this:

Design for the state, not the message. Your success hinges on three rules:

  1. Model the complex state of a conversation first.
  2. Use specialist tools, not generalist ones.
  3. Decouple your fast path from your safe path.

¹ Latency measured edge-to-edge, from client publish to the last online client's receive.

Kartikey — Chatwoot Test