• Data & Analytics
  • Digital Engineering
  • Data, AI & Analytics

Why Your “Read Replica” Trick Fails for Writes — And How Cassandra Fixes It

Published On: 10 June 2026.By .
Engineering  —  System Design  —  Databases  —  10 June 2026

You are in a system design interview. "Our app is read-heavy, 95% reads, 5% writes. How do you scale the database?" Easy. One primary, a fleet of read replicas, done. Then the interviewer flips it: "Now make it write-heavy. Two million IoT events per second. Now what?" Suddenly that beautiful diagram is useless. Here is why, and here is how Apache Cassandra fixes it.

System Design Apache Cassandra Write-Heavy Systems Distributed Databases NoSQL
Topic
Scaling write-heavy databases
Database
Apache Cassandra (NoSQL)
Architecture
Masterless, peer-to-peer ring
Data placement
Consistent hashing (Murmur3)
Storage engine
LSM-tree (commit log + memtable + SSTable)
Consistency
Tunable (ONE / QUORUM / ALL)
Scaling
Near-linear, add nodes
Used by
Netflix, Apple, Instagram
01 - The Setup

Let's Start With a Question

The interviewer says your app is read-heavy: 95% reads, 5% writes. How do you scale the database? Easy. You draw the classic picture: one primary node that takes all the writes, and a fleet of read replicas that copy data from the primary and serve all the reads. Need more read throughput? Add another replica. Done.

This pattern, primary-replica replication, powers a huge chunk of the internet, and it works beautifully.

Primary-replica architecture diagram: all writes funnel through one primary node while reads fan out to read replicas
Primary-replica architecture: all writes funnel through one primary node while reads fan out to read replicas

Then the interviewer leans forward: "Cool. Now flip it. The system is write-heavy. Think IoT sensors firing 2 million events per second, or a chat app, or clickstream analytics. Now what?" And suddenly your beautiful diagram is useless.

02 - The Problem

Why the Read-Replica Pattern Collapses Under Writes

In primary-replica setups, every single write must go through one node: the primary. Read replicas do not help at all, because they only copy writes; they cannot accept them. So your write throughput is capped at whatever one machine can handle.

You can fight this in three ways, and all three hurt.

Vertical scaling
Buy a bigger machine. Works until it does not. There is a physical and financial ceiling, and you still have a single point of failure.
Manual sharding
Split data across primaries (users A-M on DB1, N-Z on DB2). Now you own routing logic, rebalancing hot shards, cross-shard queries, and 3 AM pages when shard 2 dies.
Async write queues
Buffer writes in Kafka and drain slowly. Helps with bursts, but the database is still the bottleneck. You have just moved the queue.

The root problem is architectural: a single write master is a funnel. No amount of replicas widens a funnel.

⏸ Pause and think What if there was no primary at all? What if every node could accept writes? That is exactly Cassandra's bet.
03 - Big Idea #1

Kill the Master

Cassandra uses a masterless, peer-to-peer architecture. Every node in the cluster is equal. There is no primary, no replica hierarchy, no leader election drama. Any node can accept both reads and writes, which means there is no master to become a single point of failure or a throughput funnel.

The nodes arrange themselves logically in a ring. A client can connect to any node, and that node becomes the coordinator for the request: a temporary proxy that routes the write to the right place and replies to the client.

Cassandra masterless ring of six equal nodes; a client can connect to any node, which becomes the coordinator
Cassandra masterless ring of six equal nodes; a client connects to any node, which becomes the coordinator

The immediate consequence is glorious:

  • 10 nodes means 10 machines accepting writes in parallel.
  • Need more write throughput? Add a node. Throughput scales near-linearly.
  • A node dies? The ring keeps serving traffic. No failover ceremony.

But wait: if any node can take any write, how does data not turn into chaos? How does the cluster know where a piece of data should live?

04 - Big Idea #2

Consistent Hashing Decides Where Data Lives

Every row in Cassandra has a partition key (say, sensor_id or user_id). When a write arrives, the coordinator runs the partition key through a hash function (the default is Murmur3), which produces a token: a number on a giant circle of possible values.

Each node in the ring owns a range of tokens. Whichever node owns the range your token falls into, that is the first home for your data.

Consistent hashing token ring: sensor_42 hashes to token 60, which falls in Node C's range
Consistent hashing token ring: sensor_42 hashes to token 60, which falls in Node C's range

Think of it like a roundabout with houses on it. The hash of your key tells you the address on the roundabout, and you simply walk clockwise to the first house. That is where your data lives.

Why is this clever?

  • No lookup table, no router service. Every node can compute the hash itself and instantly knows where any data belongs. Routing is math, not metadata.
  • Adding a node only reshuffles a slice of data, not everything. That is the "consistent" in consistent hashing.
  • Hot-spotting is minimized, because a good hash function sprays keys uniformly around the ring.

This is the answer to "who handles the write?" The hash decides, automatically, with zero central coordination.

05 - Big Idea #3

Replication Without a Primary

"But wait," you say, "if data for sensor_42 lives on Node 3, and Node 3 dies, didn't we just lose data?" Nope, because Cassandra never stores data on just one node.

You configure a Replication Factor (RF), typically 3. The first replica goes to the token owner, and the next replicas go to the following nodes walking clockwise around the ring.

The crucial difference from primary-replica systems: all replicas are equal. There is no "main copy" and "backup copies." The coordinator sends the write to all replica nodes simultaneously, in parallel, not in a chain.

So when is a write "successful"? You decide.

This is one of Cassandra's most elegant features: tunable consistency. With RF=3, you choose how many replicas must acknowledge before the client gets a success.

Consistency LevelWaits forYou get
ONE1 of 3 replicasBlazing-fast writes, maximum availability
QUORUM2 of 3 replicasStrong-enough consistency, still fast
ALL3 of 3 replicasStrongest consistency, lowest availability

Write-heavy systems typically run ONE or QUORUM on writes. The remaining replicas still receive the write; the coordinator just does not wait for them.

And if a replica node is down?

The coordinator stores a hint, a small "you missed this write" note, and replays it to the node when it comes back online. This mechanism is called hinted handoff, and it is why a dead node does not block your writes.

⏸ Pause and think Notice the trade we just made. We gained massive write availability, but two replicas might briefly disagree about a value. Cassandra embraces this: it is called eventual consistency, and for sensor data, likes, messages, and logs, it is almost always the right trade.
06 - Big Idea #4

Make Each Individual Write Stupidly Fast

Distributing writes across nodes solves the cluster-level bottleneck. But Cassandra also attacks the disk-level bottleneck, and this part is beautiful.

Traditional databases (B-tree based) often must read before they write: find the page where the row lives, possibly rewrite it in place. That means random disk I/O, the slowest thing you can ask a disk to do.

Cassandra's storage engine (an LSM-tree design) refuses to play that game. When a write lands on a node, exactly two things happen:

  • Append to the Commit Log — a write-ahead log on disk. This is a pure sequential append, the fastest possible disk operation. Once it is here, the write is durable; even if the node crashes, it can replay the log.
  • Update the Memtable — an in-memory sorted structure. This makes the data instantly readable.

That is it. The client gets an acknowledgment. One sequential disk write plus one in-memory write. No reads. No seeks. No locks on existing data.

Cassandra write path: a write hits the commit log and memtable simultaneously, then later flushes to an immutable SSTable
Cassandra write path: a write hits the commit log and memtable, then later flushes to an immutable SSTable

Later, in the background, when a memtable grows past a threshold, it is flushed to disk as an SSTable (Sorted String Table): an immutable, sorted file. Updates never modify old SSTables; they just write new data with a newer timestamp. Even deletes are writes (a marker called a tombstone). A background process called compaction periodically merges SSTables and discards stale data.

This is why Cassandra's docs call the write path one of its key strengths: the engine is literally designed so that the hot path of a write touches the disk only in the cheapest way possible.
07 - End to End

Putting It All Together: The Life of One Write

Let's trace a single event, sensor_42 reporting a temperature, through a 6-node cluster with RF=3 and consistency QUORUM.

  • The client's driver connects to any node, say Node 1. It becomes the coordinator.
  • Node 1 hashes sensor_42 and the token lands in Node 3's range. Replicas: Nodes 3, 4, 5.
  • Node 1 fires the write to Nodes 3, 4, and 5 in parallel.
  • Each replica appends to its commit log and updates its memtable, microseconds of work.
  • The moment 2 of 3 replicas acknowledge (QUORUM), Node 1 tells the client: done.
  • If Node 5 was down, Node 1 keeps a hint and delivers it later.

Meanwhile, the next write might hash to Nodes 6, 1, 2: a completely different set of machines. Multiply this by millions of writes per second, and you see the magic: the load spreads itself across the entire cluster, automatically.

08 - The Honest Section

What Does This Cost You?

No free lunches in distributed systems. Cassandra optimizes writes by making other things harder.

⚠ Reads are more work A read may need to check the memtable and several SSTables, then merge results by timestamp. Cassandra mitigates this with bloom filters, caches, and compaction, but reads are inherently pricier than writes here. (Notice the irony: the exact opposite of the primary/read-replica world we started with.)
⚠ Eventual consistency is a mindset shift If you need strict ACID transactions across rows, like banking ledgers or inventory counts, Cassandra is the wrong hammer.
⚠ Query-first data modeling No joins, no ad-hoc queries on random columns. You design tables around your queries, often duplicating data. Denormalization is not a sin in Cassandra; it is the lifestyle.
⚠ Compaction needs care and feeding Background merging consumes I/O and CPU; tuning it is part of operating Cassandra at scale.
09 - The Decision

So, When Do You Reach for Cassandra?

Great fit
  • IoT and sensor telemetry
  • Time-series data
  • Messaging and chat history
  • Activity feeds
  • Clickstream and event logging
  • Fraud-detection event stores
  • Anything append-heavy at scale with high availability needs
Poor fit
  • Strongly transactional systems
  • Heavy ad-hoc analytics with joins and aggregations
  • Small datasets that fit comfortably on one Postgres box
  • When you would be over-engineering, seriously

This is why Netflix, Apple, and Instagram run some of the largest Cassandra clusters on the planet.

10 - Cheat Sheet

The One-Table Summary

Problem in write-heavy systemsCassandra's answer
Single write master is a bottleneckMasterless ring: every node accepts writes
Who stores which data?Consistent hashing on the partition key
Node failures losing dataReplication Factor plus all-equal replicas
Latency vs safety trade-offTunable consistency (ONE / QUORUM / ALL)
Replica down during a writeHinted handoff
Slow random disk I/O per writeCommit log + memtable + SSTables (LSM-tree)
Need more throughputAdd nodes: near-linear scaling

Read-heavy scaling is about copying data outward from one writer. Write-heavy scaling is about removing the single writer entirely.

Cassandra does it with a masterless ring, hash-based data placement, parallel replication with tunable consistency, and a storage engine where every write is a cheap sequential append. That combination is why "add a node" is genuinely the answer to "we need more write throughput."

11 - Frequently Asked Questions

Cassandra and Write-Heavy Systems — FAQ

What is Apache Cassandra?
Apache Cassandra is an open-source, distributed NoSQL database designed for write-heavy workloads and high availability. It uses a masterless, peer-to-peer architecture where every node can accept reads and writes, so throughput scales near-linearly as you add nodes. Netflix, Apple, and Instagram run some of the largest Cassandra clusters in the world.
Why does primary-replica replication fail for write-heavy workloads?
In a primary-replica setup, every write must go through a single primary node. Read replicas only copy writes, they cannot accept them, so total write throughput is capped at what one machine can handle. Adding replicas widens read capacity but never write capacity, because a single write master is a funnel that no number of replicas can widen.
What is a masterless architecture in Cassandra?
Masterless means there is no primary node and no replica hierarchy. Every node in a Cassandra cluster is equal and can accept both reads and writes. Nodes arrange themselves in a logical ring, and any node a client connects to becomes the coordinator for that request. With 10 nodes, all 10 accept writes in parallel, and a single node failure does not stop the cluster.
How does Cassandra decide where data is stored?
Cassandra uses consistent hashing on the partition key. The coordinator hashes the partition key (using Murmur3 by default) to produce a token, a number on a ring of possible values. Each node owns a range of tokens, and the node owning the range your token falls into is the first home for that data. Routing is pure math, so no lookup table or router service is needed.
What is Replication Factor in Cassandra?
Replication Factor (RF) is how many copies of each piece of data Cassandra stores, typically 3. The first replica goes to the token owner and the next replicas go to the following nodes clockwise around the ring. All replicas are equal, there is no main copy and backup copies, and the coordinator writes to all replicas in parallel rather than in a chain.
What is tunable consistency (ONE, QUORUM, ALL)?
Tunable consistency lets you choose how many replicas must acknowledge a write before the client gets a success. With RF=3: ONE waits for 1 replica (fastest, most available), QUORUM waits for 2 of 3 (strong-enough consistency, still fast), and ALL waits for all 3 (strongest consistency, lowest availability). Write-heavy systems typically use ONE or QUORUM.
What is hinted handoff?
If a replica node is down during a write, the coordinator stores a hint, a small note recording the missed write, and replays it to that node when it comes back online. Hinted handoff is why a dead node does not block writes in Cassandra.
How does the Cassandra write path work?
Cassandra uses an LSM-tree storage engine. When a write lands on a node, it appends to the commit log (a sequential write-ahead log on disk, the fastest disk operation) and updates the memtable (an in-memory sorted structure). The client is acknowledged immediately, with no reads or random seeks. Later, the memtable flushes to disk as an immutable SSTable, and a background compaction process merges SSTables and discards stale data.
What is eventual consistency in Cassandra?
Eventual consistency means replicas may briefly disagree about a value after a write, but converge to the same value over time. Cassandra trades strict immediate consistency for massive write availability. For sensor data, likes, messages, and logs this is almost always the right trade, but for banking ledgers or inventory counts that need strict ACID transactions, Cassandra is the wrong choice.
When should you use Cassandra instead of a relational database?
Use Cassandra for write-heavy, append-heavy workloads at scale with high availability needs: IoT and sensor telemetry, time-series data, messaging and chat history, activity feeds, clickstream and event logging, and fraud-detection event stores. Avoid it for strongly transactional systems, heavy ad-hoc analytics with joins and aggregations, and small datasets that fit comfortably on a single PostgreSQL instance.
Which companies use Apache Cassandra?
Netflix, Apple, and Instagram run some of the largest Cassandra clusters in the world. It is widely adopted for write-heavy and high-availability use cases such as time-series data, messaging, activity feeds, and event logging.
What is a coordinator node in Cassandra?
The coordinator is whichever node a client connects to for a given request. It acts as a temporary proxy: it hashes the partition key to find which nodes own the data, forwards the read or write to those replica nodes, waits for the required number of acknowledgments based on the consistency level, and replies to the client. Any node can act as coordinator for any request.

Building Systems That Scale?

Write-heavy pipelines, distributed databases, and high-throughput backends are everyday work at Auriga IT. From NHAI processing 20M+ daily transactions to data migrations with zero data loss, scale is our home turf.

Sushil Kumar Sadhnani, Backend Developer at Auriga IT
Sushil Kumar Sadhnani
Backend Developer — Auriga IT
Sushil Kumar Sadhnani is a Backend Developer at Auriga IT specializing in Django and Django REST Framework (DRF). He focuses on building scalable APIs and high-performance backend systems, with expertise in database management and backend optimization.

Related content

Stay Close to What We’re Building

Get insights on product engineering, AI, and real-world technology decisions shaping modern businesses.

Go to Top