ClickHouse: Advantages, Challenges, and Pitfalls

Updated
6 min read
ClickHouse is one of those databases that generates excitement after the first benchmark. It's extremely fast, column-oriented, and designed for analytics at scale.

It's also remarkably simple to integrate into an existing infrastructure. You can stream data from Postgres, MongoDB, S3, or virtually any source. That's what makes it so attractive. You reach the point where your Postgres queries begin to struggle, you don't want to rebuild everything, so you introduce ClickHouse and suddenly your dashboard loads in milliseconds.

It's like adding a turbocharger to your reporting system.

ClickHouse is also evolving at an incredible pace. Every month brings new features, bug fixes, and faster query performance.

But with speed comes responsibility. ClickHouse is a powerful system. It'll reward you when you treat it properly, but it'll cause problems if you take shortcuts.


Cloud vs Self-Hosting

Your first major decision is whether to self-host or choose a managed provider.

                 Cloud                            Self-Hosted
Setup time       Minutes                          Days/weeks
Cost at scale    $$$                              $ + engineer time
Backup/HA        Automatic                        DIY
Headaches        A few, but not hosting-related   Many
Good for         Most use cases                   Cost optimization

The cloud approach is straightforward. You get uptime, automatic scaling, and minimal headaches. The trade-off is cost. ClickHouse Cloud, Altinity Cloud, and Tinybird all work well, but the bill can become significant once you start moving large amounts of data. You don't deal with server issues, but you pay for that peace of mind.

In ClickHouse Cloud you also don't need to worry about replication: it's handled automatically, so you don't have to create replicated or distributed tables yourself.

Self-hosting appears simple at first, but it's not.

You set up a single node, everything runs smoothly, and then one day something fails: merges stall, parts get corrupted, or something else goes wrong. And that's just the tip of the iceberg.

To handle real production traffic you'll end up with replicated and distributed tables. You'll need to choose between vertical scaling, horizontal scaling, or both. Then you start worrying about corrupted parts, cluster topology, and backups.

Running ClickHouse yourself works fine for smaller setups. Once you grow, it's a full-time job unless you use something like the Altinity ClickHouse Operator on Kubernetes. That operator makes things manageable. You define clusters in YAML, it handles replication, ZooKeeper (ClickHouse Keeper), and has excellent backup strategies. If you ever plan to self-host long-term, start there.


The Dark Side of Joins

Joins in ClickHouse are not the joins you're accustomed to. They work, but they're not "free."

ClickHouse doesn't have a full query optimizer like Postgres or MySQL, so it doesn't plan your joins intelligently. If you join two large tables, it'll happily try to load the entire right-hand table into memory and fail in the process.

You have to think ahead. Filter first, join later.

A few ways to survive:

  • Use CTEs or sub-queries to narrow down the joined dataset before the join actually happens.
  • Use dictionaries (in-memory lookup tables) for small reference data. They're extremely fast, but they have to fit in memory.
  • Know your sorting keys. ClickHouse relies on them for efficient reads. Poor keys make joins worse.
  • Put the smaller table on the right side of the join. ClickHouse builds the right-hand side in memory, so the smaller it is, the better.

You'll know when you have a bad join: the query either takes a long time or fails outright.
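The "filter first, join later" idea can be sketched like this (the `events` and `users` tables and their columns are made up for illustration):

```sql
-- Risky: ClickHouse builds the whole right-hand table (users) in
-- memory before it ever applies the WHERE clause.
SELECT e.event_name, u.country
FROM events AS e
JOIN users AS u ON u.id = e.user_id
WHERE e.created_at >= now() - INTERVAL 7 DAY;

-- Better: narrow both sides first, then join the reduced sets.
SELECT e.event_name, u.country
FROM
(
    SELECT event_name, user_id
    FROM events
    WHERE created_at >= now() - INTERVAL 7 DAY
) AS e
JOIN
(
    SELECT id, country
    FROM users
    WHERE id IN (
        SELECT DISTINCT user_id
        FROM events
        WHERE created_at >= now() - INTERVAL 7 DAY
    )
) AS u ON u.id = e.user_id;
```

The second query does more scanning up front, but the join itself only has to hold the users that actually appeared in the last week.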


Updates, Deletes, and the Reality of Immutability

ClickHouse was never designed for frequent updates or deletes. It's a write-once, append-forever kind of database. You can't just run UPDATE users SET ... like in Postgres.

To their credit, the ClickHouse team has made significant progress here. They've added lightweight deletes and updates, and there are new table engines like ReplacingMergeTree and VersionedCollapsingMergeTree that can simulate mutable data. But it still requires extra consideration.

You need to design your tables knowing that changing data later is more difficult. That's fine for analytics workloads, but painful if you expect relational behavior.
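As a sketch of what "simulating mutable data" looks like in practice, here's a hypothetical table using ReplacingMergeTree plus a lightweight delete (table and column names are invented for the example):

```sql
-- Keeps only the latest row per user_id; updated_at is the version
-- column that decides which duplicate survives a merge.
CREATE TABLE users_latest
(
    user_id    UInt64,
    email      String,
    updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;

-- "Updating" a user means inserting a new row with a newer version.
INSERT INTO users_latest VALUES (42, 'new@example.com', now());

-- Deduplication only happens during background merges, so reads must
-- either use FINAL (slower) or deduplicate explicitly.
SELECT * FROM users_latest FINAL WHERE user_id = 42;

-- Lightweight delete: marks rows as deleted, cleaned up on merge.
DELETE FROM users_latest WHERE user_id = 42;
```

Note the FINAL keyword: until a merge runs, multiple versions of the same row coexist, and it's on you to deduplicate at read time.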

These kinds of issues still come up today. Hopefully lightweight updates, currently in beta, will make life easier.


Inserting Data the Right Way

Here's the biggest beginner trap.

ClickHouse loves large inserts. It hates small ones.

Every insert triggers background merges, index updates, compression, and part creation. Do that one row at a time and you'll overwhelm it. Batch your inserts into chunks, ideally thousands of rows at a time. You'll immediately see CPU drop and throughput increase dramatically.
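Batching can happen in your application or on the server side. A rough sketch (the `events` columns are assumed for the example):

```sql
-- One insert, many rows: one part is created instead of thousands.
INSERT INTO events (project_id, event_name, created_at) VALUES
    (1, 'page_view', '2024-01-01 00:00:00'),
    (1, 'click',     '2024-01-01 00:00:01'),
    (2, 'page_view', '2024-01-01 00:00:02');

-- Alternatively, let the server batch for you: with async_insert,
-- small inserts are buffered and flushed together as one part.
INSERT INTO events (project_id, event_name, created_at)
SETTINGS async_insert = 1, wait_for_async_insert = 1
VALUES (1, 'page_view', now());
```

Client-side batching gives you more control over chunk sizes; async_insert is handy when many small producers write directly to ClickHouse.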

If you're ingesting data continuously, throw it into a queue and batch it there. That's what we do at OpenPanel.dev. It smooths out traffic spikes and keeps our ingestion fast and predictable.


Replication and Sharding

This isn't a bad thing about ClickHouse. In fact, it's one of its best features.
But I still want to cover a few parts that confused me when I first started using it.

There are three kinds of tables you'll deal with when setting up replication or sharding:

  • Your existing table (usually MergeTree or something similar)
  • A replicated table (ReplicatedMergeTree)
  • A distributed table (Distributed)

Each of these plays a different role in your cluster, and it's worth understanding them before you begin.


First, decide how many replicas you want. Most setups use three replicas for high availability. To get replication working, you'll replace your existing table with a replicated one.

You can do that by creating a new table and swapping MergeTree for ReplicatedMergeTree in the engine section.

CREATE TABLE events_replicated ON CLUSTER '{cluster}' (
  ...
)
ENGINE = ReplicatedMergeTree(
  '/clickhouse/{installation}/{cluster}/tables/{shard}/openpanel/v1/{table}',
  '{replica}'
)
PARTITION BY toYYYYMM(created_at)
ORDER BY (created_at)

Once that's done, any data written to one node will be replicated to the others. ZooKeeper or ClickHouse Keeper handles the synchronization automatically.

If you want to move the data from your existing table into the replicated one, use an INSERT ... SELECT:

INSERT INTO events_replicated SELECT * FROM events;

Now let's look at where sharding and Distributed tables come in.
Sharding is how you scale horizontally by splitting data into smaller chunks and spreading them across nodes. That said, it's usually better to scale vertically first, because ClickHouse handles vertical scaling surprisingly well.

If you decide to shard, you'll need to create a distributed table. A distributed table knows where your data lives and redirects queries to the right node.

When creating one, you define how data should be split across nodes. In the example below, the data is sharded using cityHash64(project_id), which spreads rows evenly based on the project_id.

CREATE TABLE events_distributed ON CLUSTER '{cluster}' AS events_replicated
ENGINE = Distributed(
  '{cluster}',
  currentDatabase(),
  events_replicated,
  cityHash64(project_id)
)

Now you can query data from any node, and ClickHouse will automatically route the request to where the data actually sits.
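For example, both reads and writes can go through the distributed table (the query below assumes the `events_distributed` schema defined above):

```sql
-- Reads fan out to every shard and are merged on the initiating node.
SELECT project_id, count() AS events_count
FROM events_distributed
WHERE created_at >= today() - 7
GROUP BY project_id;

-- Writes go through the same table; cityHash64(project_id) decides
-- which shard each row lands on.
INSERT INTO events_distributed (project_id, event_name, created_at)
VALUES (1, 'page_view', now());
```

Because the sharding key is project_id, all rows for a given project end up on the same shard, which keeps per-project queries local to one node.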

If you want to dig deeper, check out the official docs on Distributed tables and Replication.


So Why Stick With It?

Because when it works, it's magic.

At OpenPanel we hit all of these issues: slow inserts, bad joins, tricky replication. And we still use ClickHouse every single day. Once you set it up correctly, nothing else compares. It's incredibly fast and scales far beyond what most relational databases can handle.

You just have to respect it. Treat it like a Ferrari, not a Corolla.

If you want me to go deeper into how we deploy and manage our own cluster on Kubernetes using the Altinity operator, let me know in the comments. I can show exactly how we keep it stable and cost-efficient.
