The ClickHouse Memory Win That Didn't Show Up On the Dashboard

2026-05-27 - 11 min read
Daniel Young
Daniel Young
Founder, DRYCodeWorks

A ClickHouse materialized view cutover at Frost dropped per-fire memory from 140 MiB to 7.4 MiB. The cluster's resident memory didn't move a byte. Here's the SQL pattern that made it work, and the more important lesson about which memory metric actually predicted our OOM exposure.

We applied a ClickHouse migration this week that cut per-insert materialized view memory by 95%. Per-fire allocations on the worst-affected MV dropped from ~140 MiB combined across two environments to ~7.4 MiB — a number we had measured, modeled, and rehearsed against a shadow A/B harness for a full week.

Then we ran SELECT max(value) FROM system.asynchronous_metric_log WHERE metric = 'MemoryResident' across the cutover window. The cluster's resident memory was 13.0 GiB before the migration. It was 13.0 GiB after.

Both numbers are correct. They answer different questions.

The migration's actual win — the thing that made it worth doing — was a ~15× reduction in transient allocation throughput, from 35.7 GiB/min to 2.4 GiB/min at peak, measured in system.query_views_log. That's the number that maps to our historical OOM exposure under concurrent insert bursts. MemoryResident was never going to move; the per-fire allocations had always been transient, allocated and freed sub-second, invisible at the 1-second async-metric sampling resolution most engineers reach for first.

This post is about the cutover — what we did to the SQL, and why we did it. It is also about the measurement gap we tripped over on the way, which is probably the more reusable lesson.

The Problem: A Write-Time ASOF JOIN That Multiplied Under Concurrency

Frost's pipeline writes weather telemetry through a transmissions table backed by a _mv_transmissions_forecast_joiner materialized view. Every insert fires the MV, which enriches the row with forecast data by joining against a deduped best_known_forecasts table.

The original join was an ASOF LEFT JOIN: for each transmission, find the forecast bucket whose timestamp is the latest preceding the transmission. Elegant in SQL. Expensive in memory.

ASOF JOIN: Match Each Transmission to Closest Preceding Forecast14:0015:0016:0017:00ForecastsF1F2F3F4TransmissionsT114:37T215:22T316:45T1 + F1 (37min)T2 + F2 (22min)T3 + F3 (45min)

ClickHouse implements ASOF JOIN by building an in-memory hash table from the right-hand side per query fire, with rows sorted by the asof column inside each bucket for the inequality match. On Frost's nonprod cluster, we measured this per-fire structure at 108–207 MiB depending on which environment was firing — call it ~140 MiB combined across the two nonprod MVs at typical load.

The MV builds that structure on every single insert. Run two inserts concurrently and you have two copies. Run twenty and you have twenty. Per-fire memory that's modest in isolation becomes a memory multiplier under concurrency, and concurrent insert bursts are exactly what a weather pipeline produces when a storm front rolls a fleet of devices through their reporting windows simultaneously.

That memory multiplier was the structural driver behind a string of OOM cascades we'd been chasing on the nonprod cluster — events where peak concurrent inserts punched the resident memory ceiling and ClickHouse killed in-flight queries to recover. The fix had to address the per-fire build, not the concurrency.

The Fix: A range_hashed Dictionary, One Resident Copy

Instead of joining against best_known_forecasts per fire, we built a range_hashed ClickHouse dictionary — forecast_dict — that holds the relevant forecast window resident in memory, once, and serves it via dictGet.

range_hashed dictionary — a ClickHouse dictionary type that indexes rows by a key plus a range of values (start and end). dictGet returns the row whose range contains the lookup value. One resident copy, lookups are O(1) on the key + O(log N) on the range.

The dictionary is populated by a source query against best_known_forecasts, keyed on DeviceID, with each row's range spanning one forecast bucket:

sql
-- Source query (abbreviated)
SELECT
DeviceID,
toDateTime(ForecastDateTimeUTC)                         AS range_start,
toDateTime(leadInFrame(ForecastDateTimeUTC) OVER (
  PARTITION BY DeviceID
  ORDER BY ForecastDateTimeUTC
)) - 1                                                  AS range_end,
argMax(temp_f,           LatestCaptureDateTimeUTC)      AS temp_f,
argMax(precipitation_mm, LatestCaptureDateTimeUTC)      AS precipitation_mm,
-- ... 20 more forecast fields ...
FROM best_known_forecasts
WHERE ForecastDateTimeUTC BETWEEN now() - INTERVAL 6 HOUR
                            AND now() + INTERVAL 1 HOUR
AND ForecastDateTimeUTC >= '2022-01-01'
GROUP BY DeviceID, ForecastDateTimeUTC

Three details matter here, and each took an iteration to get right.

argMax over LatestCaptureDateTimeUTC collapses forecast versions. Forecasts get republished — a 14:00 model run gets refined by a 15:00 run. The original ASOF join read best_known_forecasts without FINAL, so it picked an arbitrary un-merged version. The dictionary picks the latest by LatestCaptureDateTimeUTC. This turned out to fix a latent stale-data bug as a side effect: ~9% of pre-cutover transmissions were carrying stale forecast values that the dictionary now corrects. Independent ticket, accidental win.

leadInFrame makes adjacent buckets tile without overlap. range_hashed bounds are inclusive on both sides. Without the -1 second, two adjacent 5-minute buckets would both claim the same boundary instant. With it, they tile cleanly.

The window is [now() - 6h, now() + 1h]. The +1h future headroom is load-bearing — and it took a freeze incident to figure out why.

The INVALIDATE_QUERY trap. The dictionary's first iteration bounded the window at <= now() and used INVALIDATE_QUERY to skip no-op reloads. Both decisions look correct in isolation. Together they froze the dictionary for ~50 minutes at the top of each hour, because (1) the window's content depended on wall-clock advancement that the INVALIDATE_QUERY couldn't observe, and (2) max(LatestCaptureDateTimeUTC) only meaningfully advanced hourly. The fix was to extend the window to now() + 1h so the dictionary always holds the soon-to-be-current bucket, and to drop the INVALIDATE_QUERY clause so reloads run unconditionally on the 15–30s LIFETIME. If your dictionary's SOURCE query contains now(), an INVALIDATE_QUERY is structurally incompatible with it — it's a SQL query over data tables and physically cannot observe wall-clock advancing.

The MV's new body is, end to end, two lines of dictionary access:

sql
-- Inside _mv_transmissions_forecast_joiner, replacing the ASOF JOIN
dictGet(
'forecast_dict',
('temp_f', 'precipitation_mm', /* ...22 fields... */),
DeviceID,
CaptureDateTimeUTC
) AS forecast

The actual cutover was a single ALTER TABLE _mv_transmissions_forecast_joiner MODIFY QUERY — atomic, ~210 ms per environment, indistinguishable from a normal insert from outside. Per-fire MV memory dropped from ~67.5 MiB average to ~7.4 MiB within the same second the migration applied. Bucket selection — which forecast row matches each transmission — was identical to the old joiner. Zero insert errors. Zero MV exceptions.

That should have been the end of the story.

The Win Wasn't In Resident Memory

We had advertised the cutover internally as a memory reduction. The shadow A/B harness had measured per-fire MV memory at ~140 MiB combined pre-cutover and ~7.4 MiB post — a 95% drop, the headline number we'd been quoting for a week.

After applying, the first place we looked was system.asynchronous_metric_log for MemoryResident — process RSS, sampled per-second by ClickHouse's async-metric thread. We aggregated it as a 5-minute windowed max across replicas to smooth the noise. The thing every CH operator reaches for when they want to know "did the cluster get lighter."

It was flat. 13.0 GiB before the cutover, 13.0 GiB after. A 95% reduction in per-fire memory had produced a 0% change in the cluster's resident memory.

MemoryResident — process RSS as sampled by ClickHouse's async-metric thread on a 1-second interval. The number that answers "how much memory is the cluster currently holding?"

query_views_log.peak_memory_usage — the per-MV-fire peak memory recorded for each materialized view execution. Aggregate this by minute and you get transient allocation throughput.

The two metrics observe different things. MemoryResident samples a long-running average — page cache, mark cache, jemalloc working set, query cache. On Frost's nonprod cluster, that floor is ~12.6 GiB and is structural. It barely moves with insert load.

query_views_log.peak_memory_usage captures per-fire peaks. Sum it across all MV fires in a minute and you get something closer to "how much memory was this cluster momentarily allocating and freeing." That number is what actually predicts OOM exposure under concurrent bursts — concurrent allocations stack, the structural resident floor doesn't.

Here's what those two metrics actually showed across the cutover:

WindowPre-dev cutover (both MVs on ASOF)Dev-only on dictBoth envs on dict
peak_memory_usage sum per minute (avg)20.24 GiB9.66 GiB1.91 GiB
peak_memory_usage sum per minute (peak)35.71 GiB19.94 GiB2.43 GiB
MemoryResident per replica13.0 GiB13.1 GiB13.0 GiB

The cluster's resident memory was never the right place to look. The per-fire allocations had always been transient — allocated and freed within the query's lifetime, which is sub-second on a write-time MV. They never lived long enough to show up at 1-second sampling resolution, and they never accumulated into the resident floor.

But they were real, and they stacked. A 35.7 GiB/min sustained peak transient throughput on a cluster with (at the time) a 16 GiB cgroup ceiling is exactly the configuration where a concurrent insert burst converts a routine workload into an OOM cascade. The cutover dropped that peak to 2.4 GiB/min — a 15× reduction in burst exposure that maps cleanly to the OOM history we'd been fighting.

The reusable framing: when you profile memory for a write-time MV in ClickHouse, look at both layers.

  • MemoryResident answers "is the cluster bloated?" — relevant for steady-state sizing, page cache pressure, slow leaks.
  • query_views_log.peak_memory_usage aggregated per minute answers "is the cluster being momentarily punched by concurrent allocations?" — relevant for burst headroom and OOM exposure.
  • A write-time MV optimization usually changes the second number, not the first. The instinct to reach for MemoryResident first is the wrong instinct.

Two Validation Gotchas Worth Stealing

Before we declare a measurement framing a generalizable lesson, here are two more concrete things that bit us during verification. Both are reusable.

CreatedDateTimeUTC is device-reported, not pipeline-stamped. Our post-cutover parity query filtered rows by WHERE CreatedDateTimeUTC >= '<cutover_time>'. That looked obviously correct — until we saw 81.6% drift on the first verification run, where the shadow harness had predicted 0.7%.

The cause: a device (DeviceID 3928, for the record) was sending CreatedDateTimeUTC = 23:52:23 at wall-clock 15:00. Future-dated timestamps from a clock-skewed device. Those rows had been physically inserted hours earlier — pre-cutover, by the old ASOF MV — but their device-reported timestamp passed our "post-cutover" filter. The verification query was comparing post-cutover dict reads against pre-cutover ASOF reads and calling them drift.

The fix was a one-liner — AND CreatedDateTimeUTC <= now() + INTERVAL 1 MINUTE — but the gotcha generalizes. Whenever you scope a verification query by a timestamp that's not pipeline-stamped, you're trusting that the upstream system is wall-clock honest. IoT fleets, in particular, are not.

MV-vs-MV parity becomes tautological post-cutover. During the shadow rollout, we ran a live MV (ASOF) alongside a shadow MV (dictionary) and compared their outputs row-by-row. Different SELECT bodies, same inserts. Meaningful test.

After the cutover, both MVs write identical values because they're now the same query. Continuing to compare them produces 0% drift, which is not a verification, just an identity. The gate has to switch from "live vs shadow" to "materialized values vs an independent oracle" — for us, that's re-resolving dictGet at query time and comparing it to the values the MV wrote at insert time. Different code path, different evaluation moment, real signal.

If you're rolling out a write-time enrichment behind a shadow A/B harness, plan the post-cutover verification before you cut over. The harness you built for the rollout is no longer useful the instant the cutover lands.

What Generalizes

The technical pattern — replacing a write-time ASOF JOIN with a range_hashed dictionary in a high-volume MV — generalizes to any pipeline where you're enriching inserts via a join against a slowly-changing reference table. ClickHouse dictionaries are designed for exactly this shape: one resident copy, O(1) lookup, content that's a function of source data plus (carefully) wall-clock window. If your MV's hot path is a join, it's a candidate.

The measurement framing generalizes further. Any system where short-lived high-allocation queries run on a low-ceiling instance — ClickHouse Cloud, an undersized container, a 16 GiB cgroup someone forgot to resize when the cluster grew — has the same gap between resident memory and transient allocation throughput. The resident view is what monitoring shows you by default. The transient view is where the OOM exposure actually lives.

We had been measuring the wrong layer. The migration we shipped was correct on its own terms — a 95% per-fire memory reduction was real, the SQL pattern is the right one, the cutover was clean. But the more useful artifact, the thing we'll carry forward into the next CH engagement, is the instinct to look at both MemoryResident and query_views_log.peak_memory_usage together before declaring a memory problem understood.

The dashboard answers the question you ask it. Ask the right question first.