How We Clean the Data

Our approach to data quality, in detail.

Air quality data from real-world sensors is messy. Instruments malfunction. Readings spike for no apparent reason. A sensor freezes and reports the same number for hours. A government station switches units without warning.

If you publish this data as-is, people will draw wrong conclusions. If you silently delete the bad parts, people cannot verify what you did. Neither option is acceptable to us.

Our approach is to be thorough and transparent: check everything we can, flag every problem openly, and let people see exactly what we did and why. Here is how.

Architecture

Three layers, one principle

The original data is sacred. We never modify it. Instead, we build clean data on top of it — and keep the link between the two, so anyone can trace any published value back to its raw source.

Layer 0

Raw storage

The exact responses we receive from each API, stored as-is. One table per source. Nothing is changed, nothing is lost. This is our ground truth.

Layer 1

Clean & harmonized

All sources converted to common units, all measurements in a single format. Every row carries quality flags and a link to its raw source. This is where cleaning and validation happen.

Layer 2

Published open data

Only measurements that passed every quality check. This is what you download — Parquet files, CSV station listings, GeoJSON for mapping. Clean, reliable, ready to use.

Quality control

Four layers of checks

Each layer catches different kinds of problems. A measurement must pass all four before it reaches the published dataset.

1

At the door

Before data even enters the system, the database itself enforces basic rules — required fields cannot be empty, duplicates are rejected, coordinates must be valid, only known parameter codes are accepted. Broken or incomplete API responses are caught here.
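To make the idea concrete, here is a minimal sketch of door-level enforcement using an in-memory SQLite table. The schema, table name, and parameter codes are illustrative, not the project's actual DDL; the point is that NOT NULL, CHECK, and UNIQUE constraints reject bad rows before any application code runs.

```python
import sqlite3

# Hypothetical schema -- names and codes are examples, not the real pipeline's.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE measurements (
        station_id  TEXT NOT NULL,                       -- required fields cannot be empty
        parameter   TEXT NOT NULL
                    CHECK (parameter IN ('pm25','pm10','no2','o3')),  -- known codes only
        observed_at TEXT NOT NULL,
        value       REAL NOT NULL,
        lat         REAL NOT NULL CHECK (lat BETWEEN -90 AND 90),     -- valid coordinates
        lon         REAL NOT NULL CHECK (lon BETWEEN -180 AND 180),
        UNIQUE (station_id, parameter, observed_at)      -- duplicates are rejected
    )
""")

conn.execute(
    "INSERT INTO measurements VALUES ('st-1', 'pm25', '2024-01-01T00:00', 12.5, 43.2, 76.9)"
)
try:
    # Same station/parameter/timestamp again -> rejected at the door.
    conn.execute(
        "INSERT INTO measurements VALUES ('st-1', 'pm25', '2024-01-01T00:00', 99.0, 43.2, 76.9)"
    )
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```

A broken API response that is missing a field, repeats a timestamp, or carries an unknown parameter code fails at insert time, so it never reaches the cleaning layer.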

2

During cleaning

As we transform raw data into the harmonized format, every measurement is checked individually: is this value physically possible? Is PM2.5 higher than PM10 (which usually indicates a sensor problem)? Is this concentration negative? Each issue gets a specific flag.
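The per-measurement checks above can be sketched as a small flagging function. The flag names happen to match the validation table later on this page, but the bounds and the function itself are illustrative, not the pipeline's exact code.

```python
def quality_flags(pm25, pm10):
    """Return quality flags for one measurement (illustrative sketch)."""
    flags = []
    if pm25 is not None and pm25 < 0:
        flags.append("negative_value")     # concentrations cannot be negative
    if pm25 is not None and pm25 > 1000:
        flags.append("out_of_range")       # physically implausible value
    if pm25 is not None and pm10 is not None and pm25 > pm10:
        flags.append("pm25_exceeds_pm10")  # PM2.5 is a subset of PM10 by definition
    return flags

print(quality_flags(-3.0, 40.0))   # ['negative_value']
print(quality_flags(55.0, 40.0))   # ['pm25_exceeds_pm10']
print(quality_flags(12.0, 30.0))   # []
```

Each measurement keeps its full list of flags rather than a single pass/fail bit, which is what lets the published data stay auditable.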

3

Deep statistical analysis (PM2.5)

PM2.5 is our most important parameter and the one most prone to sensor problems. We run a dedicated multi-stage cleaning process:

  • Hard cap — values above 1,000 µg/m³ are flagged as invalid
  • Suspect stations — sensors that report the same value more than 80% of the time, or have implausibly high baselines
  • Statistical outliers — using robust Z-scores that adapt to each station's local conditions
  • Spikes — isolated single-point jumps that do not match surrounding readings
  • Stuck sensors — the same value repeated for 6 or more consecutive hours

4

Final validation

Before anything is published, we run two independent validation systems across the entire dataset. If either one finds a problem — even a single failing check — publication is blocked until the issue is resolved.
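The publication gate can be pictured like this. The two validators here are trivial stand-ins (the real systems run the full check suite listed below); what matters is the gating rule: a single failure from either one blocks export.

```python
def validator_a(rows):
    """First validator -- illustrative stand-in checking value ranges."""
    return ["negative_values"] if any(r["value"] < 0 for r in rows) else []

def validator_b(rows):
    """Second, independent validator -- illustrative duplicate check."""
    keys = [(r["station"], r["ts"]) for r in rows]
    return ["duplicate_measurements"] if len(keys) != len(set(keys)) else []

def can_publish(rows):
    """Block publication if EITHER validator reports any failing check."""
    problems = validator_a(rows) + validator_b(rows)
    return (len(problems) == 0, problems)

rows = [{"station": "s1", "ts": 1, "value": 12.0},
        {"station": "s1", "ts": 1, "value": 13.0}]  # duplicate key
print(can_publish(rows))  # (False, ['duplicate_measurements'])
```

Running two independent implementations is deliberate redundancy: a bug in one validator is unlikely to be mirrored exactly in the other.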

The details

Validation checks

These are the specific checks that run before every publication. Errors block export entirely. Warnings are logged and reviewed.

Check                   Severity   What it catches
out_of_range            error      Values outside physical bounds
negative_values         error      Negative concentrations
invalid_floats          error      NaN or Infinity values
orphan_measurements     error      Measurements without a registered station
duplicate_measurements  error      Duplicates that should never exist
pm25_exceeds_pm10       warning    A known AirKaz sensor issue
stuck_sensors           warning    Frozen sensor readings
spikes                  warning    Sudden value jumps
stale_stations          warning    Active stations not reporting for 48+ hours
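As one example from the table, the stale-station warning could be implemented as below. This is a minimal sketch, assuming a dict of last-report timestamps per station; the station IDs are made up, and only the 48-hour threshold comes from the table above.

```python
from datetime import datetime, timedelta

def stale_stations(last_seen, now, threshold=timedelta(hours=48)):
    """Active stations whose most recent report is older than the threshold."""
    return sorted(sid for sid, ts in last_seen.items() if now - ts > threshold)

now = datetime(2024, 6, 10, 12, 0)
last_seen = {
    "station-a": datetime(2024, 6, 10, 9, 0),  # reported 3 hours ago -> fine
    "station-b": datetime(2024, 6, 7, 12, 0),  # silent for 72 hours -> stale
}
print(stale_stations(last_seen, now))  # ['station-b']
```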

We know this level of detail is not for everyone. Most people just want to download the data and trust that it is clean. We hope they can.

But for those who want to look under the hood — researchers, engineers, anyone who has been burned by bad data before — we want everything to be visible. The methodology, the flags, the raw originals. That is the kind of project we want to be.