How We Clean the Data
Our approach to data quality, in detail.
Air quality data from real-world sensors is messy. Instruments malfunction. Readings spike for no apparent reason. A sensor freezes and reports the same number for hours. A government station switches units without warning.
If you publish this data as-is, people will draw wrong conclusions. If you silently delete the bad parts, people cannot verify what you did. Neither option is acceptable to us.
Our approach is to be thorough and transparent: check everything we can, flag every problem openly, and let people see exactly what we did and why. Here is how.
Architecture
Three layers, one principle
The original data is sacred. We never modify it. Instead, we build clean data on top of it — and keep the link between the two, so anyone can trace any published value back to its raw source.
Layer 0
Raw storage
The exact responses we receive from each API, stored as-is. One table per source. Nothing is changed, nothing is lost. This is our ground truth.
Layer 1
Clean & harmonized
All sources converted to common units, all measurements in a single format. Every row carries quality flags and a link to its raw source. This is where cleaning and validation happen.
Layer 2
Published open data
Only measurements that passed every quality check. This is what you download — Parquet files, CSV station listings, GeoJSON for mapping. Clean, reliable, ready to use.
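The three layers above can be sketched in a few lines. This is illustrative only: the field names (`raw_id`, `value_ugm3`, `flags`) are our own, not the project's actual schema.

```python
# Layer 0: stored exactly as received from the source API
raw = {
    "id": 42,
    "source": "example_api",
    "payload": '{"pm25": "18.3", "unit": "ug/m3"}',
}

# Layer 1: common units, common format, quality flags, and a link
# back to the untouched original row
harmonized = {
    "raw_id": raw["id"],      # trace any published value to its raw source
    "parameter": "pm25",
    "value_ugm3": 18.3,
    "flags": [],              # empty list = passed all checks so far
}

# Layer 2: a measurement is publishable only if every check passed
publishable = harmonized["flags"] == []
print(publishable)  # -> True
```

Because Layer 0 is never modified, any Layer 2 value can be audited by following `raw_id` back to the original API response.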
Quality control
Four layers of checks
Each layer catches different kinds of problems. A measurement must pass all four before it reaches the published dataset.
At the door
Before data even enters the system, the database itself enforces basic rules — required fields cannot be empty, duplicates are rejected, coordinates must be valid, only known parameter codes are accepted. Broken or incomplete API responses are caught here.
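Checks of this kind live in the database schema itself, so bad rows never get stored. Here is a minimal sketch using SQLite; the table and column names, parameter codes, and bounds are illustrative, not our production schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE measurements (
        station_id  TEXT NOT NULL,                       -- required field
        parameter   TEXT NOT NULL
                    CHECK (parameter IN ('pm25', 'pm10', 'no2')),  -- known codes only
        observed_at TEXT NOT NULL,
        value       REAL NOT NULL,
        lat         REAL NOT NULL CHECK (lat BETWEEN -90 AND 90),   -- valid coordinates
        lon         REAL NOT NULL CHECK (lon BETWEEN -180 AND 180),
        UNIQUE (station_id, parameter, observed_at)      -- duplicates rejected
    )
""")

conn.execute(
    "INSERT INTO measurements VALUES ('s1', 'pm25', '2024-01-01T00:00', 18.3, 43.2, 76.9)"
)
try:
    # same (station, parameter, timestamp) again -> rejected at the door
    conn.execute(
        "INSERT INTO measurements VALUES ('s1', 'pm25', '2024-01-01T00:00', 19.0, 43.2, 76.9)"
    )
except sqlite3.IntegrityError as err:
    print("rejected:", err)
```

The point is that these rules are enforced unconditionally: even a buggy ingest script cannot write a row with a missing field, an unknown parameter code, or coordinates off the map.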
During cleaning
As we transform raw data into the harmonized format, every measurement is checked individually: Is this value physically possible? Is PM2.5 higher than PM10 (which usually indicates a sensor problem)? Is the concentration negative? Each issue gets a specific flag.
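A per-measurement check of this shape might look like the sketch below. The flag names echo the validation table later on this page, but the function, field names, and bounds are examples, not our exact implementation.

```python
# Physical plausibility bounds per parameter (illustrative values)
PHYSICAL_BOUNDS = {"pm25": (0.0, 1000.0), "pm10": (0.0, 2000.0)}

def flag_measurement(row: dict) -> list[str]:
    """Return the quality flags raised by one harmonized measurement."""
    flags = []
    lo, hi = PHYSICAL_BOUNDS.get(row["parameter"], (0.0, float("inf")))
    if row["value"] < 0:
        flags.append("negative_value")          # concentrations cannot be negative
    elif not (lo <= row["value"] <= hi):
        flags.append("out_of_range")            # physically implausible
    # PM2.5 is a subset of PM10, so it should never exceed PM10
    # measured at the same station and time
    pm10 = row.get("pm10_value")
    if row["parameter"] == "pm25" and pm10 is not None and row["value"] > pm10:
        flags.append("pm25_exceeds_pm10")
    return flags

print(flag_measurement({"parameter": "pm25", "value": 35.0, "pm10_value": 20.0}))
# -> ['pm25_exceeds_pm10']
```

Because each problem maps to a named flag rather than a silent deletion, a flagged row stays traceable: you can see both the value and exactly why it was rejected.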
Deep statistical analysis (PM2.5)
PM2.5 is our most important parameter and the one most prone to sensor problems. We run a dedicated multi-stage cleaning process:
- Hard cap — values above 1,000 µg/m³ are flagged as invalid
- Suspect stations — sensors that report the same value more than 80% of the time, or have implausibly high baselines
- Statistical outliers — using robust Z-scores that adapt to each station's local conditions
- Spikes — isolated single-point jumps that do not match surrounding readings
- Stuck sensors — the same value repeated for 6 or more consecutive hours
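Two of the stages above can be sketched concretely: the robust Z-score (built on the median and the median absolute deviation, which a single bad reading cannot drag around the way a mean and standard deviation can) and the stuck-sensor run check. The 6-hour run length matches the text; the 3.5 Z-score cutoff and the helper names are our illustrative choices.

```python
import statistics

def robust_z_scores(values: list[float]) -> list[float]:
    """Z-scores from median and MAD instead of mean and stddev.
    The 0.6745 factor rescales MAD to match a normal stddev."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return [0.0] * len(values)  # no spread: nothing stands out
    return [0.6745 * (v - med) / mad for v in values]

def is_stuck(values: list[float], min_run: int = 6) -> bool:
    """True if the same value repeats for min_run+ consecutive readings."""
    run = 1
    for prev, cur in zip(values, values[1:]):
        run = run + 1 if cur == prev else 1
        if run >= min_run:
            return True
    return False

hourly = [12.0, 13.5, 12.8, 250.0, 13.1, 12.9]       # one isolated spike
print(max(abs(z) for z in robust_z_scores(hourly)) > 3.5)  # -> True
print(is_stuck([7.0] * 6))                                  # -> True
```

Note how the robust Z-score flags the 250 µg/m³ spike even though the spike itself barely moves the median: that is what "adapts to each station's local conditions" buys over a global threshold.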
Final validation
Before anything is published, we run two independent validation systems across the entire dataset. If either one finds a problem — even a single failing check — publication is blocked until the issue is resolved.
The details
Validation checks
These are the specific checks that run before every publication. Errors block export entirely. Warnings are logged and reviewed.
| Check | Severity | What it catches |
|---|---|---|
| out_of_range | error | Values outside physical bounds |
| negative_values | error | Negative concentrations |
| invalid_floats | error | NaN, Infinity values |
| orphan_measurements | error | Measurements without a registered station |
| duplicate_measurements | error | Should-not-exist duplicates |
| pm25_exceeds_pm10 | warning | Known AirKaz sensor issue |
| stuck_sensors | warning | Frozen sensor readings |
| spikes | warning | Sudden value jumps |
| stale_stations | warning | Active stations not reporting for 48+ hours |
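The severity rule from the table — any failing error blocks export, warnings are only logged — reduces to a small gate. The check names below come from the table; the runner itself and its toy predicates are illustrative.

```python
# Each check: (severity, predicate over all rows). Only three of the
# table's checks are sketched here, with simplified logic.
CHECKS = {
    "negative_values": ("error",
        lambda rows: all(r["value"] >= 0 for r in rows)),
    "invalid_floats":  ("error",
        # NaN fails r == r; Infinity fails the abs() comparison
        lambda rows: all(r["value"] == r["value"] and abs(r["value"]) != float("inf")
                         for r in rows)),
    "stuck_sensors":   ("warning",
        lambda rows: len({r["value"] for r in rows}) > 1),
}

def can_publish(rows: list[dict]) -> bool:
    ok = True
    for name, (severity, passes) in CHECKS.items():
        if not passes(rows):
            print(f"{severity}: {name}")
            if severity == "error":
                ok = False   # a single failing error blocks export
    return ok

print(can_publish([{"value": 12.0}, {"value": -3.0}]))  # -> False
```

A warning-only failure would leave `can_publish` returning `True` while still surfacing the issue in the log for review, which is exactly the errors-block, warnings-log split described above.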
We know this level of detail is not for everyone. Most people just want to download the data and trust that it is clean. We hope they can.
But for those who want to look under the hood — researchers, engineers, anyone who has been burned by bad data before — we want everything to be visible. The methodology, the flags, the raw originals. That is the kind of project we want to be.