Datasheet for Dataset

Following the framework by Gebru et al. (2021) for transparent dataset documentation.

Motivation

For what purpose was the dataset created?

AirData.kz was created to fill a critical gap in publicly available air quality data for Kazakhstan. Government monitoring data was scattered across agencies, reported in inconsistent formats and units, and often inaccessible to researchers and the public. Historical readings were routinely deleted after short retention periods. The dataset was created to collect, harmonize, clean, and permanently archive air quality measurements from every available source — making them freely accessible for research, journalism, public health analysis, and civic awareness.

Who created the dataset and on behalf of which entity?

The dataset was created and is maintained by volunteers from the Global Shapers Almaty Hub, an initiative of the World Economic Forum. AirData.kz operates as a non-profit, non-commercial open data project with no corporate affiliation.

Who funded the creation of the dataset?

The project is entirely self-funded by its volunteers. Infrastructure costs (server, domain, API access) are covered through personal contributions and occasional public donations. There are no grants, corporate sponsors, or government funding.

Composition

What do the instances represent?

Each instance is a single air quality or meteorological measurement: one parameter, at one station, at one point in time. For example: "PM2.5 = 45.2 µg/m³ at station KGMT-040 on 2024-01-15 at 14:00 UTC+6."

How many instances are there in total?

As of March 2026, the dataset contains approximately:

  • 45+ million raw KGMT (government) readings
  • 2.3 million WAQI (international aggregator) readings
  • 60,000+ OpenAQ readings
  • 664,000+ cleaned Almaty PM2.5 hourly readings
  • 1,094 monitoring stations in the station registry

The dataset grows continuously as new data is ingested every 20 minutes from active sources.

Does the dataset contain all possible instances or is it a sample?

The dataset aims to be a census, not a sample — we attempt to collect every available reading from every source. However, it is inherently incomplete: government stations have downtime, sensors go offline, and some historical data was lost before we began archiving in 2019. Coverage varies by city and time period. Almaty has the densest coverage (5 sources, 200+ stations). Other cities rely primarily on KGMT government stations.

What data does each instance consist of?

Each measurement record contains:

  • Station identifier and geographic coordinates (latitude, longitude)
  • Timestamp (UTC and local timezone)
  • Parameter code (e.g., pm25, no2, co, temperature)
  • Measured value in harmonized units (µg/m³ for concentrations)
  • Original raw value and original unit (preserved for auditability)
  • Data source identifier (kgmt, airgradient, openaq, waqi, airkaz)
  • Quality control flag (raw, clean, suspect, invalid)
  • QC reason code if flagged (e.g., spike, stuck_sensor, cluster_outlier)
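The fields above can be sketched as a record type. This is a hypothetical layout for illustration only — the actual column names in the published CSV/Parquet files may differ:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record mirroring the fields listed above; the published
# files' actual column names and types may differ.
@dataclass
class Measurement:
    station_id: str            # e.g. "KGMT-040"
    latitude: float
    longitude: float
    timestamp_utc: str         # ISO 8601, UTC
    parameter: str             # e.g. "pm25", "no2", "temperature"
    value: float               # harmonized units (µg/m³ for concentrations)
    raw_value: float           # original value as received
    raw_unit: str              # original unit, preserved for auditability
    source: str                # kgmt | airgradient | openaq | waqi | airkaz
    qc_flag: str               # raw | clean | suspect | invalid
    qc_reason: Optional[str] = None  # e.g. "spike", "stuck_sensor"

# The example reading from the Composition section (14:00 UTC+6 = 08:00 UTC):
m = Measurement("KGMT-040", 43.238, 76.945, "2024-01-15T08:00:00Z",
                "pm25", 45.2, 0.0452, "mg/m3", "kgmt", "clean")
```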

Is there a label or target associated with each instance?

No. This is an observational dataset, not a labeled dataset for supervised learning. However, each measurement carries a quality control flag (clean / suspect / invalid) assigned by our automated cleaning pipeline, which could be used as a label for data quality research.

Is any information missing from individual instances?

Yes. Common sources of missing data include: sensor downtime (no reading recorded), network outages affecting data transmission, government stations reporting only during business hours (some early KGMT data), and parameters not measured by all stations (e.g., only KGMT measures H₂S and SO₂, only AirGradient measures CO₂ and TVOC). Missing values are represented as NULL — we never impute or interpolate.

Are there errors, sources of noise, or redundancies?

Yes, extensively documented. Known issues include: sensor drift and calibration errors (especially in low-cost sensors), stuck sensors reporting identical values for hours, sudden spikes from electromagnetic interference, government station unit changes without notification, and overlapping coverage between sources (e.g., KGMT stations also appear in WAQI feeds). Our 7-stage cleaning pipeline specifically targets these issues, and all flags are preserved in the published data.

Is the dataset self-contained?

Yes. The published dataset (CSV, Parquet, GeoJSON files) is fully self-contained and does not require external resources. The station registry includes all necessary metadata (coordinates, source, operator). Raw data from upstream APIs is archived in our database — the published files do not depend on these APIs remaining available.

Does the dataset contain confidential data?

No. All data consists of environmental measurements from fixed monitoring stations in public locations. No personal data is collected. Station locations are public infrastructure coordinates.

Does the dataset contain offensive content?

No. The dataset contains only numerical measurements and station metadata.

Collection Process

How was the data acquired?

Data is directly observed by physical instruments (hardware sensors and reference-grade monitors). It is acquired through automated ingestion from five sources: KazHydroMet (government REST API), AirGradient (public sensor API), OpenAQ (open data API), WAQI/aqicn.org (public feeds), and AirKaz (historical CSV archives). Additionally, historical government data (2018-2022) was extracted from Excel spreadsheets provided directly by KazHydroMet under an official data-sharing agreement.


What mechanisms were used to collect the data?

Collection mechanisms by source:

  • KGMT — Reference-grade analyzers (BAM, chemiluminescence, UV fluorescence) housed in climate-controlled stations. Data transmitted to central server, accessed via REST API.
  • AirGradient — Low-cost optical particle counters (PMS5003) with CO₂ (SenseAir S8) and TVOC sensors. Data uploaded via WiFi to AirGradient cloud API.
  • OpenAQ — Aggregation platform that collects from government and research networks worldwide. We poll their v3 API.
  • WAQI — Aggregation platform using government feeds. We poll their JSON API.
  • AirKaz — Historical low-cost sensor network (2017-2020) with daily CSV exports.

Who was involved in the data collection?

Upstream data collection is performed by the operating organizations (KazHydroMet, AirGradient sensor owners, etc.). AirData.kz's role is aggregation, not primary collection. Our automated pipeline fetches, harmonizes, and archives the data. Pipeline development and maintenance are handled by project volunteers.

Over what timeframe was the data collected?

Earliest records: March 2017 (AirKaz PM2.5 sensors in Almaty). KGMT data: June 2020 to present (with backfilled Excel data from 2018). AirGradient: ongoing real-time polling. OpenAQ: 2024-present. WAQI: October 2025-present. Collection is continuous and ongoing — new data is ingested every 20 minutes.

Preprocessing, Cleaning, and Labeling

Was any preprocessing/cleaning done?

Yes, extensively. The cleaning pipeline applies seven stages to all parameters:

  • S1: Negative value and NULL filtering — applied at database insertion
  • S2: Hard cap — physically implausible values flagged as invalid (e.g., PM2.5 > 1,000 µg/m³)
  • S3a: Constant station detection — stations reporting identical values ≥70% of a month
  • S3b: Implausible baseline — station-month medians exceeding realistic thresholds
  • S3c: Dead sensor detection — stations reporting >80% zero values
  • S4: Statistical outlier detection — robust Z-scores (MAD-based) with partial pooling
  • S5: Singleton spike detection — isolated jumps >10× from neighboring readings
  • S6: Stuck sensor detection — identical consecutive values for ≥6 hours
  • S7: Cluster outlier detection — daily station averages >3 robust-Z from geographic cluster median

Unit harmonization is also performed: KGMT mg/m³ → µg/m³ (×1000), WAQI AQI index → µg/m³ (EPA breakpoint reverse conversion), KGMT pressure mmHg → hPa.
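The unit conversions above are straightforward scalings, except the AQI reversal, which interpolates within breakpoint bands. A sketch, using the pre-2024 EPA PM2.5 breakpoint table as an assumption — the project's actual table may differ:

```python
def mgm3_to_ugm3(v):
    """KGMT concentrations: mg/m³ → µg/m³."""
    return v * 1000.0

def mmhg_to_hpa(v):
    """KGMT pressure: mmHg → hPa (1 mmHg = 1.33322 hPa)."""
    return v * 1.33322

# (AQI_lo, AQI_hi, C_lo, C_hi) — pre-2024 EPA PM2.5 breakpoints, assumed here.
PM25_BREAKPOINTS = [
    (0, 50, 0.0, 12.0),
    (51, 100, 12.1, 35.4),
    (101, 150, 35.5, 55.4),
    (151, 200, 55.5, 150.4),
    (201, 300, 150.5, 250.4),
    (301, 500, 250.5, 500.4),
]

def aqi_to_pm25(aqi):
    """Linearly interpolate an AQI value back to a µg/m³ concentration."""
    for i_lo, i_hi, c_lo, c_hi in PM25_BREAKPOINTS:
        if i_lo <= aqi <= i_hi:
            return c_lo + (aqi - i_lo) * (c_hi - c_lo) / (i_hi - i_lo)
    raise ValueError(f"AQI {aqi} outside supported range")
```

Note that the reverse conversion is only approximate: each AQI integer maps back to a concentration interval, so the recovered µg/m³ value carries quantization error.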

Was the raw data saved?

Yes. Raw data is preserved exactly as received in dedicated Layer 0 tables (one per source). Every published measurement retains a link to its raw source, including the original value and original unit. The raw data is never modified.

Is the cleaning software available?

Yes. The entire pipeline, including all cleaning stages, is open source and available in the project's GitHub repository.

Uses

Has the dataset been used for any tasks already?

Yes:

  • AirData-AI — an AI-powered analytics tool that answers natural language questions about air quality using the dataset
  • Calendar heatmap visualizations on airdata.kz showing daily PM2.5 levels
  • Cigarette equivalent calculator (Berkeley Earth methodology) for public health awareness
  • Personal exposure estimator based on daily activity patterns
  • Internal research on seasonal and geographic patterns of air pollution in Kazakhstan

What other tasks could the dataset be used for?

Epidemiological research linking air quality to health outcomes. Urban planning and transportation policy analysis. Climate and weather pattern studies. Machine learning research on time-series anomaly detection, sensor fusion, or air quality forecasting. Environmental journalism investigations. Education in data science, environmental science, or public health courses.

Is there anything about composition or collection that might impact future uses?

Yes. Users should be aware that: (1) Station density varies significantly by city — Almaty has 200+ stations while smaller cities may have only 1-3. (2) Temporal coverage is uneven — some sources only started in 2023-2025. (3) Low-cost sensors (AirGradient, AirKaz) have lower accuracy than government reference monitors. (4) KGMT data before 2023 was backfilled from Excel archives, which had formatting inconsistencies that required manual correction. (5) Our cleaning pipeline flags ~11% of Almaty PM2.5 data as suspect or invalid — users should decide whether to include flagged data based on their use case.
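Point (5) above recommends deciding how to handle flagged data; keeping only pipeline-approved readings is one option. A minimal sketch using the standard library, with hypothetical column names that may differ from the published files:

```python
import csv
import io

# Hypothetical extract of the published CSV; column names are assumptions.
sample = """station_id,timestamp_utc,parameter,value,qc_flag
KGMT-040,2024-01-15T08:00:00Z,pm25,45.2,clean
KGMT-040,2024-01-15T09:00:00Z,pm25,980.0,suspect
KGMT-041,2024-01-15T08:00:00Z,pm25,38.7,clean
"""

# Keep only readings the cleaning pipeline marked as clean.
rows = csv.DictReader(io.StringIO(sample))
clean = [r for r in rows if r["qc_flag"] == "clean"]
values = [float(r["value"]) for r in clean]
```

For stricter analyses one might also drop entire station-months that stage S3 flagged; for anomaly-detection research one would instead keep the suspect rows as the objects of study.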

Are there tasks for which the dataset should not be used?

The dataset should not be used for: real-time health emergency alerts (use official government sources for that), regulatory compliance or legal proceedings (we are not an accredited monitoring network), individual-level health risk assessment without professional guidance, or as ground truth for training models without understanding the QC flags and known limitations.

Distribution

Will the dataset be distributed to third parties?

Yes. The dataset is publicly available to anyone, without restriction. It is distributed via the AirData.kz website and GitHub repository.

How will the dataset be distributed?

Multiple formats: compressed CSV files (partitioned by city and parameter), Apache Parquet files (for analytics), GeoJSON (for mapping), and a station registry CSV. All files are available for direct download from airdata.kz/data/ and from the GitHub repository.

Will the dataset be distributed under a license?

The dataset is released under open terms with no restrictions on use. Attribution is appreciated but not required. Upstream data sources have their own terms: KGMT data is shared under an official agreement with KazHydroMet for research use, AirGradient and OpenAQ data are open under their respective policies, and WAQI data is subject to their terms of use.

Maintenance

Who is maintaining the dataset?

The AirData.kz volunteer team, operating under the Global Shapers Almaty Hub.

How can the maintainer be contacted?

Via email at airdatakz@gmail.com.

Will the dataset be updated?

Yes, continuously. New data is ingested every 20 minutes from active sources. Daily aggregations and quality checks run automatically. CSV/Parquet exports are regenerated daily. The station registry is updated as new stations come online.

Will older versions continue to be available?

Historical data is never deleted — the dataset is append-only. Older measurements remain in the dataset permanently. However, quality flags may be updated if our cleaning methodology improves. The raw data layer is immutable.

Can others contribute to the dataset?

Yes. The project is open source. Contributors can submit pull requests to improve the pipeline code, suggest new data sources, or report data quality issues via GitHub. We also welcome partnerships with monitoring networks that want their data included.

Citation

If you use this dataset in research or publication, please cite:

AirData.kz. Open Air Quality Dataset for Kazakhstan. Global Shapers Almaty Hub, 2019-present. Available at: https://airdata.kz

About this document

This datasheet follows the "Datasheets for Datasets" framework proposed by Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J.W., Wallach, H., Daumé III, H., and Crawford, K. (2021). Communications of the ACM, 64(12), 86-92.