9. Februar 2019
Inspecting air pollution data from OpenAQ using Colab, Pandas, and BigQuery
OpenAQ is publishing real-time air quality data from around the world to BigQuery. Today, I have poked around in the dataset to inspect air quality from many places of the world.
Why bother
As mentioned in a news bullet from Google Cloud, U.S. EPA and OpenAQ air quality data now available in BigQuery. I decided to have a look at the data for the following two reasons:
- I want to learn how to work with data from BigQuery in Colab.
- I care about breathing air of good quality.
In case you’re interested in any of the above, continue reading. All of what I am writing here is my personal opinion.
If you haven’t heard of it, BigQuery is a data warehouse by Google for analytics, and allows you to query enormous amounts of data, expressing your needs in SQL. Colab allows you to interactively work with data in Python.
Download Colab notebook
You can find the Colab notebook which I used for this article here:
File: Inspecting_air_pollution_data_from_OpenAQ_using_Colab_Pandas_and_BigQuery.ipynb [135.03 kB]
Category:
Download: 1232
You can open this Colab notebook using Go to File>Upload Notebook… in Colab.
Fetching the OpenAQ data with BigQuery in Colab
OpenAQ (https://openaq.org) “aggregate[s] physical air quality data from public data sources provided by government, research-grade and other sources.” with the purpose of fighting air inequality.
I might not contribute much to OpenAQ’s purpose today (except spreading the word), but am curious about air pollution, particularly since I moved to London about five years ago. I also played with a portable sensor once, see my earlier post on Air Pollution – Green, Yellow, Red: Toying with CleanSpace.
But back to the task at hand. OpenAQ regularly publishes the latest readings on air pollution to a public dataset stored in BigQuery (bigquery-public-data.openaq.global_air_quality
). The OpenAQ air quality dataset is small (at the time of writing, it has about 25k), so much of the potential power behind BigQuery is not really required because I can simply load the whole table into memory in Colab.
I should mention that there is an alternative to Colab, which is Cloud Datalab. Both are based on similar technology (Jupyter notebooks), but Cloud Datalab is more closely integrated into Google Cloud Platform (GCP). Today, I don’t need anything else than a connection to BigQuery, so Colab it is. The very first thing is to authenticate you as a user such that it is possible to run the BigQuery jobs against my GCP project.
#@title Authenticate user from google.colab import auth auth.authenticate_user()
As I said, let’s load the whole dataset into memory.
#@title Pull OpenAQ data via BigQuery %%bigquery --project silicon-perigee-208521 p #standardSQL select pollutant, country, city, location, value from `bigquery-public-data.openaq.global_air_quality`;
Inspecting data with Pandas in Colab
The next step is preparing the data such that I can easily pull out the worst spots reported in the world, and within each country (see pandas.DataFrame.rank).
#@title For each pollutant, calculate rank of each location # within its country and world-wide # Rank within world p = p.assign(world_rank=( p.groupby(['pollutant'])['value'] .rank(method='min', ascending=False) .astype(int))) # Rank within country p = p.assign(country_rank=( p.groupby(['pollutant', 'country'])['value'] .rank(method='min', ascending=False) .astype(int))) # Rank within city p = p.assign(city_rank=( p.groupby(['pollutant', 'country', 'city'])['value'] .rank(method='min', ascending=False) .astype(int))) p
#@title Pollutants in dataset print(', '.join(p['pollutant'].unique()))
pm25, co, bc, no2, pm10, o3, so2
Thus, there are seven different pollutants recorded in the dataset. I was not able to find a reference, but by looking at https://www.epa.gov/criteria-air-pollutants/naaqs-table, and by interpreting some of the codes as empirical formulas for molecules, we have:
no2 | Nitrogen Dioxide |
so2 | Sulphur Dioxide |
o3 | Ozone |
bc | Black Carbon |
co | Carbon Monoxide |
pm25 | Particulate Matter 2.5 |
pm10 | Particulate Matter 10 |
Nitrogen Dioxide (NO2)
Let us look at the three locations with the highest NO2 concentration─of course this only includes locations where there is a measurement in the dataset.
#@title Top three locations by Nitrogen Dioxide (NO2) (p.query('pollutant == "no2"') .nsmallest(3, 'world_rank') [['country', 'city', 'location', 'value', 'world_rank']])
country city location value world_rank 19239 XK Hani i Elezit Hani i Elezit 397.926 1 6078 IN Solapur Solapur, Solapur - MPCB 268.360 2 16204 IN Kalaburagi Lal Bahadur Shastri Nagar... 229.210 3
Hani i Elezit is in Kosovo. Note that these aren’t long term observations, and I can only take a guess here at why this location is listed high. There is a cement factory there, and “cement plants are a significant source of sulfur dioxide, nitrogen oxide and carbon monoxide” according to the EPA (https://www.epa.gov/enforcement/cement-manufacturing-enforcement-initiative).
Solapur is reported as “one of the most-polluted cities in Maharashtra” (https://en.wikipedia.org/wiki/Solapur#Environment). Kalaburagi is also located in India, but I have not come up with any hypotheses here.
#@title Top three locations in Great Britain by Nitrogen Dioxide (NO2) (p.query('pollutant == "no2" & country == "GB"') .nsmallest(3, 'country_rank') [['city', 'location', 'value', 'world_rank']])
city location value world_rank 4585 Belfast Belfast Stockman's Lane 63.0 359 6707 Chepstow Chepstow A48 60.0 418 6442 Aberdeen Aberdeen Union Street Roadside 54.0 544
Sulphur Dioxide (SO2)
#@title Top three locations by Sulphur Dioxide (SO2) (p.query('pollutant == "so2"') .nsmallest(3, 'world_rank') [['country', 'city', 'location', 'value', 'world_rank']])
country city location value world_rank 24629 MK State Air Quality ... Karpos 2.050019e+15 1 7092 CL Hualpén JUNJI 2.088350e+03 2 3412 ES Murcia ES1633A 1.301000e+03 3
The highest readings of SO2 were reported in Skopje, Macedonia (Karpoš is part of Skopje). I did quick research and found last year’s report from UN Enironment about The most polluted capital in Europe, you didn’t even know about, in which you can read that Skopje is Europe’s most pollted capital city (WHO Global Ambient Air Quality Database).
Hualpén has an oil refinery, and oil refineries are known to be sources for sulphur dioxide. So that’s a likely reason for the high readings.
You shall keep in mind here that I have not spent further time on evaluating what the usual range is for amounts of sulphur dioxide in the air. We also have to be careful when searching for maximum values, as any measurement errors (that are way off the scale) can easily distort the results.
#@title Top three locations in Great Britain by Sulphur Dioxide (SO2) (p.query('pollutant == "so2" & country == "GB"') .nsmallest(3, 'country_rank') [['city', 'location', 'value', 'world_rank']])
city location value world_rank 19339 Middlesbrough Middlesbrough 19.0 323 17408 Southampton Southampton Centre 14.0 497 1188 Derry Derry Rosemount 13.0 545
Middlesbrough, Southampton, and Derry (Londonderry) have made the top when I queried the database. These cities also have made it into the news for air pollution in the past:
- https://www.bbc.co.uk/news/uk-northern-ireland-foyle-west-42597054
- https://www.belfasttelegraph.co.uk/news/northern-ireland/derry-and-armagh-among-uks-leading-air-pollution-hotspots-36868078.html
- https://www.theguardian.com/environment/2018/sep/14/two-councils-warned-after-failing-to-meet-air-pollution-deadline
Particulate Matter (PM2.5)
#@title Top three locations by PM2.5 (p.query('pollutant == "pm25"') .nsmallest(3, 'world_rank') [['country', 'city', 'location', 'value', 'world_rank']])
country city location value world_rank 9849 CN 义乌市 江东 690.0 1 1695 IN Patna IGSC Planetarium... 674.0 2 9845 CN 义乌市 北苑 494.0 3
义乌市 ─ that’s Yiwu in Zhejiang, China.
Patna has been previously known as making it high in the list when it comes to air pollution (see references to WHO report in https://www.nytimes.com/2014/05/09/world/asia/cities-in-india-among-the-most-polluted-who-says.html).
#@title Top three locations in Great Britain by PM2.5 (p.query('pollutant == "pm25" & country == "GB"') .nsmallest(3, 'country_rank') [['city', 'location', 'value', 'world_rank']])
city location value world_rank 17442 Storrington Storrington Roadside 62.0 523 19611 West Midlands Birmingham A4540 Roadside 28.0 1118 8973 Newcastle Newcastle Centre 17.0 1672
Newcastle has been mentioned in several reports about poor air quality in 2018 in its local newspaper, the Chronicle Live. More surprising was the result of my quick online research about Storrington, which looks on the map like a quiet town near the South Downs but recently got some fame: traffic-choked Storrington was named by the World Health Organisation among the 32 worst polluted areas in the UK.
Comparing Great Britain and Sweden (Nitrogen Dioxide)
To close this off, let’s just do a little bit of Panda querying and pull out the worst location with regards to NO2 in Great Britain and Sweden, respectively.
#@title Compare the worst locations (NO2) in Great Britain and Sweden (p.query('pollutant == "no2"') .query('country_rank == 1') .query('(country == "GB") | (country == "SE")') [['country', 'city', 'location', 'value', 'world_rank']])
country city location value world_rank 4585 GB Belfast Belfast Stockma... 63.0000 359 9241 SE Stockholm Sveavägen 42.6175 917
Belfast came up in the corresponding query above. And in Sweden, the worst location is near a central street in Stockholm.
Conclusion
You have now seen how we can pull OpenAQ air quality data from BigQuery into Colab, and then to find pollution hotspots within either the world, or within Great Britain. I have tried─with very limited time at hand, and rather superficial research─to come up with hypotheses why some of these places showed high values for pollutants.