In this notebook we are going to see how one could compute statistics for certain geometric areas. Some potential usecases for this might be

average temperature per country
aggregate land cover per US state
average population density per country

and many more. In order to compute zonal statistics, we generally need two things:

a variable of interest defined on a raster image (i.e. one value per pixel)
a set of geometries that define the relevant areas for aggregation

Although the focus should be on computation of zonal statistics, we will actually also take a quick view on what we can learn from our analysis data while we are at it. The example we are looking at is US land cover.

In [1]:

# import geemap
import geemap as geemap
import ee
import pandas as pd
import tempfile
import matplotlib.pyplot as plt
import seaborn as sns

from egis.utils import (
    get_landcover_class_info
)

from egis.utils_plotting import make_outside_legend

In [2]:

# Settings for charts
plt.rcParams["figure.figsize"] = 14, 6
sns.set()

Variable of interest: US Land cover data¶

First we will load and visualize a US land cover dataset: NLCD: USGS National Land Cover Database

In [ ]:

Map = geemap.Map()

dataset = ee.Image('USGS/NLCD/NLCD2016')
landcover = ee.Image(dataset.select('landcover'))
Map.addLayer(landcover, {}, 'NLCD 2016')

lat = 40
lon = -84
Map.setCenter(lon, lat, 4)

Map.add_legend(builtin_legend='NLCD')

Map

Classified land cover will be our variable of interest. A more detailed legend with some explanations for the individual land cover classes can be found here: https://www.mrlc.gov/data/legends/national-land-cover-database-2019-nlcd2019-legend

The land cover dataset also includes some metadata with regards to the individual land cover classes. This metadata can be queried from the data. We are putting it into a DataFrame for later use.

In [4]:

landcover_data_info = landcover.getInfo()

In [5]:

lc_class_info = get_landcover_class_info(landcover_data_info)
lc_class_info

Out[5]:

	class_number	class_name	description	class_name_statistics	color	class_group	class_group_name	class_group_color
0	11	Open Water	All areas of open water, generally with less t...	Class_11	#476ba1	1	Water	#d1defa
1	12	Perennial Ice/Snow	All areas characterized by a perennial cover o...	Class_12	#d1defa	1	Water	#d1defa
2	21	Developed, Open Space	Includes areas with a mixture of some construc...	Class_21	#decaca	2	Developed	#ab0000
3	22	Developed, Low Intensity	Includes areas with a mixture of constructed m...	Class_22	#d99482	2	Developed	#ab0000
4	23	Developed, Medium Intensity	Includes areas with a mixture of constructed m...	Class_23	#ee0000	2	Developed	#ab0000
5	24	Developed, High Intensity	Includes highly developed areas where people r...	Class_24	#ab0000	2	Developed	#ab0000
6	31	Barren Land (Rock/Sand/Clay)	Barren areas of bedrock, desert pavement, scar...	Class_31	#b3aea3	3	Barren	#b3aea3
7	41	Deciduous Forest	Areas dominated by trees generally greater tha...	Class_41	#68ab63	4	Forest	#b5ca8f
8	42	Evergreen Forest	Areas dominated by trees generally greater tha...	Class_42	#1c6330	4	Forest	#b5ca8f
9	43	Mixed Forest	Areas dominated by trees generally greater tha...	Class_43	#b5ca8f	4	Forest	#b5ca8f
10	51	Dwarf Scrub	Alaska only areas dominated by shrubs less tha...	Class_51	#a68c30	5	Shrubland	#ccba7d
11	52	Shrub/Scrub	Areas dominated by shrubs less than 5 meters t...	Class_52	#ccba7d	5	Shrubland	#ccba7d
12	71	Grassland/Herbaceous	Areas dominated by grammanoid or herbaceous ve...	Class_71	#e3e3c2	7	Herbaceous	#78ae94
13	72	Sedge/Herbaceous	Alaska only areas dominated by sedges and forb...	Class_72	#caca78	7	Herbaceous	#78ae94
14	73	Lichens	Alaska only areas dominated by fruticose or fo...	Class_73	#99c247	7	Herbaceous	#78ae94
15	74	Moss	Alaska only areas dominated by mosses generall...	Class_74	#78ae94	7	Herbaceous	#78ae94
16	81	Pasture/Hay	Areas of grasses, legumes, or grass-legume mix...	Class_81	#dcd93d	8	Planted/Cultivated	#ab7028
17	82	Cultivated Crops	Areas used for the production of annual crops,...	Class_82	#ab7028	8	Planted/Cultivated	#ab7028
18	90	Woody Wetlands	Areas where forest or shrub land vegetation ac...	Class_90	#bad9eb	9	Wetlands	#70a3ba
19	95	Emergent Herbaceous Wetlands	Areas where perennial herbaceous vegetation ac...	Class_95	#70a3ba	9	Wetlands	#70a3ba

Geometries for aggregation: US states¶

Now we also need a set of geometric areas that we can use for aggregation. For this we will load the TIGER: US Census States 2018 dataset and visualize it.

In [ ]:

Map = geemap.Map()

states = ee.FeatureCollection("TIGER/2018/States")
Map.addLayer(states, {}, 'US States')

lat = 50
lon = -143
Map.setCenter(lon, lat, 3)

Map

As we can see, the tiger dataset also contains Alaska and other regions / islands apart from the contiguous US states. As we will see, these regions are not reflected in the US land cover dataset, so we will not be able to retrieve land cover statistics for them.

Apart from the geometric information the TIGER dataset contains some further metadata for individual observations:

This metadata can be retrieved from the data itself:

In [7]:

states_data_info = states.getInfo()
states_data_info['features'][0]['properties'].keys()

Out[7]:

dict_keys(['ALAND', 'AWATER', 'DIVISION', 'FUNCSTAT', 'GEOID', 'INTPTLAT', 'INTPTLON', 'LSAD', 'MTFCC', 'NAME', 'REGION', 'STATEFP', 'STATENS', 'STUSPS'])

We will use ALAND and AWATER data later on in order to double-check our aggregate zonal statistics.

Computation of zonal statistics¶

In order to query statistical data for the individual states we need to specify an output file to write to. We will just create a temporary file for this which we will immediately load into memory again for further analysis.

In [8]:

out_file_path = tempfile.NamedTemporaryFile().name + '.csv'

In [9]:

geemap.zonal_statistics_by_group(landcover, states, out_file_path, statistics_type='SUM', denominator=1000000, decimal_places=2)

Computing ... 
Generating URL ...
Downloading data from https://earthengine.googleapis.com/v1alpha/projects/earthengine-legacy/tables/64bafed35c132ff4476d96d44c785f9c-000f9999e12f725d5ead397130a20bc7:getFeatures
Please wait ...
Data downloaded to /tmp/tmpdjupkqvx.csv

When we read in the data again we will see that the statistics come with metadata columns attached.

In [10]:

us_landcover_stats_raw = pd.read_csv(out_file_path)
us_landcover_stats_raw.head(3)

Out[10]:

	...	STATEFP	FUNCSTAT	INTPTLAT	REGION	NAME	INTPTLON	MTFCC	ALAND	system:index
0	...	78	A	18.326748	9	United States Virgin Islands	-64.971251	G4000	348021896	00000000000000000022
1	...	69	A	14.936784	9	Commonwealth of the Northern Mariana Islands	145.601021	G4000	472292529	00000000000000000023
2	...	66	A	13.438289	9	Guam	144.772949	G4000	543555840	00000000000000000024

3 rows × 32 columns

The zonal statistics also include a Class_sum column which contains the aggregate area of all individual land cover classes for each of the individual US regions. This column now can be used to cross-check numbers with the area sizes provided by the US states dataset.

In [11]:

us_landcover_stats = us_landcover_stats_raw.set_index('NAME')
us_landcover_stats['approx_area'] = (us_landcover_stats['ALAND'] + us_landcover_stats['AWATER']) / 1000000
us_landcover_stats.loc[:, ['Class_sum', 'approx_area']].tail(5).round()

Out[11]:

	Class_sum	approx_area
NAME
California	423891.0	423968.0
Oregon	254797.0	254799.0
Washington	184649.0	184672.0
Hawaii	0.0	28412.0
Alaska	0.0	1724321.0

In [12]:

us_landcover_stats.loc[:, ['Class_sum', 'approx_area']].plot(kind='bar')
plt.title('Cross-check for aggregate area numbers and each US state')
plt.xlabel('')
plt.show()

From the chart we can see two things. First, for some regions we get 0 as aggregate area derived from the land cover dataset. The reason for this is what we already mentioned above: land cover is only provided for the contiguous US states, but not for Hawaii, Alaska and other islands. Second, for all states where we do have land cover values the aggregate US state areas match almost perfectly. Hence, the zonal statistics seem to work as expected.

Now that we are confident about the zonal statistics data, let's first have a look at land cover statistics for the full US (except for Hawaii, ...).

In [13]:

existing_classes = [this_col for this_col in us_landcover_stats.columns if this_col in lc_class_info['class_name_statistics'].values]
us_landcover_class_stats = us_landcover_stats.loc[:, existing_classes]

In [14]:

aggregate_us_land_cover = us_landcover_class_stats.sum(axis=0).reset_index()
aggregate_us_land_cover.columns = ['class_name_statistics', 'aggregate_area']
aggregate_us_land_cover = aggregate_us_land_cover.merge(lc_class_info)

In [15]:

rel_aggregate_us_land_cover = aggregate_us_land_cover.loc[:, ['class_name', 'aggregate_area', 'color']].set_index(['class_name', 'color'])
rel_aggregate_us_land_cover = rel_aggregate_us_land_cover / rel_aggregate_us_land_cover.sum() * 100
rel_aggregate_us_land_cover = rel_aggregate_us_land_cover.sort_values('aggregate_area')
rel_aggregate_us_land_cover = rel_aggregate_us_land_cover.reset_index().set_index('class_name')

colors = rel_aggregate_us_land_cover['color'].values

In [16]:

rel_aggregate_us_land_cover['aggregate_area'].plot(kind='barh', figsize=(12, 10), width=0.9, color=colors)
plt.xlabel('Covered area in percentages')
plt.ylabel('Land cover class')
plt.title('US land cover areas per individual land cover classes')
plt.show()

Next, we will analyse land cover areas for individual US states. But in order to make things slightly more manageable, we also aggregate land cover classes to slightly broader categories.

In [17]:

# compute percentage land cover values
us_landcover_class_stats_rel = us_landcover_class_stats.div(us_landcover_class_stats.sum(axis=1), axis=0) * 100
us_landcover_class_stats_rel = us_landcover_class_stats_rel.dropna(axis=0)
us_landcover_class_stats_rel = us_landcover_class_stats_rel.reset_index().melt(id_vars='NAME')
us_landcover_class_stats_rel.columns = ['state', 'class_name_statistics', 'area_pct']
us_landcover_class_stats_rel = us_landcover_class_stats_rel.merge(lc_class_info).loc[:, ['state', 'area_pct', 'class_name', 'class_group_name']]

# aggregate values to land cover group classes
us_landcover_class_group_stats_rel = us_landcover_class_stats_rel.loc[:, ['state', 'area_pct', 'class_group_name']]
us_landcover_class_group_stats_rel = us_landcover_class_group_stats_rel.groupby(['state', 'class_group_name']).sum()
us_landcover_class_group_stats_rel = us_landcover_class_group_stats_rel.reset_index()

# transform to wide format
us_landcover_wide = us_landcover_class_group_stats_rel.pivot(index='state', columns='class_group_name', values='area_pct')
us_landcover_wide = us_landcover_wide.iloc[::-1]

# define color coding for group classes
color_lookup = lc_class_info.loc[:, ['class_group_name', 'class_group_color']].drop_duplicates().set_index('class_group_name')
colors = color_lookup.loc[us_landcover_wide.columns.values, 'class_group_color'].values

# visualize
ax = us_landcover_wide.plot(kind='barh', stacked=True, figsize=(14, 18), width=0.9, color=colors)

make_outside_legend(ax, vspace=-0.05)
plt.title('Land cover areas per US state and land cover type')
plt.xlabel('Covered area in percentages')
plt.show()

From the chart we can already see that US states differ significantly with regards to average land cover ratios. For example, states like Iowa and Illinois seem to be perfectly suited for agricultural production, while states like Arizona or Nevada consist predominantly of shrubland.

A different way to see this is by showing the three states with highest coverage for each individual land cover class:

In [18]:

states_with_biggest_rel_area_per_class_type = us_landcover_class_group_stats_rel.sort_values(['class_group_name', 'area_pct']).groupby('class_group_name').tail(3).set_index('class_group_name')
states_with_biggest_rel_area_per_class_type

Out[18]:

	state	area_pct
class_group_name
Barren	Nevada	3.303700
Barren	California	4.551481
Barren	Utah	7.965546
Developed	Connecticut	18.277140
Developed	New Jersey	24.532109
Developed	District of Columbia	78.254237
Forest	Vermont	72.245349
Forest	New Hampshire	75.641841
Forest	West Virginia	84.398662
Herbaceous	Montana	37.544941
Herbaceous	South Dakota	48.366761
Herbaceous	Nebraska	53.002988
Planted/Cultivated	Indiana	65.695892
Planted/Cultivated	Illinois	71.506586
Planted/Cultivated	Iowa	85.057169
Shrubland	New Mexico	62.354696
Shrubland	Nevada	72.480131
Shrubland	Arizona	72.751338
Water	Massachusetts	26.109717
Water	Rhode Island	32.960488
Water	Michigan	41.358464
Wetlands	South Carolina	23.103330
Wetlands	Louisiana	29.575701
Wetlands	Florida	31.315429

We can also visualize these values in a bar chart:

In [19]:

color_dict = color_lookup.reset_index().sort_values('class_group_name').set_index('class_group_name').to_dict()['class_group_color']

In [20]:

fig = plt.figure(figsize=(12, 10))

xx = states_with_biggest_rel_area_per_class_type.iloc[::-1].reset_index().reset_index()
colors = xx['class_group_name'].map(color_dict)
xx['color'] = colors

plt.barh(xx['index'], xx['area_pct'], color=xx['color'])

# Create names on the x-axis
plt.yticks(xx['index'], xx['state'])

colors = color_dict      
labels = list(colors.keys())
handles = [plt.Rectangle((0,0),1,1, color=colors[label]) for label in labels]
plt.legend(handles, labels, loc='upper center', bbox_to_anchor=(0.5, -0.1), fancybox=False, shadow=False, ncol=len(labels))

plt.title('Top 3 US states per land cover class')
plt.xlabel('Covered are in percetages')
plt.show()