In this notebook we are going to see how one could compute statistics for certain geometric areas. Some potential usecases for this might be

  • average temperature per country
  • aggregate land cover per US state
  • average population density per country

and many more. In order to compute zonal statistics, we generally need two things:

  • a variable of interest defined on a raster image (i.e. one value per pixel)
  • a set of geometries that define the relevant areas for aggregation

Although the focus should be on computation of zonal statistics, we will actually also take a quick view on what we can learn from our analysis data while we are at it. The example we are looking at is US land cover.

In [1]:
# import geemap
import geemap as geemap
import ee
import pandas as pd
import tempfile
import matplotlib.pyplot as plt
import seaborn as sns

from egis.utils import (
    get_landcover_class_info
)

from egis.utils_plotting import make_outside_legend
In [2]:
# Settings for charts
plt.rcParams["figure.figsize"] = 14, 6
sns.set()

Variable of interest: US Land cover data

First we will load and visualize a US land cover dataset: NLCD: USGS National Land Cover Database

In [ ]:
Map = geemap.Map()

dataset = ee.Image('USGS/NLCD/NLCD2016')
landcover = ee.Image(dataset.select('landcover'))
Map.addLayer(landcover, {}, 'NLCD 2016')

lat = 40
lon = -84
Map.setCenter(lon, lat, 4)

Map.add_legend(builtin_legend='NLCD')

Map

image.png

Classified land cover will be our variable of interest. A more detailed legend with some explanations for the individual land cover classes can be found here: https://www.mrlc.gov/data/legends/national-land-cover-database-2019-nlcd2019-legend

The land cover dataset also includes some metadata with regards to the individual land cover classes. This metadata can be queried from the data. We are putting it into a DataFrame for later use.

In [4]:
landcover_data_info = landcover.getInfo()
In [5]:
lc_class_info = get_landcover_class_info(landcover_data_info)
lc_class_info
Out[5]:
class_number class_name description class_name_statistics color class_group class_group_name class_group_color
0 11 Open Water All areas of open water, generally with less t... Class_11 #476ba1 1 Water #d1defa
1 12 Perennial Ice/Snow All areas characterized by a perennial cover o... Class_12 #d1defa 1 Water #d1defa
2 21 Developed, Open Space Includes areas with a mixture of some construc... Class_21 #decaca 2 Developed #ab0000
3 22 Developed, Low Intensity Includes areas with a mixture of constructed m... Class_22 #d99482 2 Developed #ab0000
4 23 Developed, Medium Intensity Includes areas with a mixture of constructed m... Class_23 #ee0000 2 Developed #ab0000
5 24 Developed, High Intensity Includes highly developed areas where people r... Class_24 #ab0000 2 Developed #ab0000
6 31 Barren Land (Rock/Sand/Clay) Barren areas of bedrock, desert pavement, scar... Class_31 #b3aea3 3 Barren #b3aea3
7 41 Deciduous Forest Areas dominated by trees generally greater tha... Class_41 #68ab63 4 Forest #b5ca8f
8 42 Evergreen Forest Areas dominated by trees generally greater tha... Class_42 #1c6330 4 Forest #b5ca8f
9 43 Mixed Forest Areas dominated by trees generally greater tha... Class_43 #b5ca8f 4 Forest #b5ca8f
10 51 Dwarf Scrub Alaska only areas dominated by shrubs less tha... Class_51 #a68c30 5 Shrubland #ccba7d
11 52 Shrub/Scrub Areas dominated by shrubs less than 5 meters t... Class_52 #ccba7d 5 Shrubland #ccba7d
12 71 Grassland/Herbaceous Areas dominated by grammanoid or herbaceous ve... Class_71 #e3e3c2 7 Herbaceous #78ae94
13 72 Sedge/Herbaceous Alaska only areas dominated by sedges and forb... Class_72 #caca78 7 Herbaceous #78ae94
14 73 Lichens Alaska only areas dominated by fruticose or fo... Class_73 #99c247 7 Herbaceous #78ae94
15 74 Moss Alaska only areas dominated by mosses generall... Class_74 #78ae94 7 Herbaceous #78ae94
16 81 Pasture/Hay Areas of grasses, legumes, or grass-legume mix... Class_81 #dcd93d 8 Planted/Cultivated #ab7028
17 82 Cultivated Crops Areas used for the production of annual crops,... Class_82 #ab7028 8 Planted/Cultivated #ab7028
18 90 Woody Wetlands Areas where forest or shrub land vegetation ac... Class_90 #bad9eb 9 Wetlands #70a3ba
19 95 Emergent Herbaceous Wetlands Areas where perennial herbaceous vegetation ac... Class_95 #70a3ba 9 Wetlands #70a3ba

Geometries for aggregation: US states

Now we also need a set of geometric areas that we can use for aggregation. For this we will load the TIGER: US Census States 2018 dataset and visualize it.

In [ ]:
Map = geemap.Map()

states = ee.FeatureCollection("TIGER/2018/States")
Map.addLayer(states, {}, 'US States')

lat = 50
lon = -143
Map.setCenter(lon, lat, 3)

Map

image.png

As we can see, the tiger dataset also contains Alaska and other regions / islands apart from the contiguous US states. As we will see, these regions are not reflected in the US land cover dataset, so we will not be able to retrieve land cover statistics for them.

Apart from the geometric information the TIGER dataset contains some further metadata for individual observations:

image.png

This metadata can be retrieved from the data itself:

In [7]:
states_data_info = states.getInfo()
states_data_info['features'][0]['properties'].keys()
Out[7]:
dict_keys(['ALAND', 'AWATER', 'DIVISION', 'FUNCSTAT', 'GEOID', 'INTPTLAT', 'INTPTLON', 'LSAD', 'MTFCC', 'NAME', 'REGION', 'STATEFP', 'STATENS', 'STUSPS'])

We will use ALAND and AWATER data later on in order to double-check our aggregate zonal statistics.

Computation of zonal statistics

In order to query statistical data for the individual states we need to specify an output file to write to. We will just create a temporary file for this which we will immediately load into memory again for further analysis.

In [8]:
out_file_path = tempfile.NamedTemporaryFile().name + '.csv'
In [9]:
geemap.zonal_statistics_by_group(landcover, states, out_file_path, statistics_type='SUM', denominator=1000000, decimal_places=2)
Computing ... 
Generating URL ...
Downloading data from https://earthengine.googleapis.com/v1alpha/projects/earthengine-legacy/tables/64bafed35c132ff4476d96d44c785f9c-000f9999e12f725d5ead397130a20bc7:getFeatures
Please wait ...
Data downloaded to /tmp/tmpdjupkqvx.csv

When we read in the data again we will see that the statistics come with metadata columns attached.

In [10]:
us_landcover_stats_raw = pd.read_csv(out_file_path)
us_landcover_stats_raw.head(3)
Out[10]:
Class_81 Class_71 Class_82 Class_95 Class_41 Class_52 Class_31 Class_42 Class_21 Class_43 ... STATEFP FUNCSTAT INTPTLAT DIVISION REGION NAME INTPTLON MTFCC ALAND system:index
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 78 A 18.326748 0 9 United States Virgin Islands -64.971251 G4000 348021896 00000000000000000022
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 69 A 14.936784 0 9 Commonwealth of the Northern Mariana Islands 145.601021 G4000 472292529 00000000000000000023
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 66 A 13.438289 0 9 Guam 144.772949 G4000 543555840 00000000000000000024

3 rows × 32 columns

The zonal statistics also include a Class_sum column which contains the aggregate area of all individual land cover classes for each of the individual US regions. This column now can be used to cross-check numbers with the area sizes provided by the US states dataset.

In [11]:
us_landcover_stats = us_landcover_stats_raw.set_index('NAME')
us_landcover_stats['approx_area'] = (us_landcover_stats['ALAND'] + us_landcover_stats['AWATER']) / 1000000
us_landcover_stats.loc[:, ['Class_sum', 'approx_area']].tail(5).round()
Out[11]:
Class_sum approx_area
NAME
California 423891.0 423968.0
Oregon 254797.0 254799.0
Washington 184649.0 184672.0
Hawaii 0.0 28412.0
Alaska 0.0 1724321.0
In [12]:
us_landcover_stats.loc[:, ['Class_sum', 'approx_area']].plot(kind='bar')
plt.title('Cross-check for aggregate area numbers and each US state')
plt.xlabel('')
plt.show()

From the chart we can see two things. First, for some regions we get 0 as aggregate area derived from the land cover dataset. The reason for this is what we already mentioned above: land cover is only provided for the contiguous US states, but not for Hawaii, Alaska and other islands. Second, for all states where we do have land cover values the aggregate US state areas match almost perfectly. Hence, the zonal statistics seem to work as expected.

Now that we are confident about the zonal statistics data, let's first have a look at land cover statistics for the full US (except for Hawaii, ...).

In [13]:
existing_classes = [this_col for this_col in us_landcover_stats.columns if this_col in lc_class_info['class_name_statistics'].values]
us_landcover_class_stats = us_landcover_stats.loc[:, existing_classes]
In [14]:
aggregate_us_land_cover = us_landcover_class_stats.sum(axis=0).reset_index()
aggregate_us_land_cover.columns = ['class_name_statistics', 'aggregate_area']
aggregate_us_land_cover = aggregate_us_land_cover.merge(lc_class_info)
In [15]:
rel_aggregate_us_land_cover = aggregate_us_land_cover.loc[:, ['class_name', 'aggregate_area', 'color']].set_index(['class_name', 'color'])
rel_aggregate_us_land_cover = rel_aggregate_us_land_cover / rel_aggregate_us_land_cover.sum() * 100
rel_aggregate_us_land_cover = rel_aggregate_us_land_cover.sort_values('aggregate_area')
rel_aggregate_us_land_cover = rel_aggregate_us_land_cover.reset_index().set_index('class_name')

colors = rel_aggregate_us_land_cover['color'].values
In [16]:
rel_aggregate_us_land_cover['aggregate_area'].plot(kind='barh', figsize=(12, 10), width=0.9, color=colors)
plt.xlabel('Covered area in percentages')
plt.ylabel('Land cover class')
plt.title('US land cover areas per individual land cover classes')
plt.show()

Next, we will analyse land cover areas for individual US states. But in order to make things slightly more manageable, we also aggregate land cover classes to slightly broader categories.

In [17]:
# compute percentage land cover values
us_landcover_class_stats_rel = us_landcover_class_stats.div(us_landcover_class_stats.sum(axis=1), axis=0) * 100
us_landcover_class_stats_rel = us_landcover_class_stats_rel.dropna(axis=0)
us_landcover_class_stats_rel = us_landcover_class_stats_rel.reset_index().melt(id_vars='NAME')
us_landcover_class_stats_rel.columns = ['state', 'class_name_statistics', 'area_pct']
us_landcover_class_stats_rel = us_landcover_class_stats_rel.merge(lc_class_info).loc[:, ['state', 'area_pct', 'class_name', 'class_group_name']]

# aggregate values to land cover group classes
us_landcover_class_group_stats_rel = us_landcover_class_stats_rel.loc[:, ['state', 'area_pct', 'class_group_name']]
us_landcover_class_group_stats_rel = us_landcover_class_group_stats_rel.groupby(['state', 'class_group_name']).sum()
us_landcover_class_group_stats_rel = us_landcover_class_group_stats_rel.reset_index()

# transform to wide format
us_landcover_wide = us_landcover_class_group_stats_rel.pivot(index='state', columns='class_group_name', values='area_pct')
us_landcover_wide = us_landcover_wide.iloc[::-1]

# define color coding for group classes
color_lookup = lc_class_info.loc[:, ['class_group_name', 'class_group_color']].drop_duplicates().set_index('class_group_name')
colors = color_lookup.loc[us_landcover_wide.columns.values, 'class_group_color'].values

# visualize
ax = us_landcover_wide.plot(kind='barh', stacked=True, figsize=(14, 18), width=0.9, color=colors)

make_outside_legend(ax, vspace=-0.05)
plt.title('Land cover areas per US state and land cover type')
plt.xlabel('Covered area in percentages')
plt.show()

From the chart we can already see that US states differ significantly with regards to average land cover ratios. For example, states like Iowa and Illinois seem to be perfectly suited for agricultural production, while states like Arizona or Nevada consist predominantly of shrubland.

A different way to see this is by showing the three states with highest coverage for each individual land cover class:

In [18]:
states_with_biggest_rel_area_per_class_type = us_landcover_class_group_stats_rel.sort_values(['class_group_name', 'area_pct']).groupby('class_group_name').tail(3).set_index('class_group_name')
states_with_biggest_rel_area_per_class_type
Out[18]:
state area_pct
class_group_name
Barren Nevada 3.303700
Barren California 4.551481
Barren Utah 7.965546
Developed Connecticut 18.277140
Developed New Jersey 24.532109
Developed District of Columbia 78.254237
Forest Vermont 72.245349
Forest New Hampshire 75.641841
Forest West Virginia 84.398662
Herbaceous Montana 37.544941
Herbaceous South Dakota 48.366761
Herbaceous Nebraska 53.002988
Planted/Cultivated Indiana 65.695892
Planted/Cultivated Illinois 71.506586
Planted/Cultivated Iowa 85.057169
Shrubland New Mexico 62.354696
Shrubland Nevada 72.480131
Shrubland Arizona 72.751338
Water Massachusetts 26.109717
Water Rhode Island 32.960488
Water Michigan 41.358464
Wetlands South Carolina 23.103330
Wetlands Louisiana 29.575701
Wetlands Florida 31.315429

We can also visualize these values in a bar chart:

In [19]:
color_dict = color_lookup.reset_index().sort_values('class_group_name').set_index('class_group_name').to_dict()['class_group_color']
In [20]:
fig = plt.figure(figsize=(12, 10))

xx = states_with_biggest_rel_area_per_class_type.iloc[::-1].reset_index().reset_index()
colors = xx['class_group_name'].map(color_dict)
xx['color'] = colors

plt.barh(xx['index'], xx['area_pct'], color=xx['color'])

# Create names on the x-axis
plt.yticks(xx['index'], xx['state'])

colors = color_dict      
labels = list(colors.keys())
handles = [plt.Rectangle((0,0),1,1, color=colors[label]) for label in labels]
plt.legend(handles, labels, loc='upper center', bbox_to_anchor=(0.5, -0.1), fancybox=False, shadow=False, ncol=len(labels))

plt.title('Top 3 US states per land cover class')
plt.xlabel('Covered are in percetages')
plt.show()

The source of the notebook can be found here