GeeFlow is a library for generating large-scale geospatial datasets using Google Earth Engine (GEE). It contains utilities, configs, and pipeline launch scripts for dataset generation. The focus is on supporting geospatial AI research; it does not aim to be a production-ready utility.
The datasets created conform to the TFDS format and can be used directly in `tf.data.Dataset` data pipelines.
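For example, a generated dataset can be loaded like any other TFDS dataset (a minimal sketch, assuming a dataset named `demo` built under `/tmp/tfds`, as in the demo run below):

import tensorflow_datasets as tfds

# Load the generated dataset directly as a tf.data.Dataset.
ds = tfds.load("demo", split="train", data_dir="/tmp/tfds")
for example in ds.take(1):
  print({k: v.shape for k, v in example.items()})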
What can it be used for:
- Creating small- and large-scale datasets, supervised and unsupervised, ready for ingestion into geospatial AI model training (with standard and robust statistics precomputed).
- Creating inference maps, up to global scale at any resolution.
- Supporting any type of satellite imagery, remote sensing data, and label data available in Google Earth Engine.
- Arbitrary spatial and temporal resolution and sampling of data sources.
- Tooling for sampling and for inference map generation.
What is out of scope:
- Model training and inference. Pick your favorite framework; e.g. for Jax/Flax we use Jeo, and for PyTorch, check out TorchGeo.
- Interactive visualization and analysis of Google Earth Engine data. Check out, for example, the amazing geemap for Python-based analysis, or explore GEE's own JavaScript-based EE Code Editor.
- Datasets repository. Check out e.g. Hugging Face, TFDS Catalog, TorchGeo Datasets.
An example workflow from geospatial dataset generation, to model training and evaluation, to inference and global inference map creation:
For research and quick exploration, we want to keep GeeFlow lean and flexible, so that it is easy to use and to configure arbitrary data sources. We also care about reproducibility and versioning/bookkeeping of your data, as well as scalability and efficiency when generating the data.
To provide the needed flexibility for dataset configuration, datasets are specified via ML Collections ConfigDict config files in the geeflow/configs/ directory.
A config is usually split into two parts: a labels configuration and a sources configuration (see also the examples in geeflow/configs/).
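A config file typically exposes a get_config() function returning these parts; a hypothetical minimal skeleton (the actual structure of the configs in geeflow/configs/ may differ):

import ml_collections

def get_config(arg=None):
  # Hypothetical minimal skeleton of a GeeFlow dataset config.
  config = ml_collections.ConfigDict()
  config.labels = ml_collections.ConfigDict()   # Labels configuration.
  config.sources = ml_collections.ConfigDict()  # Sources configuration.
  return config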
Labels configuration:
- At the minimum, one has to provide a CSV (or parquet) file with the locations of the samples (`lat`, `lon` columns) and image sizes in meters (for UTM projected samples) or in degrees (for spherical CRS).
- If other columns (for example image-level labels or metadata) have to be included in the generated dataset samples, one can provide the list in the `meta_keys` field.
- Optionally, one can specify the default resolution per pixel (`default_scale`) for all sources (which will be overwritten if a source specifies its own scale), and the reference maximal pixel size in meters (`max_cell_size_m`) for proper gridding of multi-scale sources.
- If separate training splits are to be generated (e.g. `train`, `val`, `test`), either a column with the split name per sample should be included (in `meta_keys`), or splits will be generated randomly over geographically separated cells based on the selected S2 geometry scale level.
Example:
labels = ml_collections.ConfigDict()
labels.path = "data/demo_labels.csv"
labels.img_width_m = 240 # Image width and height of 240 meters on each side.
labels.max_cell_size_m = 30 # Reference maximal pixel size in meters.
labels.meta_keys = ("lat", "lon", "split")
labels.num_max_samples = 10 # Only for debugging, limiting the number of generated examples.
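A matching labels file would then contain at least the `lat`, `lon`, and `split` columns, e.g. (hypothetical values):

lat,lon,split
47.3769,8.5417,train
-1.2921,36.8219,val
40.7128,-74.0060,test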
Sources configuration: This part contains named sources that can be found in GEE.
- For every location specified in the labels configuration and for each specified source `s`, an image tensor is created with shape `(T_s, H_s, W_s, C_s)` (temporal size, height, width, number of channels), where `T_s` or `C_s` could be absent and any dimension could be 1. The dimensions can differ from source to source depending on the specified spatial resolution (scale), temporal sampling, and the selected channels (see the worked example after the config below).
- Usually one defines at least the source class (from geeflow/ee_data.py), where one can provide additional options (such as the data mode) via the `kw` field. If a source class is not defined in `ee_data`, one can always use a `CustomImage`, `CustomIC`, or `CustomFC` and set all values explicitly (like `asset_name`); a sketch follows the example below.
- Other fields include `scale` (resolution per pixel in meters), `select` (which bands to include, in the given order), `sampling_kw` (keyword arguments for how to aggregate multiple images within a time range), and others.
- Date ranges (`date_ranges`) is a list of time ranges over which to aggregate the data for each returned time sample. Each date range is specified by a tuple of the form (start date, number of months to aggregate over, number of days to aggregate over).
Example:
sources = ml_collections.ConfigDict()
sources.s2 = utils.get_source_config("Sentinel2", "ic")
sources.s2.kw.mode = "L2A"
sources.s2.scale = 10
sources.s2.select = ["B3", "B2", "B1"]
sources.s2.sampling_kw.reduce_fn = "median"
sources.s2.sampling_kw.cloud_mask_fn = ee_data.Sentinel2.im_cloud_score_plus_mask
sources.s2.date_ranges = [("2023-01-01", 12, 0), ("2024-01-01", 12, 0)] # 2 annual samples
sources.s1 = utils.get_source_config("Sentinel1", "ic")
sources.s1.kw = {"mode": "IW", "pols": ("VV", "VH"), "orbit": "both"}
sources.s1.scale = 10
sources.s1.sampling_kw.reduce_fn = "mean"
sources.s1.date_ranges = [("2023-01-01", 3, 0), ("2023-04-01", 3, 0)] # 2 seasonal samples
sources.elevation = utils.get_source_config("NasaDem", "im")
sources.elevation.scale = 30
sources.elevation.select = ("elevation", "slope", "aspect")
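If a dataset is not covered by the predefined source classes, a custom source can be configured explicitly. A hedged sketch (the exact field names are defined in geeflow/ee_data.py and may differ; the asset ID is just an example GEE image):

sources.dem = utils.get_source_config("CustomImage", "im")
sources.dem.kw.asset_name = "USGS/SRTMGL1_003"  # Any GEE image asset ID.
sources.dem.scale = 30
sources.dem.select = ("elevation",)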
Note that the generated data (in this demo example) will have different spatial, temporal, and spectral dimensions, and the temporal samples from Sentinel-2 and Sentinel-1 will cover different time ranges. This is done here only to emphasize the flexibility of the specifications.
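Concretely, with the 240 m labels config above, the Sentinel-2 source should yield tensors of shape (2, 24, 24, 3) (two annual samples, 240 m at 10 m per pixel, three bands), Sentinel-1 tensors of shape (2, 24, 24, 2), and elevation tensors of shape (8, 8, 3) at 30 m scale with no temporal dimension.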
Code map of core modules:
- geeflow/ee_data.py contains the defined GEE data sources. It also includes `CustomImage`, `CustomIC`, and `CustomFC` classes to define any image, image collection, and feature collection sources in the configs on-the-fly.
- geeflow/ee_algo.py contains sample processing algorithms that are selected for each source in the config (or might be preset).
- geeflow/pipelines.py contains the functions to construct labels and sources processing pipelines.
- geeflow/export_beam_tfds.py is the binary entry point to launch the Apache Beam pipeline for dataset generation and saving as a TFDS dataset. Per requested dataset split, the custom `TFDSBuilder` executes the `_generate_examples()` method to:
  - create input data locations (from `config.labels`) as a `pd.DataFrame`,
  - determine and adjust the main sources processing function `pipelines.get_roi_sample_fn()`, and execute it per label item to get all sources processed,
  - optionally filter the samples and apply additional custom transformations,
  - perform final processing for TFDS record construction (casting to data types, id key selection).
- geeflow/stats/compute_stats_beam.py is the binary entry point to compute statistics for each source band at scale using Apache Beam.
Internal conventions to keep in mind:
- Unless otherwise specified, latitude and longitude are referred to as `lat` and `lon`, respectively.
- By default, the WGS 84 coordinate system is used (as for GPS).
- For projected coordinates, local UTM coordinates are used.
Next to the primary and secondary dataset names, TFDS supports Semantic Versioning 2.0.0 with `{MAJOR}.{MINOR}.{PATCH}` version annotations. We suggest following the convention `{LABELS_VERSION}.{SOURCES_VERSION}.{OTHERS}`:
- The first number is incremented whenever a new (or modified) set of labels is used.
- The second number is incremented whenever the GEE data sources change.
- The last number can be used in other cases (e.g. when the processing of some sources changes internally but neither the labels nor the sources config is modified, or when the training splits are changed).
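For example, under this convention, starting from `demo:1.0.0`, adding another Sentinel-2 year to the sources config would give `1.1.0`, while switching to a new labels CSV would give `2.0.0`.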
For bookkeeping purposes, we also save the launch command with all adjusted config arguments. It is also recommended to keep a CHANGELOG of the dataset versions, either in separate docs or within the config files.
The first step is to check out GeeFlow and install the relevant Python dependencies in a virtual environment:
git clone https://github.com/google-deepmind/geeflow
# Go into the code directory for the examples below.
cd geeflow/geeflow
# Install and activate a virtual environment
python -m venv .venv
source .venv/bin/activate
# Install GeeFlow.
pip install -e ..
Ensure that you can authenticate and initialize Earth Engine with your Google Cloud Platform (GCP) project ID, following the GEE Python Installation guide. You should be able to execute the following in Python:
import ee
ee.Authenticate()
ee.Initialize(project="YOUR_GCP_PROJECT_ID")  # Update with your project ID.
print(ee.String("Hello from GeeFlow!").getInfo())
You can also use the `gcloud` utility (installation instructions) and set up the project via `gcloud config set project YOUR_GCP_PROJECT_ID`.
The project comes with a demo config script that can be used to verify everything is running as expected. It can be run as follows:
# Set output directory for your GeeFlow TFDS datasets.
export OUTDIR="/tmp/tfds"
# Set your Google Earth Engine project ID (also GCP project ID):
export EE_PROJECT="YOUR_GCP_PROJECT_ID"
python -m geeflow.export_beam_tfds \
--config_path=configs/public/demo.py:labels_path=data/demo_labels.csv,num=20 \
--tfds_name=demo:1.0.0 --output_dir=$OUTDIR --logtostderr --running_mode=direct \
--splits=train,val,test --ee_project=$EE_PROJECT --file_format="tfrecord" -- \
--direct_num_workers=4 --direct_running_mode=multi_threading
The last flags, after the `--` separator, are intended for Apache Beam configuration, here set to run locally in parallel. Please check the Apache Beam documentation for further options.
This should create the dataset in the output subdirectory `demo`, which should look like this:
/tmp/geeflow/tfds/demo
├── demo
│   └── 1.0.0
│       ├── dataset_info.json
│       ├── demo-test.tfrecord-00000-of-00001
│       ├── demo-train.tfrecord-00000-of-00001
│       ├── demo-val.tfrecord-00000-of-00001
│       └── features.json
└── extracted  # Temporary directory used during processing.
Statistics for this dataset can then be generated as follows:
python -m geeflow.stats.compute_stats_beam --tfds_name=demo \
--split=test,train,val --data_dir=$OUTDIR -- \
--direct_num_workers=4 --direct_running_mode=multi_threading --no_pipeline_type_check
This will create a JSON file with diverse statistics for each band in the input data, stored in the dataset's stats/ subdirectory. Example for $OUTDIR/demo/1.0.0/stats/train_s2_band_1.json:
{
"bins_iqr":316.0,
"bins_iqr_std":234.24759080800592,
"bins_mad":155.0,
"bins_mad_std":229.803,
"bins_mean":557.5357349537037,
"bins_median":420.5,
"bins_std":506.70404309973185,
"max":4252.0,
"mean":557.7291508427372,
"min":90.5,
"mode":249,
"n":13824,
"n_masked":0,
"sample_std":506.67234659545005,
"sample_var":256716.86680453984,
"std":506.6540204413037,
"sum":7710047.78125,
"sum2":7848715651.832647,
"total":7710047.78125,
"var":256698.296429337,
...
}
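These precomputed statistics can be used, for example, for robust input normalization at training time. A minimal sketch (assuming the stats path above; which statistics to use is up to the user):

import json
import numpy as np

# Load the precomputed band statistics (path from the example above).
with open("/tmp/tfds/demo/1.0.0/stats/train_s2_band_1.json") as f:
  stats = json.load(f)

def normalize(band_values: np.ndarray, stats: dict) -> np.ndarray:
  # Robust normalization: center by the median, scale by the IQR-based std estimate.
  return (band_values - stats["bins_median"]) / stats["bins_iqr_std"]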
To cite this repository:
@software{geeflow2025:github,
author = {Maxim Neumann and Anton Raichuk and Michelangelo Conserva and Keith Anderson},
title = {{GeeFlow}: Large scale datasets generation and processing for geospatial {AI} research},
url = {https://github.com/google-deepmind/geeflow},
year = {2025}
}
Copyright 2024 DeepMind Technologies Limited
This code is licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
This is not an official Google product.