GeeFlow

GeeFlow is a library for generating large-scale geospatial datasets using Google Earth Engine (GEE). It contains utils, configs, and pipeline launch scripts for dataset generation. The focus is on supporting geospatial AI research; it does not aim to be a production-ready utility.

The generated datasets conform to the TFDS format and can be used directly with TFDS tf.data.Dataset pipelines.

What can it be used for:

  • Creating small- and large-scale datasets, supervised and unsupervised, ready for ingestion into geospatial AI model training (with standard and robust statistics precomputed).
  • Creating inference maps, up to global scale at any resolution.
  • Supporting any type of geospatial satellite and remote sensing data, as well as label data, from Google Earth Engine.
  • Arbitrary spatial and temporal resolution and sampling of data sources.
  • Tooling for sampling and for inference map generation.

What is out of scope:

An example workflow, from geospatial dataset generation, to model training and evaluation, to inference and the creation of global inference maps:

Howto

For research and quick exploration, we want to keep GeeFlow lean and flexible, so that it is easy to use and to configure arbitrary data sources. We also care about reproducibility and versioning/bookkeeping of your data, as well as about scalability and efficiency when generating it.

Configuration

To provide the needed flexibility for dataset configuration, datasets are specified via ML Collections ConfigDict config files in the geeflow/configs/ directory.

A dataset config is usually split into two parts: a labels configuration and a sources configuration (see also the examples in geeflow/configs/).

Labels configuration:

  • At a minimum, one has to provide a CSV (or Parquet) file with the locations of the samples (lat, lon columns), and the image sizes in meters (for UTM-projected samples) or in degrees (for a spherical CRS).
  • If other columns (for example image-level labels or metadata) are to be included in the generated dataset samples, list them in the meta_keys field.
  • Optionally, one can specify the default resolution per pixel (default_scale) for all sources (overridden when a source specifies its own scale), and the reference maximal pixel size in meters (max_cell_size_m) for proper gridding of multi-scale sources.
  • If separate training splits are to be generated (e.g. train, val, test), either include a column with the split name per sample (listed in meta_keys), or the splits will be generated randomly from geographically separated cells, based on the selected S2 geometry scale level.

Example:

import ml_collections

labels = ml_collections.ConfigDict()
labels.path = "data/demo_labels.csv"
labels.img_width_m = 240  # Image width and height of 240 meters on each side.
labels.max_cell_size_m = 30  # Reference maximal pixel size in meters.
labels.meta_keys = ("lat", "lon", "split")
labels.num_max_samples = 10  # Only for debugging, limiting the number of generated examples.
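
For reference, a minimal labels file matching this config could be created as follows (a sketch; the coordinates are made-up placeholders):

import pandas as pd

# One row per sample location; the split column is included because it is
# listed in meta_keys above.
df = pd.DataFrame({
    "lat": [-3.4653, 46.8182, 35.6895],
    "lon": [-62.2159, 8.2275, 139.6917],
    "split": ["train", "val", "test"],
})
df.to_csv("data/demo_labels.csv", index=False)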

Sources configuration: this part contains named sources that can be found in GEE.

  • For every location specified in the labels configuration, an image tensor of shape (T_s, H_s, W_s, C_s) (temporal size, height, width, number of channels) is created for each specified source s; T_s or C_s may be absent, and any dimension may be 1. The dimensions can differ from source to source, depending on the specified spatial resolution (scale), the temporal sampling, and the selected channels.
  • Usually one defines at least the source class (from geeflow/ee_data.py), and can provide additional options (such as the data mode) via the kw field. If a source class is not defined in ee_data, one can always use CustomImage, CustomIC, or CustomFC and set all values explicitly (like asset_name).
  • Other fields include scale (resolution per pixel in meters), select (which bands to include, in the given order), sampling_kw (keyword arguments for how to aggregate multiple images within a time range), and others.
  • date_ranges is a list of time ranges over which to aggregate the data, one per returned time sample. Each date range is a tuple of the form (start date, number of months to aggregate over, number of days to aggregate over); for example, ("2023-01-01", 12, 0) aggregates over all of 2023.

Example:

import ml_collections

from geeflow import ee_data
from geeflow import utils

sources = ml_collections.ConfigDict()

sources.s2 = utils.get_source_config("Sentinel2", "ic")
sources.s2.kw.mode = "L2A"
sources.s2.scale = 10
sources.s2.select = ["B3", "B2", "B1"]
sources.s2.sampling_kw.reduce_fn = "median"
sources.s2.sampling_kw.cloud_mask_fn = ee_data.Sentinel2.im_cloud_score_plus_mask
sources.s2.date_ranges = [("2023-01-01", 12, 0), ("2024-01-01", 12, 0)]  # 2 annual samples

sources.s1 = utils.get_source_config("Sentinel1", "ic")
sources.s1.kw = {"mode": "IW", "pols": ("VV", "VH"), "orbit": "both"}
sources.s1.scale = 10
sources.s1.sampling_kw.reduce_fn = "mean"
sources.s1.date_ranges = [("2023-01-01", 3, 0), ("2023-04-01", 3, 0)]  # 2 seasonal samples

sources.elevation = utils.get_source_config("NasaDem", "im")
sources.elevation.scale = 30
sources.elevation.select = ("elevation", "slope", "aspect")

Note that the generated data (in this demo example) will have different spatial, temporal, and spectral dimensions, and that the temporal samples from Sentinel-2 and Sentinel-1 will cover different time ranges. This is done here only to emphasize the flexibility of the specification.
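
As a back-of-the-envelope check (not pipeline output), the per-sample shapes implied by the demo config above would be roughly:

# Image extent of 240 m per side (labels.img_width_m), shapes in (T, H, W, C).
img_width_m = 240
s2_shape = (2, img_width_m // 10, img_width_m // 10, 3)      # (2, 24, 24, 3): 2 annual samples, 10 m/px, 3 bands
s1_shape = (2, img_width_m // 10, img_width_m // 10, 2)      # (2, 24, 24, 2): 2 seasonal samples, 10 m/px, VV+VH
elevation_shape = (img_width_m // 30, img_width_m // 30, 3)  # (8, 8, 3): static image, 30 m/px, 3 bands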

Internal workflows

Code map of core modules:

  • geeflow/ee_data.py contains defined GEE data sources. It also includes CustomImage, CustomIC and CustomFC classes to define any image, image collection, and feature collection sources in the configs on-the-fly.
  • geeflow/ee_algo.py contains sample processing algorithms that are selected for each source in the config (or may be preset).
  • geeflow/pipelines.py contains the functions to construct labels and sources processing pipelines.
  • geeflow/export_beam_tfds.py is the binary entry point to launch the Apache Beam pipeline for dataset generation and saving as a TFDS dataset. Per requested dataset split, the custom TFDSBuilder executes the _generate_examples() method to (see the sketch after this list):
    • create the input data locations (from config.labels) as a pd.DataFrame,
    • build the main sources processing function pipelines.get_roi_sample_fn() and execute it per label item to process all sources,
    • optionally filter the samples and apply additional custom transformations,
    • perform the final processing for TFDS record construction (casting to data types, id key selection).
  • geeflow/stats/compute_stats_beam.py is the binary entry point to compute statistics for each source band at scale using Apache Beam.
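
Schematically, the per-split generation loop looks roughly like this (a simplified sketch, not the actual implementation; load_labels, keep, make_key, and to_tfds_record are hypothetical stand-ins):

def _generate_examples(config, split):
    labels_df = load_labels(config.labels, split)    # locations as a pd.DataFrame
    sample_fn = pipelines.get_roi_sample_fn(config)  # main sources processing function
    for _, row in labels_df.iterrows():
        sample = sample_fn(row)                      # process all sources for this label
        if keep(sample):                             # optional filtering/transformations
            yield make_key(row), to_tfds_record(sample)  # cast dtypes, select id key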

Internal conventions to keep in mind:

  • Unless otherwise specified, latitude and longitude are referred to as lat and lon, respectively.
  • By default, the WGS 84 coordinate system is used (as for GPS).
  • For projected coordinates, local UTM coordinates are used (see the example below).
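
To illustrate these conventions, here is how a WGS 84 lat/lon pair maps to local UTM coordinates (using pyproj, which is not a GeeFlow dependency; the coordinates are placeholders):

import math
from pyproj import Transformer

lat, lon = 46.8182, 8.2275                     # WGS 84 (EPSG:4326), as used for GPS.
zone = math.floor((lon + 180) / 6) + 1         # Standard UTM zone from longitude.
epsg = (32600 if lat >= 0 else 32700) + zone   # Northern vs. southern hemisphere.
to_utm = Transformer.from_crs("EPSG:4326", f"EPSG:{epsg}", always_xy=True)
easting, northing = to_utm.transform(lon, lat)  # always_xy: (x=lon, y=lat) order.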

Versioning and bookkeeping

Next to the primary and secondary dataset names, TFDS supports Semantic Versioning 2.0.0 with {MAJOR}.{MINOR}.{PATCH} version annotations. We suggest following the convention {LABELS_VERSION}.{SOURCES_VERSION}.{OTHERS}:

  1. The first number is incremented whenever a new (or modified) set of labels is used.
  2. The second number is incremented whenever the GEE data sources are changed.
  3. The last number can be used in other cases (e.g. when the processing of some sources changes internally but neither the labels nor the sources config is modified, when the training splits are changed, etc.).

For bookkeeping purposes we also save the launch command with all adjusted config arguments. It is also recommended to keep a CHANGELOG of the dataset versions, either in separate docs or within the config files.

Installation and usage

The first step is to check out GeeFlow and install the relevant Python dependencies in a virtual environment:

git clone https://github.com/google-deepmind/geeflow
# Go into the code directory for the examples below.
cd geeflow/geeflow
# Create and activate a virtual environment.
python -m venv .venv
source .venv/bin/activate
# Install GeeFlow.
pip install -e ..

Ensure that you can authenticate and initialize Earth Engine with your Google Cloud Platform (GCP) project ID, following the GEE Python installation instructions. You should be able to execute the following in Python:

import ee
ee.Authenticate()
ee.Initialize(project="YOUR_GCP_PROJECT_ID")  # Update with your project ID.
print(ee.String("Hello from GeeFlow!").getInfo())

You can also use the gcloud utility (installation instructions) and set the project via gcloud config set project YOUR_GCP_PROJECT_ID.

The project comes with a demo config script that can be used to verify everything is running as expected. It can be run as follows:

# Set output directory for your GeeFlow TFDS datasets.
export OUTDIR="/tmp/tfds"
# Set your Google Earth Engine project ID (also the GCP project ID):
export EE_PROJECT="YOUR_GCP_PROJECT_ID"

python -m geeflow.export_beam_tfds \
--config_path=configs/public/demo.py:labels_path=data/demo_labels.csv,num=20 \
--tfds_name=demo:1.0.0 --output_dir=$OUTDIR --logtostderr --running_mode=direct \
--splits=train,val,test --ee_project=$EE_PROJECT --file_format="tfrecord" -- \
--direct_num_workers=4 --direct_running_mode=multi_threading

The flags after the -- separator are passed to Apache Beam to run locally in parallel. Please check the Apache Beam documentation for further options.

This should create the dataset in the demo subdirectory of $OUTDIR, which should look like this:

/tmp/tfds
├── demo
│   └── 1.0.0
│       ├── dataset_info.json
│       ├── demo-test.tfrecord-00000-of-00001
│       ├── demo-train.tfrecord-00000-of-00001
│       ├── demo-val.tfrecord-00000-of-00001
│       └── features.json
└── extracted  # Temporary directory used during processing.
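
To verify the result, the generated dataset can be loaded directly with TFDS, e.g. (a minimal sketch):

import tensorflow_datasets as tfds

# Load the demo dataset generated above from $OUTDIR.
builder = tfds.builder("demo:1.0.0", data_dir="/tmp/tfds")
ds = builder.as_dataset(split="train")
for example in ds.take(1):
    print({k: v.shape for k, v in example.items()})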

Statistics for this dataset can then be generated as follows:

python -m geeflow.stats.compute_stats_beam --tfds_name=demo \
--split=test,train,val --data_dir=$OUTDIR -- \
--direct_num_workers=4 --direct_running_mode=multi_threading --no_pipeline_type_check

This will create a JSON file with diverse statistics for each band in the input data, stored in the dataset's stats/ subdirectory.

Example for $OUTDIR/demo/1.0.0/stats/train_s2_band_1.json:

{
  "bins_iqr":316.0,
  "bins_iqr_std":234.24759080800592,
  "bins_mad":155.0,
  "bins_mad_std":229.803,
  "bins_mean":557.5357349537037,
  "bins_median":420.5,
  "bins_std":506.70404309973185,
  "max":4252.0,
  "mean":557.7291508427372,
  "min":90.5,
  "mode":249,
  "n":13824,
  "n_masked":0,
  "sample_std":506.67234659545005,
  "sample_var":256716.86680453984,
  "std":506.6540204413037,
  "sum":7710047.78125,
  "sum2":7848715651.832647,
  "total":7710047.78125,
  "var":256698.296429337,
  ...
}
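
These per-band statistics can be consumed at training time, for example for robust input normalization (a sketch; the choice of median/MAD here is illustrative):

import json

# Load the precomputed statistics for one band.
with open("/tmp/tfds/demo/1.0.0/stats/train_s2_band_1.json") as f:
    stats = json.load(f)

def normalize(x):
    # Center on the median and scale by the MAD-derived std (robust to outliers).
    return (x - stats["bins_median"]) / stats["bins_mad_std"]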

Citing GeeFlow

To cite this repository:

@software{geeflow2025:github,
  author = {Maxim Neumann and Anton Raichuk and Michelangelo Conserva and Keith Anderson},
  title = {{GeeFlow}: Large scale datasets generation and processing for geospatial {AI} research},
  url = {https://github.com/google-deepmind/geeflow},
  year = {2025}
}

License

Copyright 2024 DeepMind Technologies Limited

This code is licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Disclaimer

This is not an official Google product.
