Download data

Once labels have been processed, the next step in the TerraKit pipeline is downloading the data.

Use the download_data function (or the download CLI subcommand) to download data from a set of data connectors for a time and location specified by the shapefiles output from the process_labels pipeline step.

Here's an example of how to use the download_data step in the TerraKit pipeline:

config = {
    "download": {
        "data_sources": [
            {
                "data_connector": "sentinel_aws",
                "collection_name": "sentinel-2-l2a",
                "bands": ["blue", "green", "red"],
                "save_file": "",
            },
        ],
        "date_allowance": {"pre_days": 0, "post_days": 21},
        "transform": {
            "scale_data_xarray": True,
            "impute_nans": True,
            "reproject": True,
        },
        "max_cloud_cover": 80,
    },
}

queried_data = download_data(
    data_sources=config["download"]["data_sources"],
    date_allowance=config["download"]["date_allowance"],
    transform=config["download"]["transform"],
    max_cloud_cover=config["download"]["max_cloud_cover"],
    dataset_name=DATASET_NAME,
    working_dir=WORKING_DIR,
    keep_files=False,
)

Write the same arguments in a config file that the TerraKit CLI can use:

# ./docs/examples/config.yaml
download:
  data_sources:
  - data_connector: "sentinel_aws"
    collection_name: "sentinel-2-l2a"
    bands: ["blue", "green", "red"]
  date_allowance:
    pre_days: 0
    post_days: 21
  transform:
    scale_data_xarray: true
    impute_nans: true
    reproject: true

Then run the download step with the CLI:

terrakit --config ./docs/examples/config.yaml download

Alternatively, use the TerraKit data_connectors directly by specifying the collection, bbox, date and bands of interest.

from terrakit import DataConnector

dc = DataConnector(connector_type="sentinel_aws")
dc.connector.list_collections()
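
Continuing the snippet above, here is a hedged sketch of a full search-and-download flow. find_data and get_data are the mandatory connector methods (see Data Connectors below), but the parameter names and value formats used here are illustrative assumptions, not the confirmed API:

# Sketch only: the parameter names and formats below are assumptions.
items = dc.connector.find_data(
    collection_name="sentinel-2-l2a",
    bbox=[-0.2, 51.4, 0.1, 51.6],  # min_lon, min_lat, max_lon, max_lat
    date="2023-06-01/2023-06-22",
)
data = dc.connector.get_data(items, bands=["blue", "green", "red"])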

Configure the Download pipeline

Use the following parameters to configure the TerraKit Download pipeline.

Active

active: Enables the download pipeline step to run. Set to False to skip the step. Default: True

Date Allowance: date_allowance

Date range allowance for the data query. pre_days and post_days widen the query window by that many days before and after each label's datetime; for example, pre_days: 0 with post_days: 21 queries from the label date up to 21 days after it.

Transform: transform

Transformation parameters applied to the downloaded data, such as the scale_data_xarray, impute_nans and reproject flags shown in the examples above.

Data Sources: data_sources

List of data sources to query. Each entry in the list should be a valid DataSource object which specifies the data_connector, collection_name and bands to download. Optionally, use save_file to set a unique filename under which the downloaded files are saved. If save_file is not specified, the data will be saved as {working_dir}/{data_connector}_{collection_name}.tif.

# Example of a valid DataSource dictionary.
download_data(
    data_sources = [{
        "data_connector": "sentinel_aws",
        "collection_name": "sentinel-2-l2a",
        "bands": ["blue", "green", "red"],
    }]
)
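
To direct the output of a source to a specific file, set save_file; the filename below is hypothetical:

# DataSource with an explicit (hypothetical) save_file.
download_data(
    data_sources = [{
        "data_connector": "sentinel_aws",
        "collection_name": "sentinel-2-l2a",
        "bands": ["blue", "green", "red"],
        "save_file": "s2_rgb_scene.tif",  # hypothetical filename
    }]
)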

Specify multiple data sources as follows:

# Example of multiple valid DataSource dictionaries passed as a list to the `data_sources` argument.
download_data(
    data_sources = [{
        "data_connector": "sentinel_aws",
        "collection_name": "sentinel-2-l2a",
        "bands": ["blue", "green", "red"],
    },
    {
        "data_connector": "sentinelhub",
        "collection_name": "s1_grd",
        "bands": ["B04", "B03", "B02"]
    }]
)
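
With this configuration and no save_file set, the two sources are saved under the default naming described above, i.e. {working_dir}/sentinel_aws_sentinel-2-l2a.tif and {working_dir}/sentinelhub_s1_grd.tif.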

To specify multiple data sources with the CLI, use the following config:

# ./docs/examples/config.yaml
download:
  data_sources:
  - data_connector: "sentinel_aws"
    collection_name: "sentinel-2-l2a"
    bands: ["blue", "green", "red"]
  - data_connector: "sentinelhub"
    collection_name: "s1_grd"
    bands: ["B04", "B03", "B02"]
  date_allowance:
    pre_days: 0
    post_days: 21
  transform:
    scale_data_xarray: true
    impute_nans: true
    reproject: true

Max Cloud Cover: max_cloud_cover

Maximum cloud cover percentage for data selection.

Datetime Bounding Box Shape File: datetime_bbox_shp_file

Path to a shapefile containing datetime and bounding box information. This shapefile will have been saved as {working_dir}/{dataset_name}_all_bboxes.shp if the process_labels step has already been run. If datetime_bbox_shp_file is not explicitly specified, TerraKit first checks for the default value (./tmp/terrakit_curated_dataset_all_bboxes.shp), then checks the working directory for {dataset_name}_all_bboxes.shp.

The shapefile {dataset_name}_all_bboxes.shp must contain a datetime field and geometry field.

Labels Shape File: labels_shp_file

Path to a shapefile containing datetime and label geometry information. This shapefile will have been saved as {working_dir}/{dataset_name}_labels.shp if the process_labels step has already been run. If labels_shp_file is not explicitly specified, TerraKit first checks for the default value (./tmp/terrakit_curated_dataset_labels.shp), then checks the working directory for {dataset_name}_labels.shp.

The shapefile {dataset_name}_labels.shp must contain a datetime field and geometry field.
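
As a quick sanity check that either shapefile has the required fields, you can inspect it with geopandas (assumed to be installed; the path below is the default bbox shapefile mentioned above):

# Check that the shapefile exposes datetime and geometry fields.
import geopandas as gpd

gdf = gpd.read_file("./tmp/terrakit_curated_dataset_all_bboxes.shp")
assert "datetime" in gdf.columns, "shapefile is missing a datetime field"
assert gdf.geometry.notna().all(), "shapefile has empty geometries"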

Keep files: keep_files

Flag to preserve the shapefiles in the working directory after the download step has used them. Downloaded files are never removed. Set to True to keep the shapefiles in place.

Data Connectors

Data connectors are classes that let a user search for and query data from a particular data source through a common set of functions. Each data connector has the following mandatory methods, sketched below:

  • list_collections()
  • find_data()
  • get_data()
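
For orientation, here is a minimal sketch of that common interface; the method names are the mandatory ones above, while the signatures, docstrings and return values are assumptions:

# Hedged sketch of the shared connector interface (signatures assumed).
from abc import ABC, abstractmethod

class BaseDataConnector(ABC):
    @abstractmethod
    def list_collections(self):
        """Return the names of the collections this connector serves."""

    @abstractmethod
    def find_data(self, **query):
        """Search the source for items matching the query."""

    @abstractmethod
    def get_data(self, items, **options):
        """Download the matched items."""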

Available data connectors

The following data connectors and associated collections are available:

Connector          Collections
sentinelhub        s2_l1c, dem, s1_grd, hls_l30, s2_l2a, hls_s30
nasa_earthdata     HLSL30_2.0, HLSS30_2.0
sentinel_aws       sentinel-2-l2a
IBMResearchSTAC    HLSS30, esa-sentinel-2A-msil1c, HLS_S30, atmospheric-weather-era5, deforestation-umd, Radar-10min, tasmax-rcp85-land-cpm-uk-2.2km, vector-osm-power, ukcp18-land-cpm-uk-2.2km, treecovermaps-eudr, ch4, and more
TheWeatherCompany  weathercompany-daily-forecast

Data connector access

Each data connector has different access requirements. For example, to connect to SentinelHub or NASA EarthData you will need to obtain credentials from each provider. Once these have been obtained, they can be added to a .env file at the root directory level using the following syntax:

SH_CLIENT_ID="<SentinelHub Client ID>"
SH_CLIENT_SECRET="<SentinelHub Client Secret>"
NASA_EARTH_BEARER_TOKEN="<NASA EarthData Bearer Token>"
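
TerraKit is expected to read these from the environment; if you need to load the .env file yourself (for example in a notebook), the python-dotenv package can do it, assuming it is installed:

# Load credentials from ./.env into the process environment.
from dotenv import load_dotenv

load_dotenv()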

NASA Earthdata

To access NASA Earthdata, register for an Earthdata Login profile and request a bearer token at https://urs.earthdata.nasa.gov/profile

Sentinel Hub

To access Sentinel Hub, register for an account and request an OAuth client using the Sentinel Hub dashboard: https://www.planet.com

Sentinel AWS

Access to Sentinel AWS data is open and does not require any credentials.

The Weather Company

To access The Weather Company, register for an account and request an API key at https://www.weathercompany.com/weather-data-apis/. Once you have an API key, set the following environment variable:

THE_WEATHER_COMPANY_API_KEY="<The Weather Company API key>"

IBM Research STAC

Access to IBM Research STAC is currently restricted to IBMers and partners. If you're eligible, register for an IBM AppID account and set the following environment variables:

APPID_ISSUER=<issuer>
APPID_USERNAME=<user-email>
APPID_PASSWORD=<user-password>
CLIENT_ID=<client-id>
CLIENT_SECRET=<client-secret>

Please reach out to the maintainers of this repo.

IBMers don't need credentials to access the internal instance of the STAC service.

Try out

Data Connectors can be used outside the TerraKit Pipeline. Take a look at the TerraKit: Easy geospatial data search and query notebook for more help getting started with TerraKit Data Connectors.