Download data

Once labels have been processed, the next step in the TerraKit pipeline is to download the data.

Use the download_data function (or the download CLI subcommand) to download data from a set of data connectors. The time and location to query are taken from the shapefiles produced by the process_labels pipeline step.

Here's an example of how to use the download_data step in the TerraKit pipeline:

config = {
    "download": {
        "data_sources": [
            {
                "data_connector": "sentinel_aws",
                "collection_name": "sentinel-2-l2a",
                "bands": ["blue", "green", "red"],
                "save_file": "",
            },
        ],
        "date_allowance": {"pre_days": 0, "post_days": 21},
        "transform": {
            "scale_data_xarray": True,
            "impute_nans": True,
            "reproject": True,
        },
        "max_cloud_cover": 80,
    },
}

queried_data = download_data(
    data_sources=config["download"]["data_sources"],
    date_allowance=config["download"]["date_allowance"],
    transform=config["download"]["transform"],
    max_cloud_cover=config["download"]["max_cloud_cover"],
    dataset_name=DATASET_NAME,
    working_dir=WORKING_DIR,
    keep_files=False,
)

Write the same arguments in a config file that the TerraKit CLI can use:

# ./docs/examples/config.yaml
download:
  data_sources:
  - data_connector: "sentinel_aws"
    collection_name: "sentinel-2-l2a"
    bands: ["blue", "green", "red"]
  date_allowance:
    pre_days: 0
    post_days: 21
  transform:
    scale_data_xarray: true
    impute_nans: true
    reproject: true

#!/bin/bash
terrakit --config ./docs/examples/config.yaml download

Alternatively, use the TerraKit data_connectors directly by specifying the collection, bbox, date and bands of interest.

from terrakit import DataConnector

dc = DataConnector(connector_type="sentinel_aws")
dc.connector.list_collections()

Configure the Download pipeline

Use the following parameters to configure the TerraKit Download pipeline.

Active

active: Enables the download step to run. Set to False to skip the step. Default: True

Date Allowance: `date_allowance`

Date range allowance for the data query, given as pre_days and post_days (for example, {"pre_days": 0, "post_days": 21}).
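As a rough illustration of what such an allowance means, the sketch below turns a label date plus pre_days and post_days into a search window. It assumes the window runs from pre_days before the label date to post_days after it, both inclusive; search_window is a hypothetical helper written for this page, not a TerraKit function.

```python
from datetime import date, timedelta


def search_window(label_date: date, pre_days: int, post_days: int) -> tuple[date, date]:
    """Return the (start, end) dates to search around a label date.

    Assumption: the allowance extends pre_days before and post_days
    after the label date. Illustration only, not TerraKit's own code.
    """
    if pre_days < 0 or post_days < 0:
        raise ValueError("pre_days and post_days must be non-negative")
    start = label_date - timedelta(days=pre_days)
    end = label_date + timedelta(days=post_days)
    return start, end


# Same allowance as the example config: no days before, 21 days after.
start, end = search_window(date(2024, 1, 10), pre_days=0, post_days=21)
print(start, end)  # 2024-01-10 2024-01-31
```

Under this reading, a wider post_days gives the query more chances to find a usable acquisition after the label date.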

Transform: `transform`

Transformation parameters applied to the downloaded data. The example above sets scale_data_xarray, impute_nans and reproject.

Data Sources: `data_sources`

List of data sources to query. Each item in the list should be a valid DataSource object which specifies the data_connector, collection_name and bands to download. Optionally, use save_file to specify a unique filename for the downloaded files. If save_file is not specified, the data will be saved as {working_dir}/{data_connector}_{collection_name}.tif.

# Example of a valid DataSource dictionary.
download_data(
    data_sources = [{
        "data_connector": "sentinel_aws",
        "collection_name": "sentinel-2-l2a",
        "bands": ["blue", "green", "red"],
    }]
)
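The default naming rule can be sketched in a few lines. This is an illustration under stated assumptions, not TerraKit's implementation: output_path is a hypothetical helper, an empty or missing save_file is assumed to mean "use the default name", and a provided save_file is returned exactly as given.

```python
from pathlib import Path


def output_path(source: dict, working_dir: str) -> Path:
    """Pick the file a data source would be saved to.

    Assumption: an empty or missing save_file falls back to
    {working_dir}/{data_connector}_{collection_name}.tif.
    """
    save_file = source.get("save_file")
    if save_file:
        return Path(save_file)
    name = f"{source['data_connector']}_{source['collection_name']}.tif"
    return Path(working_dir) / name


source = {
    "data_connector": "sentinel_aws",
    "collection_name": "sentinel-2-l2a",
    "bands": ["blue", "green", "red"],
}
print(output_path(source, "./tmp"))
print(output_path({**source, "save_file": "custom_name.tif"}, "./tmp"))
```

Because the default name depends only on the connector and collection, two entries that share both would map to the same file, which is one reason to set save_file explicitly.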

Specify multiple data sources as follows:

# Example of multiple valid DataSource dictionaries passed as a list to the `data_sources` argument.
download_data(
    data_sources = [{
        "data_connector": "sentinel_aws",
        "collection_name": "sentinel-2-l2a",
        "bands": ["blue", "green", "red"],
    },
    {
        "data_connector": "sentinelhub",
        "collection_name": "s1_grd",
        "bands": ["B04", "B03", "B02"]
    }]
)

To specify multiple data sources with the CLI, use the following config:

# ./docs/examples/config.yaml
download:
  data_sources:
  - data_connector: "sentinel_aws"
    collection_name: "sentinel-2-l2a"
    bands: ["blue", "green", "red"]
  - data_connector: "sentinelhub"
    collection_name: "s1_grd"
    bands: ["B04", "B03", "B02"]
  date_allowance:
    pre_days: 0
    post_days: 21
  transform:
    scale_data_xarray: true
    impute_nans: true
    reproject: true

Max Cloud Cover: `max_cloud_cover`

Maximum cloud cover percentage for data selection.
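To make the threshold concrete, here is a minimal sketch of filtering candidate scenes by cloud cover. The scene dictionaries, the cloud_cover key and the filter_by_cloud_cover helper are all hypothetical, and treating the limit as inclusive is an assumption; the connectors may apply the limit differently, for instance inside the catalogue query itself.

```python
def filter_by_cloud_cover(scenes: list[dict], max_cloud_cover: float) -> list[dict]:
    """Keep scenes whose cloud cover percentage does not exceed the limit.

    Assumption: the limit is inclusive and expressed as a percentage (0-100).
    """
    if not 0 <= max_cloud_cover <= 100:
        raise ValueError("max_cloud_cover must be between 0 and 100")
    return [scene for scene in scenes if scene["cloud_cover"] <= max_cloud_cover]


# Hypothetical search results, for illustration only.
scenes = [
    {"id": "scene_a", "cloud_cover": 12.5},
    {"id": "scene_b", "cloud_cover": 80.0},
    {"id": "scene_c", "cloud_cover": 93.1},
]
kept = filter_by_cloud_cover(scenes, max_cloud_cover=80)
print([scene["id"] for scene in kept])  # ['scene_a', 'scene_b']
```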

Datetime Bounding Box Shape File: `datetime_bbox_shp_file`

Path to a shapefile containing datetime and bounding box information. This shapefile will have been saved as {working_dir}/{dataset_name}_all_bboxes.shp if the process_labels step has already been run. If datetime_bbox_shp_file is not explicitly specified, TerraKit will first check for the default value (./tmp/terrakit_curated_dataset_all_bboxes.shp), followed by checking the working directory for {dataset_name}_all_bboxes.shp.

The shapefile {dataset_name}_all_bboxes.shp must contain a datetime field and geometry field.
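The lookup order described above can be sketched as follows. This is an illustration of the documented order only: resolve_bbox_shapefile and its default parameter are hypothetical names, and raising FileNotFoundError when nothing is found is an assumption about the failure mode.

```python
import tempfile
from pathlib import Path
from typing import Optional

DEFAULT_BBOX_SHP = Path("./tmp/terrakit_curated_dataset_all_bboxes.shp")


def resolve_bbox_shapefile(
    working_dir: str,
    dataset_name: str,
    datetime_bbox_shp_file: Optional[str] = None,
    default: Path = DEFAULT_BBOX_SHP,
) -> Path:
    """Resolve the bounding box shapefile: explicit path, default, working dir."""
    if datetime_bbox_shp_file is not None:
        return Path(datetime_bbox_shp_file)
    candidates = [default, Path(working_dir) / f"{dataset_name}_all_bboxes.shp"]
    for candidate in candidates:
        if candidate.exists():
            return candidate
    tried = ", ".join(str(c) for c in candidates)
    raise FileNotFoundError(f"No bounding box shapefile found; tried: {tried}")


# Demo: only the working-directory file exists, so it is the one selected.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "my_dataset_all_bboxes.shp").touch()
    found = resolve_bbox_shapefile(
        working_dir=tmp,
        dataset_name="my_dataset",
        default=Path(tmp) / "missing_default.shp",
    )
print(found.name)  # my_dataset_all_bboxes.shp
```

The same pattern would apply to the labels shapefile, with _labels.shp in place of _all_bboxes.shp.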

Labels Shape File: `labels_shp_file`

Path to a shapefile containing datetime and label geometry information. This shapefile will have been saved as {working_dir}/{dataset_name}_labels.shp if the process_labels step has already been run. If labels_shp_file is not explicitly specified, TerraKit will first check for the default value (./tmp/terrakit_curated_dataset_labels.shp), followed by checking the working directory for {dataset_name}_labels.shp.

The shapefile {dataset_name}_labels.shp must contain a datetime field and geometry field.

Keep files: `keep_files`

Flag to preserve shapefiles in the working directory once they have been used by the download data step. Set to True to ensure shapefiles remain in place. Downloaded files are not removed, whatever the value of this flag.
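A shapefile is stored as several files sharing one stem (.shp, .shx, .dbf and often .prj and .cpg), so cleaning one up means removing the whole set. The sketch below shows one way such a cleanup could behave; cleanup_shapefile is a hypothetical helper, and the exact files TerraKit removes are an assumption.

```python
import tempfile
from pathlib import Path

# Common components of a shapefile that share the same file stem.
SHAPEFILE_SUFFIXES = (".shp", ".shx", ".dbf", ".prj", ".cpg")


def cleanup_shapefile(shp_path: Path, keep_files: bool) -> list[Path]:
    """Delete a shapefile and its sidecar files unless keep_files is True.

    Returns the paths that were removed.
    """
    if keep_files:
        return []
    removed = []
    for suffix in SHAPEFILE_SUFFIXES:
        part = shp_path.with_suffix(suffix)
        if part.exists():
            part.unlink()
            removed.append(part)
    return removed


# Demo: the shapefile parts are removed, the downloaded .tif is left alone.
with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    for name in ("demo_labels.shp", "demo_labels.shx", "demo_labels.dbf", "image.tif"):
        (work / name).touch()
    removed = cleanup_shapefile(work / "demo_labels.shp", keep_files=False)
    removed_names = sorted(p.name for p in removed)
    remaining = sorted(p.name for p in work.iterdir())

print(removed_names)  # ['demo_labels.dbf', 'demo_labels.shp', 'demo_labels.shx']
print(remaining)      # ['image.tif']
```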

Data Connectors

Data connectors are classes which enable a user to search for data and query data from a particular data source using a common set of functions. Check out the TerraKit Data Connectors section for more information.

Try out

Try out the TerraKit data pipeline workflow in the "Terrakit: Labels to dataset pipeline" notebook, which also offers more help getting started with TerraKit Data Connectors.
