TerraKit: Labels to dataset pipeline¶
This notebook demonstrates generating an ML-ready dataset from a collection of labels. The labels in this example are GeoJSON files from two wildfire events reported by the Copernicus Rapid Mapping Service. The two wildfire events are:
- Wildfire in Central Madeira, Portugal, 2024/08/26
- Wildfire in Biebrza National Park, Poland, 2025/04/23
import os
from glob import glob
from pathlib import Path
# Set LOGLEVEL to "INFO" for more verbose logging; "WARNING" keeps output quiet
os.environ["LOGLEVEL"] = "WARNING"
import terrakit
# Import TerraKit Utils
from terrakit.general_utils.labels_downloader import (
rapid_mapping_geojson_downloader,
hugging_face_file_downloader,
EXAMPLE_LABEL_FILES,
EXAMPLE_RASTER_LABEL_FILES,
)
from terrakit.general_utils.plotting import (
plot_label_dataframes,
plot_labels_on_map,
plot_tiles_and_label_pair,
plot_chip_and_label_pairs,
)
from terrakit.download.geodata_utils import check_and_crop_bbox
1. Process labels¶
This initial step takes a directory containing label files (GeoJSON) and calculates a list of bboxes that encompass the geospatial locations described by those files. Assuming the date is included in each label filename, temporal information is captured as well.
The function process_geojson_labels returns a GeoPandas GeoDataFrame where each row corresponds to the bbox for a given area. The function also saves this information to a shapefile, written to the working directory (default = "./tmp").
Either the shapefile or the GeoDataFrame can then be used in the next step to tell TerraKit which times and locations to download data for.
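Conceptually, the bbox for a group of labels is just the min/max over all of their coordinates. A minimal, library-free sketch of that idea (the polygon coordinates below are made up for illustration):

```python
def bbox_of_coords(coords):
    """Return (minx, miny, maxx, maxy) for a list of (x, y) points."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))

# Two toy label polygons (exterior rings) from the same event
polygon_a = [(-17.2, 32.7), (-17.1, 32.7), (-17.1, 32.8), (-17.2, 32.8)]
polygon_b = [(-17.15, 32.75), (-17.05, 32.75), (-17.05, 32.85)]

# A single bbox encompassing both labels
combined = bbox_of_coords(polygon_a + polygon_b)
print(combined)  # (-17.2, 32.7, -17.05, 32.85)
```

TerraKit additionally groups labels per area and stores the result as rows of a GeoDataFrame, but the underlying geometry operation is this simple envelope.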
# Set a working directory, a dataset name and a directory where some labels can be found
DATASET_NAME = "test_dataset"
WORKING_DIR = f"./tmp/{DATASET_NAME}"
LABELS_FOLDER = "./test_wildfire_vector"
# Download the example labels if they are not already present.
if not Path(LABELS_FOLDER).is_dir() or not set(EXAMPLE_LABEL_FILES).issubset(
    glob(f"{LABELS_FOLDER}/*.json")
):
rapid_mapping_geojson_downloader(
event_id="748",
aoi="01",
monitoring_number="05",
version="v1",
dest=LABELS_FOLDER,
)
rapid_mapping_geojson_downloader(
event_id="801",
aoi="01",
monitoring_number="02",
version="v1",
dest=LABELS_FOLDER,
)
# Process the labels by providing a labels folder, working directory and dataset name.
labels_gdf, grouped_bbox_gdf = terrakit.process_labels(
labels_folder=LABELS_FOLDER,
dataset_name=DATASET_NAME,
working_dir=WORKING_DIR,
)
print(labels_gdf)
print(grouped_bbox_gdf)
# Plot the processed labels and bboxes to confirm they appear as expected.
plot_label_dataframes(labels_gdf, grouped_bbox_gdf)
# Additionally plot labels and bbox on map.
map_collection, title_list = plot_labels_on_map(labels_gdf, grouped_bbox_gdf)
for title, event_map in zip(title_list, map_collection):
    print(title)
    display(event_map)
Download tile bounding box and labels for: EMSR748_AOI01_DEL_MONIT05_observedEventA_v1_2024-08-26.json
2. Download the data¶
Example 2.1: Use a shapefile to download data¶
config = {
"download": {
"data_sources": [
{
"data_connector": "sentinel_aws",
"collection_name": "sentinel-2-l2a",
"bands": ["blue", "green", "red"],
},
],
"date_allowance": {"pre_days": 0, "post_days": 21},
"transform": {
"scale_data_xarray": True,
"impute_nans": True,
"reproject": True,
},
"max_cloud_cover": 80,
},
}
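The date_allowance settings appear to expand each label date into a search window — here, from the label date itself through 21 days after it. Assuming that interpretation, the window for the Madeira event (2024-08-26) could be computed as follows (a sketch, not TerraKit's actual implementation):

```python
from datetime import date, timedelta

def search_window(label_date, pre_days, post_days):
    """Expand a label date into an inclusive (start, end) search window."""
    return (label_date - timedelta(days=pre_days),
            label_date + timedelta(days=post_days))

start, end = search_window(date(2024, 8, 26), pre_days=0, post_days=21)
print(start.isoformat(), end.isoformat())  # 2024-08-26 2024-09-16
```

Consistent with this, the tile actually returned below is dated 2024-08-30, which falls inside the window.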
queried_data = terrakit.download_data(
data_sources=config["download"]["data_sources"],
date_allowance=config["download"]["date_allowance"],
transform=config["download"]["transform"],
max_cloud_cover=config["download"]["max_cloud_cover"],
dataset_name=DATASET_NAME,
working_dir=WORKING_DIR,
keep_files=False,
)
Example 2.2: Inspect the data¶
Use the plot_tiles_and_label_pair function to inspect the downloaded tiles and corresponding labels.
plot_tiles_and_label_pair(
queried_data, bands=config["download"]["data_sources"][0]["bands"]
)
Legend
image_0: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed.tif, label_0: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_labels.tif
---
Legend
image_1: sentinel_aws_sentinel-2-l2a_2025-04-23_imputed.tif, label_1: sentinel_aws_sentinel-2-l2a_2025-04-23_imputed_labels.tif
---
3. Chip the data¶
Now that the tiled data has been downloaded, let's chip it into fixed-size samples.
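Chipping slices each downloaded tile into non-overlapping sample_dim × sample_dim windows. A minimal NumPy sketch of the idea (illustrative only, not TerraKit's implementation):

```python
import numpy as np

def chip_array(arr, sample_dim):
    """Split a (bands, H, W) array into non-overlapping sample_dim x sample_dim
    chips, dropping any partial chips at the right/bottom edges."""
    _, h, w = arr.shape
    chips = []
    for row in range(0, h - sample_dim + 1, sample_dim):
        for col in range(0, w - sample_dim + 1, sample_dim):
            chips.append(arr[:, row:row + sample_dim, col:col + sample_dim])
    return chips

tile = np.zeros((3, 600, 520))            # toy 3-band tile
chips = chip_array(tile, sample_dim=256)
print(len(chips), chips[0].shape)          # 4 (3, 256, 256)
```

TerraKit also pairs each chip with the matching window of the rasterised labels, which is why the results below come in `.data.tif` / `.label.tif` pairs.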
3.1 Example 1: Use the queried data list returned from download_data to find the files to chip.¶
chip_args = {
"chip": {"sample_dim": 256},
}
res = terrakit.chip_and_label_data(
dataset_name=DATASET_NAME,
sample_dim=chip_args["chip"]["sample_dim"],
queried_data=queried_data,
working_dir=WORKING_DIR,
)
# Note that we can use the working directory to find the list of files to chip.
# Uncomment to try this out.
# res = terrakit.chip_and_label_data(
# dataset_name=DATASET_NAME,
# working_dir=WORKING_DIR,
# sample_dim=chip_args["chip"]["sample_dim"],
# keep_files=True,
# )
Check the results¶
Use the plot_chip_and_label_pairs function to check the chip and label pairs look as expected.
plot_chip_and_label_pairs(
res, bands=config["download"]["data_sources"][0]["bands"], samples=10
)
Legend
image_0: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_9.data.tif
image_1: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_4.data.tif
image_2: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_16.data.tif
image_3: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_8.data.tif
image_4: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_17.data.tif
image_5: sentinel_aws_sentinel-2-l2a_2025-04-23_imputed_0.data.tif
image_6: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_20.data.tif
image_7: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_2.data.tif
image_8: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_18.data.tif
image_9: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_11.data.tif
---
label_0: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_9.label.tif
label_1: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_4.label.tif
label_2: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_16.label.tif
label_3: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_8.label.tif
label_4: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_17.label.tif
label_5: sentinel_aws_sentinel-2-l2a_2025-04-23_imputed_0.label.tif
label_6: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_20.label.tif
label_7: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_2.label.tif
label_8: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_18.label.tif
label_9: sentinel_aws_sentinel-2-l2a_2024-08-30_imputed_11.label.tif
4. Store¶
terrakit.taco_store_data(
dataset_name=DATASET_NAME,
working_dir=WORKING_DIR,
save_dir=WORKING_DIR,
tortilla_name="terrakit_curated_dataset.tortilla",
)
terrakit.load_tortilla("./tmp/test_dataset/terrakit_curated_dataset.tortilla")
5. Upload¶
6. Additional utility functions¶
check_and_crop_bbox(bbox=[-73.645550, 44.233885, -72.539635, 44.278023], resolution=10)
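Assuming check_and_crop_bbox validates that a lon/lat bbox rasterised at the requested resolution does not exceed some maximum size, the pixel dimensions involved can be estimated from the bbox extent. A rough sketch using a simple metres-per-degree approximation (not the actual implementation):

```python
import math

def approx_pixel_dims(bbox, resolution):
    """Estimate (width_px, height_px) of a lon/lat bbox rasterised at
    `resolution` metres per pixel, using ~111,320 m per degree of latitude
    and cos(lat)-scaled longitude."""
    minx, miny, maxx, maxy = bbox
    metres_per_deg = 111_320
    mid_lat = math.radians((miny + maxy) / 2)
    width_m = (maxx - minx) * metres_per_deg * math.cos(mid_lat)
    height_m = (maxy - miny) * metres_per_deg
    return round(width_m / resolution), round(height_m / resolution)

# The bbox from the call above, at 10 m resolution
print(approx_pixel_dims([-73.645550, 44.233885, -72.539635, 44.278023], resolution=10))
```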
Raster Labels to Data¶
Let's take a quick look at how TerraKit can also generate an ML-ready dataset from a collection of raster labels.
The labels used in this section are in raster format, from two burn scar events included in the ibm-nasa-geospatial/hls_burn_scars Hugging Face dataset.
1. Process raster labels¶
As before, this initial step takes a directory containing label files; this time the labels are raster files (.tif). The date is again assumed to be contained in the filename. Supported date formats are YYYYDDD (7 digits, year plus day-of-year), YYYYMMDD (8 digits), and YYMMDD (6 digits, interpreted as 20YYMMDD).
# Set a working directory, a dataset name and a directory where some labels can be found
DATASET_NAME_RASTER = "test_dataset_raster"
WORKING_DIR_RASTER = f"./tmp/{DATASET_NAME_RASTER}"
LABELS_FOLDER_RASTER = "./test_burn_scar_raster"
if not Path(LABELS_FOLDER_RASTER).is_dir() or not set(
    EXAMPLE_RASTER_LABEL_FILES
).issubset(glob(f"{LABELS_FOLDER_RASTER}/*.tif")):
for filename in EXAMPLE_RASTER_LABEL_FILES:
hugging_face_file_downloader(
repo_id="ibm-nasa-geospatial/hls_burn_scars",
filename=filename,
revision="e48662b31288f1d5f1fd5cf5ebb0e454092a19ce",
subfolder="training",
dest=LABELS_FOLDER_RASTER,
)
labels_gdf, grouped_bbox_gdf = terrakit.process_labels(
labels_folder=LABELS_FOLDER_RASTER,
dataset_name=DATASET_NAME_RASTER,
working_dir=WORKING_DIR_RASTER,
label_type="raster",
)
print(labels_gdf)
print(grouped_bbox_gdf)
plot_label_dataframes(labels_gdf, grouped_bbox_gdf)
2. Download the data¶
config = {
"download": {
"data_sources": [
{
"data_connector": "sentinel_aws",
"collection_name": "sentinel-2-l2a",
"bands": ["blue", "green", "red"],
},
],
"date_allowance": {"pre_days": 0, "post_days": 21},
"transform": {
"scale_data_xarray": True,
"impute_nans": True,
"reproject": True,
},
"max_cloud_cover": 80,
},
}
queried_data = terrakit.download_data(
data_sources=config["download"]["data_sources"],
date_allowance=config["download"]["date_allowance"],
transform=config["download"]["transform"],
max_cloud_cover=config["download"]["max_cloud_cover"],
dataset_name=DATASET_NAME_RASTER,
working_dir=WORKING_DIR_RASTER,
keep_files=False,
)
plot_tiles_and_label_pair(
queried_data, bands=config["download"]["data_sources"][0]["bands"]
)
Legend
image_0: sentinel_aws_sentinel-2-l2a_2018-09-02_imputed.tif, label_0: sentinel_aws_sentinel-2-l2a_2018-09-02_imputed_labels.tif
---
Legend
image_1: sentinel_aws_sentinel-2-l2a_2020-09-04_imputed.tif, label_1: sentinel_aws_sentinel-2-l2a_2020-09-04_imputed_labels.tif
---
3. Chip the data¶
Now that the tiled data has been downloaded, let's chip it into fixed-size samples.
chip_args = {
"chip": {"sample_dim": 256},
}
res = terrakit.chip_and_label_data(
dataset_name=DATASET_NAME_RASTER,
sample_dim=chip_args["chip"]["sample_dim"],
queried_data=queried_data,
working_dir=WORKING_DIR_RASTER,
)
plot_chip_and_label_pairs(
res, bands=config["download"]["data_sources"][0]["bands"], samples=10
)
Legend
image_0: sentinel_aws_sentinel-2-l2a_2020-09-04_imputed_13.data.tif
image_1: sentinel_aws_sentinel-2-l2a_2020-09-04_imputed_57.data.tif
image_2: sentinel_aws_sentinel-2-l2a_2020-09-04_imputed_69.data.tif
image_3: sentinel_aws_sentinel-2-l2a_2020-09-04_imputed_27.data.tif
image_4: sentinel_aws_sentinel-2-l2a_2020-09-04_imputed_5.data.tif
image_5: sentinel_aws_sentinel-2-l2a_2020-09-04_imputed_17.data.tif
image_6: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_17.data.tif
image_7: sentinel_aws_sentinel-2-l2a_2018-09-02_imputed_5.data.tif
image_8: sentinel_aws_sentinel-2-l2a_2020-09-04_imputed_10.data.tif
image_9: sentinel_aws_sentinel-2-l2a_2020-09-04_imputed_25.data.tif
---
label_0: sentinel_aws_sentinel-2-l2a_2020-09-04_imputed_13.label.tif
label_1: sentinel_aws_sentinel-2-l2a_2020-09-04_imputed_57.label.tif
label_2: sentinel_aws_sentinel-2-l2a_2020-09-04_imputed_69.label.tif
label_3: sentinel_aws_sentinel-2-l2a_2020-09-04_imputed_27.label.tif
label_4: sentinel_aws_sentinel-2-l2a_2020-09-04_imputed_5.label.tif
label_5: sentinel_aws_sentinel-2-l2a_2020-09-04_imputed_17.label.tif
label_6: sentinel_aws_sentinel-2-l2a_2021-11-02_imputed_17.label.tif
label_7: sentinel_aws_sentinel-2-l2a_2018-09-02_imputed_5.label.tif
label_8: sentinel_aws_sentinel-2-l2a_2020-09-04_imputed_10.label.tif
label_9: sentinel_aws_sentinel-2-l2a_2020-09-04_imputed_25.label.tif
4. Store¶
terrakit.taco_store_data(
dataset_name=DATASET_NAME_RASTER,
working_dir=WORKING_DIR_RASTER,
save_dir=WORKING_DIR_RASTER,
tortilla_name="terrakit_curated_dataset.tortilla",
)