Process labels in preparation for curating a new dataset
The first step in the TerraKit dataset curation pipeline is processing the labels.
Use the process_labels function (or the CLI labels subcommand) and point it at a directory containing geospatial labels in either vector or raster form. The function returns two DataFrames: one containing bounding box and temporal information for all of the geospatial locations identified in the label files, and one containing the labels. Each DataFrame is also saved as a .shp file.
By default, the temporal information is expected in the label filename in YYYY-MM-DD format.
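As a rough illustration of the filename convention, the sketch below pulls a YYYY-MM-DD date out of a filename with a regular expression. The helper extract_date_from_filename and the example filenames are hypothetical and are not part of TerraKit; the library's own parsing may well differ.

```python
import re
from datetime import date, datetime
from typing import Optional

# Matches an ISO-style date such as 2024-08-26 anywhere in a string.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def extract_date_from_filename(filename: str) -> Optional[date]:
    """Return the first valid YYYY-MM-DD date found in filename, or None.

    Hypothetical helper for illustration; not TerraKit's implementation.
    """
    for match in DATE_PATTERN.finditer(filename):
        try:
            return datetime.strptime(match.group(), "%Y-%m-%d").date()
        except ValueError:
            # The text has the right shape but is not a real calendar date.
            continue
    return None


# Hypothetical filenames, for illustration only.
print(extract_date_from_filename("wildfire_2024-08-26_labels.json"))
print(extract_date_from_filename("wildfire_labels.json"))
```

A check like this can be run over a labels folder before processing to spot files that carry no recognisable date.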
Here's an example of how to use the process_labels step in the TerraKit pipeline:
from terrakit.transform.labels import process_labels
label_args = {
    "dataset_name": "MyDataset",
    "working_dir": "./tmp",
    "labels": {
        "labels_folder": "./docs/example/test_wildfire_vector",
    },
}
labels_gdf, grouped_bbox_gdf = process_labels(
    dataset_name=label_args["dataset_name"],
    working_dir=label_args["working_dir"],
    labels_folder=label_args["labels"]["labels_folder"],
)
The results are saved to ./tmp/MyDataset_all_bboxes.shp and ./tmp/MyDataset_labels.shp.
Alternatively, write the same arguments in a config file that the TerraKit CLI can use:
# ./config.yaml
dataset_name: "MyDataset"
working_dir: "./tmp"
labels:
  labels_folder: "./docs/example/test_wildfire_vector"
Configure the Labels pipeline
Use the following parameters to configure the TerraKit Labels pipeline.
Active
active: Enables the labels pipeline to run. Set to False to skip the step. Default: True
Labels folder
labels_folder: Points to a directory containing geospatial label files to be processed. Required parameter.
Datetime Information
datetime_info: Set to filename by default, in which case TerraKit looks for temporal information in the label filename in the format YYYY-MM-DD. Alternatively, set to csv to provide datetime information in an accompanying CSV file in the format:
# metadata.csv
filename,date
EMSR748_AOI01_DEL_MONIT05_observedEventA_v1.json,2024-08-26
EMSR801_AOI01_DEL_MONIT02_observedEventA_v1.json,2025-04-23
Save this file as metadata.csv in the labels_folder.
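To make the expected layout concrete, here is a minimal sketch that writes a metadata.csv with the two columns shown above and reads it back into a filename-to-date mapping, using only the Python standard library. The helper name read_label_dates and the temporary directory standing in for the labels_folder are illustrative assumptions; this is not how TerraKit itself loads the file.

```python
import csv
import tempfile
from datetime import date, datetime
from pathlib import Path
from typing import Dict


def read_label_dates(csv_path: Path) -> Dict[str, date]:
    """Map each label filename to its date, validating the YYYY-MM-DD format.

    Hypothetical helper for illustration; not part of TerraKit.
    """
    with csv_path.open(newline="") as f:
        return {
            row["filename"]: datetime.strptime(row["date"], "%Y-%m-%d").date()
            for row in csv.DictReader(f)
        }


# A temporary directory stands in for the labels_folder in this sketch.
labels_folder = Path(tempfile.mkdtemp())
csv_path = labels_folder / "metadata.csv"

rows = [
    ("EMSR748_AOI01_DEL_MONIT05_observedEventA_v1.json", "2024-08-26"),
    ("EMSR801_AOI01_DEL_MONIT02_observedEventA_v1.json", "2025-04-23"),
]
with csv_path.open("w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["filename", "date"])  # header row, as in the example
    writer.writerows(rows)

label_dates = read_label_dates(csv_path)
for name, day in label_dates.items():
    print(name, day.isoformat())
```

Parsing the dates up front means a malformed entry raises a ValueError before any label processing starts.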
Label type
label_type: Set to either raster or vector to match the format of the label data.
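When it is unclear which value applies to a folder, the file extension is usually a reasonable hint. The sketch below guesses a label_type value from a filename. The helper guess_label_type and the extension sets are assumptions made for illustration; the list of formats TerraKit actually accepts is not given here.

```python
from pathlib import Path

# Assumed extension-to-type mapping, for illustration only.
VECTOR_EXTENSIONS = {".json", ".geojson", ".shp", ".gpkg"}
RASTER_EXTENSIONS = {".tif", ".tiff"}


def guess_label_type(filename: str) -> str:
    """Return "vector" or "raster" based on the file extension.

    Hypothetical helper; not part of TerraKit.
    """
    suffix = Path(filename).suffix.lower()
    if suffix in VECTOR_EXTENSIONS:
        return "vector"
    if suffix in RASTER_EXTENSIONS:
        return "raster"
    raise ValueError(f"Cannot infer label type from extension: {suffix!r}")


print(guess_label_type("EMSR748_AOI01_DEL_MONIT05_observedEventA_v1.json"))
```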
Download example labels
To get started with a set of example labels, download them with the rapid_mapping_geojson_downloader function:
from terrakit.general_utils.labels_downloader import rapid_mapping_geojson_downloader
example_label_1 = rapid_mapping_geojson_downloader(event_id="748", aoi="01", monitoring_number="05", version="v1", dest="./docs/examples/test_wildfire_vector")
example_label_2 = rapid_mapping_geojson_downloader(event_id="801", aoi="01", monitoring_number="02", version="v1", dest="./docs/examples/test_wildfire_vector")