001-Introduction-to-Onboarding-Tuning-Data¶
📥 Download 001-Introduction-to-Onboarding-Tuning-Data.ipynb and try it out
Introduction¶
This notebook is intended as a guide to onboarding a new fine-tuning dataset to the Geospatial Studio using the Python SDK.
For more information about the Geospatial Studio, see the docs page: Geospatial Studio Docs
For more information about the Geospatial Studio SDK and all the functions available through it, see the SDK docs page: Geospatial Studio SDK Docs
Prerequisites¶
- Access to a deployed instance of the Geospatial Studio.
- Ability to run and edit a copy of this notebook.
- A sample dataset you want to onboard.
Install SDK:¶
Prepare a Python 3.9+ environment, however you normally do that (e.g. conda, pyenv, poetry, etc.), and activate this new environment.
Install Jupyter into that environment:
python -m pip install --upgrade pip
pip install notebook
Install the SDK with:
python -m pip install geostudio
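To confirm the SDK installed into the active environment, you can run a quick import smoke test from the command line:
python -c "import geostudio; print('geostudio imported OK')"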
%load_ext autoreload
%autoreload 2
# import the required packages
import urllib3
import json
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
from geostudio import Client
Connecting to the platform¶
First, we set up the connection to the platform backend. To do this, we need the base URL for the Studio UI and an API key.
To get an API Key:
- Go to the Geospatial Studio UI page and navigate to the Manage your API keys link.
- This should pop up a window where you can generate, access and delete your API keys. NB: every user is limited to a maximum of two active API keys at any one time.
Store the API key and the Studio UI base URL in a local credentials file, for example /Users/bob/.geostudio_config_file. You can do this by:
echo "GEOSTUDIO_API_KEY=<paste_api_key_here>" > .geostudio_config_file
echo "BASE_STUDIO_UI_URL=<paste_ui_base_url_here>" >> .geostudio_config_file
Copy and paste the path to this credentials file into the call below.
#############################################################
# Initialize Geostudio client using a geostudio config file
#############################################################
gfm_client = Client(geostudio_config_file=".geostudio_config_file")
List and explore existing datasets in the platform¶
gfm_client.list_datasets(output="df")
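If the list is long, you can filter the returned DataFrame to find the dataset_id you need. This is a hedged sketch: it assumes the DataFrame has dataset_name and dataset_id columns, which may differ in your deployment.
# Sketch: filter datasets by name (column names are assumptions)
datasets_df = gfm_client.list_datasets(output="df")
matches = datasets_df[datasets_df["dataset_name"].str.contains("flood", case=False, na=False)]
matches[["dataset_id", "dataset_name"]]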
# paste the dataset_id of the dataset you want to explore
gfm_client.get_dataset("geodata-dtxfvhqh2poaylszpfigfd")
Onboard a new Dataset¶
In order to onboard your dataset to the Geospatial Studio, you need a direct download URL pointing to a zip file of the dataset. You can use this dataset URL as an example to work through this notebook.
If you only have the dataset locally, you can use Box, OneDrive or any other cloud storage you are used to in order to generate a direct download URL.
Optionally, you can upload your data to a temporary location in the cloud (within the Studio object storage) and get a URL to pass to the onboarding process. (NB: the same upload function can be useful for pushing files to inference or processing pipelines.)
The dataset needs to be packaged as a zip file.
Optional: zip data files for upload:
zip -j flooding-dataset-upload.zip /Downloads/flooding-dataset-upload/*
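If you prefer to stay in Python rather than using the zip CLI, the standard library can build the same flat archive. A minimal sketch, assuming the example paths above:
import zipfile
from pathlib import Path

# Equivalent of `zip -j`: store files without their directory prefix
src_dir = Path("/Downloads/flooding-dataset-upload")
with zipfile.ZipFile("flooding-dataset-upload.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for f in src_dir.iterdir():
        if f.is_file():
            zf.write(f, arcname=f.name)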
# (Optional) If you wish to upload the data archive through the Studio, you can use this function, pasting in the path to your zipped dataset.
uploaded_links = gfm_client.upload_file('/Users/beldinemoturi/Downloads/flood-dataset-test.zip')
uploaded_links
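The upload response is used later as the dataset_url in the onboarding payloads; based on that usage, it is assumed to be a dict exposing a download_url key:
# The onboarding payloads below read uploaded_links["download_url"]
print(uploaded_links["download_url"])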
Onboard the dataset to the dataset factory¶
Now we provide information about the dataset, including name, description, data and label file suffixes, dataset purpose, data sources, etc. Below is an example payload that defines most of the values you will need to onboard a dataset to the Studio. For more information on what you can provide during the onboarding process, check out the SDK Documentation.
The Geospatial Studio allows users to onboard either multi-modal or uni-modal data. For multi-modal data, users should provide, as a list, a separate data source for each input modality of the dataset.
The connector, collection and modality_tag values provided for each data source need to be correctly matched; the table below lists some example combinations, and a small validation helper is sketched after it. (The modality tags relate to the modalities in the TerraMind model.)
| Collections | Modality tag | Connector |
|---|---|---|
| s2_l1c | S2L1C | sentinelhub |
| dem | DEM | sentinelhub |
| s1_grd | S1GRD | sentinelhub |
| hls_l30 | HLS_L30 | sentinelhub |
| hls_s30 | HLS_S30 | sentinelhub |
| s2_l2a | S2L2A | sentinelhub |
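As a quick guard against mismatched pairs, you can encode the table above in a small local helper. This is not part of the SDK, just a sanity check you can run before onboarding:
# Valid (collection, modality_tag) pairs, taken from the table above
VALID_PAIRS = {
    "s2_l1c": "S2L1C",
    "dem": "DEM",
    "s1_grd": "S1GRD",
    "hls_l30": "HLS_L30",
    "hls_s30": "HLS_S30",
    "s2_l2a": "S2L2A",
}

def check_data_sources(data_sources):
    """Raise if a data source pairs a collection with the wrong modality_tag."""
    for src in data_sources:
        expected = VALID_PAIRS.get(src["collection"])
        if expected != src["modality_tag"]:
            raise ValueError(
                f"collection {src['collection']!r} expects modality_tag "
                f"{expected!r}, got {src['modality_tag']!r}"
            )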
# Multi-modal data
# Edit the details in the dict and dataset_url below to suit your dataset
multi_modal_datasetDict = {
"dataset_name": "Sentinel Flood Multimodal Test",
"data_sources": [
{
"bands": [
{
"index": "0",
"band_name": "Coastal_aerosol",
"description": "",
"scaling_factor": "1",
},
{
"index": "1",
"band_name": "Blue",
"RGB_band": "B",
"description": "",
"scaling_factor": "1",
},
{
"index": "2",
"band_name": "Green",
"RGB_band": "G",
"description": "",
"scaling_factor": "1",
},
{
"index": "3",
"band_name": "Red",
"RGB_band": "R",
"description": "",
"scaling_factor": "1",
},
{
"index": "4",
"band_name": "05_-_Vegetation_Red_Edge",
"description": "",
"scaling_factor": "1",
},
{
"index": "5",
"band_name": "06_-_Vegetation_Red_Edge",
"description": "",
"scaling_factor": "1",
},
{
"index": "6",
"band_name": "07_-_Vegetation_Red_Edge",
"description": "",
"scaling_factor": "1",
},
{
"index": "7",
"band_name": "08_-_NIR",
"description": "",
"scaling_factor": "1",
},
{
"index": "8",
"band_name": "08A_-_Vegetation_Red_Edge",
"description": "",
"scaling_factor": "1",
},
{
"index": "9",
"band_name": "09_-_Water_vapour",
"description": "",
"scaling_factor": "1",
},
{
"index": "10",
"band_name": "11_-_SWIR",
"description": "",
"scaling_factor": "1",
},
{
"index": "11",
"band_name": "12_-_SWIR",
"description": "",
"scaling_factor": "1",
},
{
"index": "12",
"band_name": "Cloud_Probability",
"description": "",
"scaling_factor": "1",
},
],
"connector": "sentinelhub",
"collection": "s2_l2a",
"modality_tag": "S2L1C",
"file_suffix": "_S2Hand.tif"
},
{
"bands": [
{"index": "0", "band_name": "VV (Gray)", "description": ""},
{"index": "1", "band_name": "VH", "description": ""},
],
"connector": "sentinelhub",
"collection": "s1_grd",
"modality_tag": "S1GRD",
"align_dates": "true",
"file_suffix": "_S1Hand.tif",
"scaling_factor": [1, 1],
},
],
"label_categories": [
{"id": "0", "name": "No Floods", "description": "Flooding assets"},
{"id": "1", "name": "Floods", "description": "Flooding assets"},
],
"dataset_url": uploaded_links["download_url"],
"description": "Flood data from places",
"label_suffix": "_LabelHand.tif",
"purpose": "Segmentation",
}
# Unimodal data
# Edit the details in the dict and dataset_url below to suit your dataset
unimodal_datasetDict = {
"dataset_name": "Inria Dataset Buildings dataset",
"data_sources": [
{
"bands": [
{
"index": "0",
"band_name": "Red",
"RGB_band": "R",
"description": "",
"scaling_factor": "1",
},
{
"index": "1",
"band_name": "Green",
"RGB_band": "G",
"description": "",
"scaling_factor": "1",
},
{
"index": "2",
"band_name": "Blue",
"RGB_band": "B",
"description": "",
"scaling_factor": "1",
},
],
"connector": "sentinelhub",
"collection": "hls_l30",
"modality_tag": "HLS_L30",
"file_suffix": "_train.tif",
}
],
"label_categories": [
{"id": "0", "name": "No buildings", "description": "Building assets"},
{"id": "1", "name": "Buildings", "description": "Building assets"},
],
"dataset_url": uploaded_links["download_url"],
"description": "Inria building labeling dataset",
"label_suffix": "_label.tif",
"purpose": "Segmentation",
}
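Before sending either payload, you can run the sanity-check helper defined under the table above:
# Verify collection/modality_tag pairs before onboarding
check_data_sources(multi_modal_datasetDict["data_sources"])
check_data_sources(unimodal_datasetDict["data_sources"])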
Once we have prepared the dataset onboarding payload, we can use the onboard_dataset function to
onboard the dataset to the Studio. This sends the payload to the Studio backend API and triggers the
dataset onboarding process, which involves downloading the data, validating it, and calculating
statistics and metadata, before storing it ready for model tuning.
# Replace the data with the correct dataset payload
onboard_response = gfm_client.onboard_dataset(data=unimodal_datasetDict)
display(json.dumps(onboard_response, indent=2))
Monitor onboarding status¶
You can then monitor the status of the onboarding process through the API with the get_dataset() function or the polling function below. Alternatively, you can monitor progress and view the dataset in the UI.
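If you want more control than the built-in poller (custom logging, a timeout, etc.), a manual loop over get_dataset() works too. A sketch, assuming the response is a dict whose "status" field ends in "succeeded" or "failed" (adjust these to your deployment's schema):
import time

def wait_for_onboarding(client, dataset_id, interval=30, timeout=3600):
    # Poll until the dataset reaches an assumed terminal status
    deadline = time.time() + timeout
    while time.time() < deadline:
        info = client.get_dataset(dataset_id=dataset_id)
        status = info.get("status")
        print(f"dataset {dataset_id}: {status}")
        if status in ("succeeded", "failed"):
            return info
        time.sleep(interval)
    raise TimeoutError(f"onboarding of {dataset_id} did not finish in {timeout}s")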
# poll onboarding status
gfm_client.poll_onboard_dataset_until_finished(onboard_response["dataset_id"])
gfm_client.get_dataset(dataset_id=onboard_response["dataset_id"])