001-Introduction-to-Onboarding-Tuning-Data¶
📥 Download 001-Introduction-to-Onboarding-Tuning-Data.ipynb and try it out
Introduction¶
This notebook is intended as a guide to onboarding a new fine-tuning dataset to the Geospatial Studio using the Python SDK.
For more information about the Geospatial Studio, see the docs page: Geospatial Studio Docs
For more information about the Geospatial Studio SDK and all the functions available through it, see the SDK docs page: Geospatial Studio SDK Docs
Prerequisites¶
- Access to a deployed instance of the Geospatial Studio.
- Ability to run and edit a copy of this notebook.
- A sample dataset you want to onboard.
Install SDK:¶
Prepare a Python 3.9+ environment, however you normally do that (e.g. conda, pyenv, poetry, etc.), and activate this new environment.
Install Jupyter into that environment:
python -m pip install --upgrade pip
pip install notebook
Install the SDK with:
python -m pip install geostudio
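To confirm the SDK installed into the active environment, you can run a quick import smoke test from the command line:
python -c "import geostudio; print('geostudio imported OK')"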
%load_ext autoreload
%autoreload 2
# import the required packages
import urllib3
import json
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
from geostudio import Client
Connecting to the platform¶
First, we set up the connection to the platform backend. To do this, we need the base URL for the Studio UI and an API key.
To get an API Key:
- Go to the Geospatial Studio UI page and navigate to the Manage your API keys link.
- This should pop up a window where you can generate, access and delete your API keys. NB: every user is limited to a maximum of two active API keys at any one time.
Store the API key and the Studio UI base URL in a local credentials file, for example /Users/bob/.geostudio_config_file. You can do this by:
echo "GEOSTUDIO_API_KEY=<paste_api_key_here>" > .geostudio_config_file
echo "BASE_STUDIO_UI_URL=<paste_ui_base_url_here>" >> .geostudio_config_file
Copy and paste the path to this credentials file into the call below.
#############################################################
# Initialize Geostudio client using a geostudio config file
#############################################################
gfm_client = Client(geostudio_config_file=".geostudio_config_file")
List and explore existing datasets in the platform¶
gfm_client.list_datasets(output="df")
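If the list is long, you can filter the returned DataFrame to find the dataset_id you need. This is a hedged sketch: it assumes the DataFrame has dataset_name and dataset_id columns, which may differ in your deployment.
# Sketch: filter datasets by name (column names are assumptions)
datasets_df = gfm_client.list_datasets(output="df")
matches = datasets_df[datasets_df["dataset_name"].str.contains("flood", case=False, na=False)]
matches[["dataset_id", "dataset_name"]]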
# paste the dataset_id of the dataset you want to explore
gfm_client.get_dataset("geodata-dtxfvhqh2poaylszpfigfd")
Onboard a new Dataset¶
In order to onboard your dataset to the Geospatial Studio, you need a direct download URL pointing to a zip file of the dataset. You can use this dataset URL as an example to work through this notebook.
If you only have the dataset locally, you can use Box, OneDrive or any other cloud storage you are used to in order to generate a direct download URL.
Optionally, you can upload your data to a temporary location in the cloud (within the Studio object storage) and get a URL to pass to the onboarding process. (NB: the same upload function can be useful for pushing files to inference or processing pipelines.)
The dataset needs to be packaged as a zip file.
Optional: zip data files for upload:
zip -j flooding-dataset-upload.zip /Downloads/flooding-dataset-upload/*
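If you prefer to stay in Python rather than using the zip CLI, the standard library can build the same flat archive. A minimal sketch, assuming the example paths above:
import zipfile
from pathlib import Path

# Equivalent of `zip -j`: store files without their directory prefix
src_dir = Path("/Downloads/flooding-dataset-upload")
with zipfile.ZipFile("flooding-dataset-upload.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for f in src_dir.iterdir():
        if f.is_file():
            zf.write(f, arcname=f.name)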
# (Optional) If you wish to upload the data archive through the Studio, you can use this function, pasting in the path to your zipped dataset.
uploaded_links = gfm_client.upload_file('/Users/beldinemoturi/Downloads/flood-dataset-test.zip')
uploaded_links
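The upload response is used later as the dataset_url in the onboarding payloads; based on that usage, it is assumed to be a dict exposing a download_url key:
# The onboarding payloads below read uploaded_links["download_url"]
print(uploaded_links["download_url"])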
Onboard the dataset to the dataset factory¶
Now we provide information about the dataset, including name, description, data and label file suffixes, dataset purpose, data sources, etc. Below is an example payload that defines most of the values you will need to onboard a dataset to the Studio. For more information on what you can provide during the onboarding process, check out the SDK Documentation.
The Geospatial Studio allows users to onboard either multi-modal or uni-modal data. For multi-modal data, users should provide, as a list, a separate data source for each input modality of the dataset.
The connector, collection and modality_tag values provided for each data source need to be correctly matched; the table below lists some example combinations, and a small validation helper is sketched after it. (The modality tags relate to the modalities in the TerraMind model.)
| Collections | Modality tag | Connector |
|---|---|---|
| s2_l1c | S2L1C | sentinelhub |
| dem | DEM | sentinelhub |
| s1_grd | S1GRD | sentinelhub |
| hls_l30 | HLS_L30 | sentinelhub |
| hls_s30 | HLS_S30 | sentinelhub |
| s2_l2a | S2L2A | sentinelhub |
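As a quick guard against mismatched pairs, you can encode the table above in a small local helper. This is not part of the SDK, just a sanity check you can run before onboarding:
# Valid (collection, modality_tag) pairs, taken from the table above
VALID_PAIRS = {
    "s2_l1c": "S2L1C",
    "dem": "DEM",
    "s1_grd": "S1GRD",
    "hls_l30": "HLS_L30",
    "hls_s30": "HLS_S30",
    "s2_l2a": "S2L2A",
}

def check_data_sources(data_sources):
    """Raise if a data source pairs a collection with the wrong modality_tag."""
    for src in data_sources:
        expected = VALID_PAIRS.get(src["collection"])
        if expected != src["modality_tag"]:
            raise ValueError(
                f"collection {src['collection']!r} expects modality_tag "
                f"{expected!r}, got {src['modality_tag']!r}"
            )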
# Multi-modal data
# Edit the details in the dict and dataset_url below to suit your dataset
multi_modal_datasetDict = {
"dataset_name": "Sentinel Flood Multimodal Test",
"data_sources": [
{
"bands": [
{
"index": "0",
"band_name": "Coastal_aerosol",
"description": "",
"scaling_factor": "1",
},
{
"index": "1",
"band_name": "Blue",
"RGB_band": "B",
"description": "",
"scaling_factor": "1",
},
{
"index": "2",
"band_name": "Green",
"RGB_band": "G",
"description": "",
"scaling_factor": "1",
},
{
"index": "3",
"band_name": "Red",
"RGB_band": "R",
"description": "",
"scaling_factor": "1",
},
{
"index": "4",
"band_name": "05_-_Vegetation_Red_Edge",
"description": "",
"scaling_factor": "1",
},
{
"index": "5",
"band_name": "06_-_Vegetation_Red_Edge",
"description": "",
"scaling_factor": "1",
},
{
"index": "6",
"band_name": "07_-_Vegetation_Red_Edge",
"description": "",
"scaling_factor": "1",
},
{
"index": "7",
"band_name": "08_-_NIR",
"description": "",
"scaling_factor": "1",
},
{
"index": "8",
"band_name": "08A_-_Vegetation_Red_Edge",
"description": "",
"scaling_factor": "1",
},
{
"index": "9",
"band_name": "09_-_Water_vapour",
"description": "",
"scaling_factor": "1",
},
{
"index": "10",
"band_name": "11_-_SWIR",
"description": "",
"scaling_factor": "1",
},
{
"index": "11",
"band_name": "12_-_SWIR",
"description": "",
"scaling_factor": "1",
},
{
"index": "12",
"band_name": "Cloud_Probability",
"description": "",
"scaling_factor": "1",
},
],
"connector": "sentinelhub",
"collection": "s2_l2a",
"modality_tag": "S2L1C",
"file_suffix": "_S2Hand.tif"
},
{
"bands": [
{"index": "0", "band_name": "VV (Gray)", "description": ""},
{"index": "1", "band_name": "VH", "description": ""},
],
"connector": "sentinelhub",
"collection": "s1_grd",
"modality_tag": "S1GRD",
"align_dates": "true",
"file_suffix": "_S1Hand.tif",
"scaling_factor": [1, 1],
},
],
"label_categories": [
{"id": "0", "name": "No Floods", "description": "Flooding assets"},
{"id": "1", "name": "Floods", "description": "Flooding assets"},
],
"dataset_url": uploaded_links["download_url"],
"description": "Flood data from places",
"label_suffix": "_LabelHand.tif",
"purpose": "Segmentation",
}
# Unimodal data
# Edit the details in the dict and dataset_url below to suit your dataset
unimodal_datasetDict = {
"dataset_name": "Inria Dataset Buildings dataset",
"data_sources": [
{
"bands": [
{
"index": "0",
"band_name": "Red",
"RGB_band": "R",
"description": "",
"scaling_factor": "1",
},
{
"index": "1",
"band_name": "Green",
"RGB_band": "G",
"description": "",
"scaling_factor": "1",
},
{
"index": "2",
"band_name": "Blue",
"RGB_band": "B",
"description": "",
"scaling_factor": "1",
},
],
"connector": "sentinelhub",
"collection": "hls_l30",
"modality_tag": "HLS_L30",
"file_suffix": "_train.tif",
}
],
"label_categories": [
{"id": "0", "name": "No buildings", "description": "Building assets"},
{"id": "1", "name": "Buildings", "description": "Building assets"},
],
"dataset_url": uploaded_links["download_url"],
"description": "Inria building labeling dataset",
"label_suffix": "_label.tif",
"purpose": "Segmentation",
}
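Before sending either payload, you can run the sanity-check helper defined under the table above:
# Verify collection/modality_tag pairs before onboarding
check_data_sources(multi_modal_datasetDict["data_sources"])
check_data_sources(unimodal_datasetDict["data_sources"])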
Once we have prepared the dataset onboarding payload, we can use the onboard_dataset function to
onboard the dataset to the Studio. This sends the payload to the Studio backend API and triggers the
dataset onboarding process, which involves downloading the data, validating it, and calculating
statistics and metadata, before storing it ready for model tuning.
# Replace the data with the correct dataset payload
onboard_response = gfm_client.onboard_dataset(data=unimodal_datasetDict)
display(json.dumps(onboard_response, indent=2))
Monitor onboarding status¶
You can then monitor the status of the onboarding process through the API with the get_dataset() function or the polling function below. Alternatively, you can monitor progress and view the dataset in the UI.
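If you want more control than the built-in poller (custom logging, a timeout, etc.), a manual loop over get_dataset() works too. A sketch, assuming the response is a dict whose "status" field ends in "succeeded" or "failed" (adjust these to your deployment's schema):
import time

def wait_for_onboarding(client, dataset_id, interval=30, timeout=3600):
    # Poll until the dataset reaches an assumed terminal status
    deadline = time.time() + timeout
    while time.time() < deadline:
        info = client.get_dataset(dataset_id=dataset_id)
        status = info.get("status")
        print(f"dataset {dataset_id}: {status}")
        if status in ("succeeded", "failed"):
            return info
        time.sleep(interval)
    raise TimeoutError(f"onboarding of {dataset_id} did not finish in {timeout}s")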
# poll onboarding status
gfm_client.poll_onboard_dataset_until_finished(onboard_response["dataset_id"])
gfm_client.get_dataset(dataset_id=onboard_response["dataset_id"])