Use Cases

GeoCroissant is designed to serve a wide range of GeoAI workflows. The sections below walk through nine representative use cases, each highlighting the relevant properties, design patterns, and practical considerations.


Use Case 1: Space Weather Datasets

Space weather datasets from missions such as the NOAA/NASA GOES satellites present a unique challenge: their metadata follows the SPASE (Space Physics Archive Search and Extract) standard, which uses domain-specific constructs for observatories, instruments, and wavelength channels, none of which exist in the core Croissant vocabulary.

GeoCroissant addresses this through two composite properties that simplify authoring and enable downstream mapping to SPASE:

| Property | Purpose |
| --- | --- |
| geocr:multiWavelengthConfiguration | Captures wavelength channels (e.g., via geocr:channelList) |
| geocr:solarInstrumentCharacteristics | Captures observatory and instrument identifiers |

A proof-of-concept tool, SPASECroissant Oven, demonstrated how LLMs and custom parsing logic can automate conversion from SPASE XML to GeoCroissant, including inference of measurement types, data types, and column descriptors.

{
  "geocr:multiWavelengthConfiguration": {
    "@type": "geocr:MultiWavelengthConfiguration",
    "geocr:channelList": ["171Å", "193Å"]
  },
  "geocr:solarInstrumentCharacteristics": {
    "@type": "geocr:SolarInstrumentCharacteristics",
    "geocr:observatory": "SDO",
    "geocr:instrument": "AIA"
  }
}
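
The fragment above can also be assembled programmatically. The following is a minimal sketch, assuming the observatory, instrument, and channel values have already been extracted from a SPASE record. The helper name `build_space_weather_properties` is hypothetical and is not part of SPASECroissant Oven or any GeoCroissant tooling.

```python
import json


def build_space_weather_properties(observatory, instrument, channels):
    """Return the two composite GeoCroissant properties as a JSON-LD fragment.

    Hypothetical helper for illustration; the keys mirror the example above.
    """
    # Design choice of this sketch: refuse to emit an empty channel list.
    if not channels:
        raise ValueError("at least one wavelength channel is required")
    return {
        "geocr:multiWavelengthConfiguration": {
            "@type": "geocr:MultiWavelengthConfiguration",
            "geocr:channelList": list(channels),
        },
        "geocr:solarInstrumentCharacteristics": {
            "@type": "geocr:SolarInstrumentCharacteristics",
            "geocr:observatory": observatory,
            "geocr:instrument": instrument,
        },
    }


fragment = build_space_weather_properties("SDO", "AIA", ["171Å", "193Å"])
# ensure_ascii=False keeps the Å symbol readable in the serialized output
print(json.dumps(fragment, indent=2, ensure_ascii=False))
```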

Use Case 2: Interoperability with Other Standards

Geospatial data infrastructures depend on standards alignment. This use case shows how GeoCroissant fields map to equivalent concepts in STAC and GeoDCAT, enabling automated format conversion and integration with existing catalogs.

STAC Field GeoCroissant Field External Vocabulary
id @id
title name dcat:title
description description dcat:description
bbox sc:spatialCoverage / dcat:bbox
geometry geosparql:hasGeometry
assets distribution dcat:distribution
proj:epsg geocr:coordinateReferenceSystem proj:epsg
gsd geocr:spatialResolution stac:gsd
eo:bands geocr:bandConfiguration, geocr:spectralBandMetadata STAC EO extension

Key properties in use:

| Property | Purpose |
| --- | --- |
| geocr:coordinateReferenceSystem | Ensures consistent spatial referencing across tools |
| geocr:spatialResolution | Supports discovery and ML-readiness via resolution metadata |
| geocr:bandConfiguration | Standardises multi-band raster organisation for interoperable use |
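
To illustrate how the mapping table could drive automated conversion, here is a minimal sketch that translates a few STAC Item fields into GeoCroissant fields. The function name `stac_item_to_geocroissant` is hypothetical, and the value shapes are assumptions of this sketch: the CRS is written as an "EPSG:<code>" string and the resolution as a QuantitativeValue in metres (STAC defines gsd in metres). The bbox, geometry, assets, and eo:bands rows are left out for brevity.

```python
def stac_item_to_geocroissant(item):
    """Map a STAC Item dict to a partial GeoCroissant dict (illustrative only)."""
    props = item.get("properties", {})
    out = {"@id": item["id"]}

    # title -> name, description -> description
    if "title" in props:
        out["name"] = props["title"]
    if "description" in props:
        out["description"] = props["description"]

    # proj:epsg -> geocr:coordinateReferenceSystem (string form is an assumption)
    if props.get("proj:epsg") is not None:
        out["geocr:coordinateReferenceSystem"] = f"EPSG:{props['proj:epsg']}"

    # gsd -> geocr:spatialResolution (QuantitativeValue shape is an assumption)
    if props.get("gsd") is not None:
        out["geocr:spatialResolution"] = {
            "@type": "QuantitativeValue",
            "value": props["gsd"],
            "unitText": "m",
        }
    return out


# Made-up STAC Item used only to exercise the function
stac_item = {
    "id": "example-scene-001",
    "properties": {"title": "Example scene", "proj:epsg": 32610, "gsd": 30},
}
converted = stac_item_to_geocroissant(stac_item)
print(converted)
```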

Use Case 3: Programmatic Metadata Access

When a dataset’s metadata is maintained in a dynamic catalog or registry, GeoCroissant can point clients directly to the relevant endpoint using geocr:recordEndpoint. This aligns with the OGC API – Records standard, which defines a RESTful interface for publishing and querying metadata records.

Clients can retrieve record-level metadata and apply spatial or temporal filters:

GET /api/records?bbox=-125,24,-66,49
GET /api/records?datetime=2018-01-01/2021-12-31

The returned GeoJSON FeatureCollection can be consumed directly by web clients, spatial dashboards, and registry services.

{
  "geocr:recordEndpoint": "https://example.org/api/records",
  "recordSet": [{
    "@type": "cr:RecordSet",
    "@id": "records_recordset",
    "field": [
      { "@type": "cr:Field", "name": "recordId", "dataType": "sc:Text" },
      { "@type": "cr:Field", "name": "spatialCoverage", "dataType": "sc:Place" },
      { "@type": "cr:Field", "name": "temporalCoverage", "dataType": "sc:Text" },
      { "@type": "cr:Field", "name": "geocr:spatialResolution", "dataType": "sc:QuantitativeValue" },
      { "@type": "cr:Field", "name": "geocr:spatialIndex", "dataType": "sc:Text" }
    ]
  }]
}

Key property in use:

| Property | Purpose |
| --- | --- |
| geocr:recordEndpoint | Declares the OGC API endpoint for programmatic record retrieval and filtering |
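
The filtered requests shown earlier in this use case can be built from the metadata itself. The sketch below reads geocr:recordEndpoint from a metadata dict and appends bbox and datetime query parameters in the same form as the GET examples. The helper name `build_records_url` is hypothetical, and the sketch only constructs the URL; it does not issue the request.

```python
from urllib.parse import urlencode


def build_records_url(endpoint, bbox=None, datetime_range=None):
    """Append optional bbox and datetime filters to a record endpoint URL."""
    params = {}
    if bbox is not None:
        if len(bbox) != 4:
            raise ValueError("bbox must be (min_lon, min_lat, max_lon, max_lat)")
        params["bbox"] = ",".join(str(v) for v in bbox)
    if datetime_range is not None:
        start, end = datetime_range
        params["datetime"] = f"{start}/{end}"
    if not params:
        return endpoint
    # Keep "," and "/" unescaped so the URL matches the GET examples
    return f"{endpoint}?{urlencode(params, safe=',/')}"


metadata = {"geocr:recordEndpoint": "https://example.org/api/records"}
endpoint = metadata["geocr:recordEndpoint"]

bbox_url = build_records_url(endpoint, bbox=(-125, 24, -66, 49))
time_url = build_records_url(endpoint, datetime_range=("2018-01-01", "2021-12-31"))
print(bbox_url)
print(time_url)
```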

Use Case 4: Search and Discovery via GeoSPARQL

GeoCroissant can express geometry using GeoSPARQL vocabulary terms, enabling spatial reasoning over RDF datasets in semantic graph databases such as GraphDB. The optional geocr:spatialIndex property provides efficient coarse filtering before exact geometry operations are evaluated.

See Appendix C for full SPARQL query examples.

Key properties in use:

| Property | Purpose |
| --- | --- |
| geocr:spatialIndex | Precomputed spatial index tokens for scalable coarse filtering |
| geosparql:hasGeometry | Links a dataset or record to a GeoSPARQL geometry node |
| geosparql:asWKT | Encodes geometry as a WKT typed literal |
| cr:recordSet | Exposes record-level metadata entries for individual querying |
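
The two-stage pattern described in this use case (coarse filtering on index tokens, then an exact geometry test) can be sketched in plain Python. Everything here is an assumption made for illustration: the tokens are cells of a simple 10-degree latitude/longitude grid rather than any token scheme defined by GeoCroissant, and a bounding-box intersection stands in for the exact GeoSPARQL geometry operation.

```python
import math


def grid_tokens(bbox, cell_deg=10.0):
    """Return tokens for every grid cell a (minx, miny, maxx, maxy) box touches."""
    minx, miny, maxx, maxy = bbox
    xs = range(math.floor(minx / cell_deg), math.floor(maxx / cell_deg) + 1)
    ys = range(math.floor(miny / cell_deg), math.floor(maxy / cell_deg) + 1)
    return {f"{ix}:{iy}" for ix in xs for iy in ys}


def boxes_intersect(a, b):
    """Exact test for this sketch: do two axis-aligned boxes overlap?"""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def spatial_search(records, query_bbox, cell_deg=10.0):
    """Return ids of records whose bbox intersects the query bbox."""
    query_tokens = grid_tokens(query_bbox, cell_deg)
    # Stage 1: cheap set intersection on the precomputed tokens
    candidates = [r for r in records if query_tokens & set(r["spatialIndex"])]
    # Stage 2: exact test, run only on the surviving candidates
    return [r["id"] for r in candidates if boxes_intersect(r["bbox"], query_bbox)]


# Made-up records; tokens are precomputed once, as they would be at authoring time
records = [
    {"id": "conus-chip", "bbox": (-120, 30, -110, 40)},
    {"id": "europe-chip", "bbox": (10, 40, 20, 50)},
]
for record in records:
    record["spatialIndex"] = sorted(grid_tokens(record["bbox"]))

matches = spatial_search(records, (-125, 24, -66, 49))
print(matches)
```

Because any two intersecting boxes share at least one grid cell, the coarse stage in this sketch never discards a true match; it only reduces how many records reach the exact test.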

Use Case 5: ML Pipeline Integration

Given a GeoAI dataset with image–label pairs, GeoCroissant metadata can drive end-to-end ML workflows with minimal boilerplate.

Loading metadata with mlcroissant:

import mlcroissant as mlc

dataset = mlc.Dataset("geocroissant.json")
print(dataset.metadata.to_json())

Custom PyTorch Dataset and DataLoader driven by GeoCroissant metadata:

import mlcroissant as mlc
import torch
from torch.utils.data import Dataset, DataLoader
import rasterio
import numpy as np
from pathlib import Path

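# Load dataset metadata
dataset = mlc.Dataset("geocroissant.json")
metadata = dataset.metadata
# Assumes distribution[0] points at a local directory and distribution[1] is a
# FileSet whose first "includes" entry is a glob pattern for the image chips
base_path = Path(metadata.distribution[0].content_url)
image_files = sorted(base_path.glob(metadata.distribution[1].includes[0]))

class CroissantDataset(Dataset):
    """Yields image/mask tensor pairs for the chips listed in the metadata."""

    def __init__(self, image_paths, split=None):
        # Keep only paths that contain the split name (e.g. "training"), if given
        self.image_paths = [p for p in image_paths if split in str(p)] if split else image_paths

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        image_path = self.image_paths[idx]
        # The mask path differs from the image path only in the filename suffix
        mask_path = str(image_path).replace("_merged.tif", ".mask.tif")
        with rasterio.open(image_path) as src:
            image = src.read().astype(np.float32)  # all bands: (bands, height, width)
        with rasterio.open(mask_path) as src:
            mask = src.read(1).astype(np.int64)  # first band only: (height, width)
        return {"image": torch.from_numpy(image), "mask": torch.from_numpy(mask)}

train_loader = DataLoader(CroissantDataset(image_files, split="training"), batch_size=1)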

Use Case 6: Responsible GeoAI

Responsible AI metadata documents the full lifecycle, provenance, and intended use of a dataset. GeoCroissant complements the Croissant RAI extension with two geospatially aware properties: geocr:spatialBias and geocr:samplingStrategy.

Using the HLS Burn Scar dataset as an example:

{
  "geocr:spatialBias": "Spatial coverage is concentrated over the contiguous United States; high-latitude regions and other global biomes are under-represented.",
  "geocr:samplingStrategy": "512×512 chips are sampled by centring windows on burn-scar polygons; scenes are filtered to remove high cloud cover and missing data.",
  "rai:dataCollection": "Chips generated by co-locating HLS scenes with reference burn polygons.",
  "rai:dataBiases": "Potential geographic bias (CONUS focus), temporal bias (2018–2021), and class imbalance between burned and unburned pixels.",
  "rai:dataUseCases": ["Training", "Validation", "Testing", "Fine-tuning"]
}

Key properties in use:

| Property | Purpose |
| --- | --- |
| geocr:spatialBias | Documents geographic representativeness limitations |
| geocr:samplingStrategy | Describes chip/windowing strategy and filtering decisions |
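
A publishing pipeline might want to confirm that both geospatial RAI properties are filled in before release. The check below is a minimal sketch of that idea; the function name `missing_geo_rai_fields` is hypothetical, and treating these two properties as required is an assumption of this sketch rather than a rule stated by GeoCroissant.

```python
# The two geospatially aware RAI properties named in this use case
GEO_RAI_KEYS = ("geocr:spatialBias", "geocr:samplingStrategy")


def missing_geo_rai_fields(metadata):
    """Return the geospatial RAI keys that are absent or blank in a metadata dict."""
    return [key for key in GEO_RAI_KEYS if not str(metadata.get(key, "")).strip()]


# Made-up, deliberately incomplete metadata used to exercise the check
draft = {
    "geocr:spatialBias": "Coverage is concentrated over the contiguous United States.",
    "geocr:samplingStrategy": "   ",
}
missing = missing_geo_rai_fields(draft)
print(missing)
```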

Use Case 7: Time-Series Support

Time-series EO datasets power GeoAI tasks like change detection, phenology monitoring, and forecasting. GeoCroissant supports these workflows with explicit temporal ordering via geocr:timeSeriesIndex and cadence documentation via geocr:temporalResolution.

{
  "geocr:temporalResolution": { "@type": "QuantitativeValue", "value": 1, "unitText": "month" },
  "recordSet": [{
    "@type": "cr:RecordSet",
    "@id": "timeseries_recordset",
    "geocr:timeSeriesIndex": { "@id": "timeseries_recordset/timestamp" },
    "field": [
      {
        "@type": "cr:Field",
        "@id": "timeseries_recordset/timestamp",
        "name": "timestamp",
        "dataType": "sc:DateTime"
      },
      {
        "@type": "cr:Field",
        "@id": "timeseries_recordset/image",
        "name": "image",
        "dataType": "sc:ImageObject",
        "cr:arrayShape": [3660, 3660, 13]
      }
    ]
  }]
}
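
A consumer can use geocr:timeSeriesIndex to put records in temporal order before building sequences. The sketch below resolves the index field from the record set shown above, sorts records by it, and reports the spacing between consecutive timestamps in whole months so it can be compared with geocr:temporalResolution. The helper names and the sample records are hypothetical, and timestamps are assumed to be ISO 8601 strings without a timezone suffix.

```python
from datetime import datetime


def index_field_name(record_set):
    """Resolve the name of the field referenced by geocr:timeSeriesIndex."""
    index_id = record_set["geocr:timeSeriesIndex"]["@id"]
    for field in record_set["field"]:
        if field["@id"] == index_id:
            return field["name"]
    raise KeyError(f"no field with @id {index_id!r}")


def order_records(records, record_set):
    """Sort records chronologically by the time-series index field."""
    key = index_field_name(record_set)
    return sorted(records, key=lambda r: datetime.fromisoformat(r[key]))


def month_gaps(ordered, key):
    """Return the gap, in whole calendar months, between consecutive records."""
    stamps = [datetime.fromisoformat(r[key]) for r in ordered]
    return [(b.year - a.year) * 12 + (b.month - a.month) for a, b in zip(stamps, stamps[1:])]


record_set = {
    "@id": "timeseries_recordset",
    "geocr:timeSeriesIndex": {"@id": "timeseries_recordset/timestamp"},
    "field": [
        {"@id": "timeseries_recordset/timestamp", "name": "timestamp"},
        {"@id": "timeseries_recordset/image", "name": "image"},
    ],
}
# Made-up records, deliberately out of order
records = [
    {"timestamp": "2020-03-01T00:00:00", "image": "mar.tif"},
    {"timestamp": "2020-01-01T00:00:00", "image": "jan.tif"},
    {"timestamp": "2020-02-01T00:00:00", "image": "feb.tif"},
]

ordered = order_records(records, record_set)
gaps = month_gaps(ordered, index_field_name(record_set))
print([r["image"] for r in ordered], gaps)
```

A gap list in which every entry equals the declared value (1 month in the example metadata) indicates that the series matches its stated cadence.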

Use Case 8: Adding Custom Properties

When a required attribute is already defined in an established vocabulary (GeoSPARQL, SPASE, Dublin Core), GeoCroissant recommends reusing that term directly by declaring the appropriate prefix in @context.

When no suitable external term exists, use sc:additionalProperty with sc:PropertyValue:

{
  "additionalProperty": [
    {
      "@type": "PropertyValue",
      "name": "cloudCoverMedian",
      "value": { "@type": "QuantitativeValue", "value": 12.3, "unitText": "%" }
    },
    {
      "@type": "PropertyValue",
      "name": "tilingScheme",
      "value": "MGRS"
    }
  ]
}
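
When many custom attributes need to be attached, a small builder keeps the entries uniform. This is a minimal sketch that reproduces the JSON above; the helper name `property_value` and its `unit_text` parameter are hypothetical.

```python
import json


def property_value(name, value, unit_text=None):
    """Build one additionalProperty entry, wrapping the value when a unit is given."""
    if unit_text is not None:
        value = {"@type": "QuantitativeValue", "value": value, "unitText": unit_text}
    return {"@type": "PropertyValue", "name": name, "value": value}


custom = {
    "additionalProperty": [
        property_value("cloudCoverMedian", 12.3, unit_text="%"),
        property_value("tilingScheme", "MGRS"),
    ]
}
print(json.dumps(custom, indent=2))
```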

Use Case 9: Caching AI-Ready Transformations

Large geospatial datasets in formats like NetCDF and Zarr often require preprocessing before they are suitable for GPU-accelerated training. Caching these AI-ready transformations avoids repeated computation and optimises data throughput.

In this pattern, source data undergoes extraction, filtering, and transformation in advance. The result — an AI-ready Zarr store — is referenced via a standard Croissant cr:FileObject, while GeoCroissant metadata provides the spatial and temporal context. The mlcroissant library can then load these datasets as easily as a CSV file, allowing data scientists to focus on training and evaluation while the tooling handles data access at scale.

Extensions to the mlcroissant library are under development to support native Zarr payloads alongside existing format support.
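
The caching idea can be sketched independently of any particular array library. In the example below, a cache key is derived from the source location and the transformation parameters, and the transformation runs only when no cached output exists for that key. The names `cache_key` and `cached_transform`, the ".zarr" suffix on the target path, and the caller-supplied `transform` callable are all assumptions of this sketch; it is not the mlcroissant mechanism.

```python
import hashlib
import json
from pathlib import Path


def cache_key(source, params):
    """Derive a stable key from the source location and transformation parameters."""
    payload = json.dumps({"source": str(source), "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def cached_transform(source, params, cache_dir, transform):
    """Return the cached output path, running `transform` only on a cache miss.

    `transform(source, params, target)` is supplied by the caller and is
    expected to write the AI-ready output at `target`.
    """
    target = Path(cache_dir) / f"{cache_key(source, params)}.zarr"
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        transform(source, params, target)
    return target
```

In practice the `transform` callable might open a NetCDF file with xarray, apply the filtering steps, and write the result with `Dataset.to_zarr`; the returned path is what a cr:FileObject entry would then reference.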