Background & Motivation
Why Metadata Matters
Data without context is difficult to use responsibly. Metadata provides the background needed to understand where a dataset came from, what it represents, how it was collected, and under what terms it can be reused. Without it, datasets are frequently misunderstood, underused, or duplicated unnecessarily.
For geospatial and Earth observation data in particular, metadata is not optional — it is foundational. A pixel value means nothing without a coordinate reference system. A time series cannot be interpreted without knowing its temporal cadence. A burned-area mask is ambiguous without knowing the spatial resolution of the underlying imagery.
Consistently structured metadata enables:
- Discovery — search engines and catalogs can index and surface relevant datasets
- Interoperability — datasets from different sources can be aligned and compared
- Reproducibility — training runs can be documented and repeated reliably
- Responsible reuse — licensing, provenance, and bias information travel with the data
Initiatives like the Group on Earth Observations (GEO) Data Sharing Principles, the OECD Recommendation on Research Data, and the European Open Science Cloud (EOSC) all underscore the same point: transparent, well-documented data stewardship is a prerequisite for trustworthy science and AI.
Metadata for ML-Ready Datasets
Machine learning is only as good as the data behind it. Yet, the absence of consistent dataset descriptions has long been a barrier to progress in ML research. Without clear metadata, exploring a new dataset requires manual inspection, undocumented trial and error, and often direct contact with the original data producer.
The Croissant metadata format addresses this by providing a standardised, schema.org-based vocabulary for describing ML datasets. It streamlines how datasets are loaded into popular frameworks — PyTorch, TensorFlow, JAX — and brings structure to key aspects of the ML data lifecycle:
- Dataset attributes, splits, and file layouts
- Licensing and citation information
- Responsible AI (RAI) documentation via the Croissant RAI extension
Croissant also improves discoverability. When publishers create Croissant metadata and host it in compatible repositories, search engines can surface datasets that were previously buried or invisible.
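As a minimal sketch of what such a record looks like in practice, the snippet below assembles a Croissant-style dataset description with Python's standard library and serializes it as JSON-LD. Field names are drawn from the schema.org and Croissant vocabularies (`name`, `description`, `license`, `conformsTo`); this is an illustrative subset, not a complete, validated Croissant record.

```python
import json

# Minimal Croissant-style dataset description as a plain Python dict.
# Field names follow schema.org ("name", "description", "license") and
# the Croissant vocabulary ("conformsTo"). Illustrative subset only.
dataset = {
    "@context": {
        "@vocab": "https://schema.org/",
        "cr": "http://mlcommons.org/croissant/",
    },
    "@type": "Dataset",
    "name": "example-dataset",
    "description": "A small example dataset description.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "conformsTo": "http://mlcommons.org/croissant/1.1",
}

# Serialize to JSON-LD so catalogs and search engines can index it.
jsonld = json.dumps(dataset, indent=2)
print(jsonld)
```

Because the result is ordinary JSON-LD, it can be hosted alongside the data files and consumed by any tool that understands schema.org vocabulary.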
Geospatial Artificial Intelligence (GeoAI)
Geospatial AI applies machine learning and deep learning to location-based data — satellite imagery, airborne sensors, in-situ measurements, simulation outputs — for tasks like climate modelling, disaster response, urban planning, and crop yield prediction.
GeoAI datasets introduce complexity that generic ML datasets simply don’t have:
- Location is critical. Geolocation errors or coarse spatial annotations directly degrade model predictions.
- Sampling matters. Petabyte-scale datasets require careful curation to avoid class imbalance and ensure fair geographic representation.
- Data has a shelf life. Temporally stale data reduces model relevance and generalisation.
- Cloud-first access. Centralised, cloud-optimised storage is essential for large-scale training and reproducible collaboration.
- End-to-end pipeline support. Metadata-rich formats allow seamless ingestion across the full AI workflow.
Good GeoAI starts with good metadata — and that is precisely the gap that GeoCroissant is designed to fill.
Croissant and GeoAI Datasets
Croissant provides a strong foundation for ML dataset metadata, but it was not designed with the specific demands of Earth observation in mind. The table below illustrates where the gaps lie and how GeoCroissant closes them.
| Geospatial Dataset Type | Challenge in GeoAI | How GeoCroissant Helps |
|---|---|---|
| EO imagery (multi-band; optical/SAR) | Band semantics and sensor-specific acquisition parameters | Standardised sensor and band descriptors, ML task metadata |
| Spatiotemporal datasets (time series, in-situ, simulations) | Time indexing and spatiotemporal coverage consistency | Temporal modelling and time-series index support |
| Complex geo formats (NetCDF, HDF5, Zarr) | Nested variables, chunking, multiple assets per logical sample | Clear mapping from raw containers to AI-ready datasets |
| Mixed geometry data (vector, raster, point clouds) | Heterogeneous geometry types and spatial reference handling | Uniform spatial semantics and query support |
| Human-labelled / crowdsourced GeoAI datasets | Sampling choices and spatial representativeness affect outcomes | Explicit provenance, spatial bias, and sampling documentation |
Sample GeoCroissant Metadata
Below is a representative GeoCroissant metadata record for the HLS Burn Scars dataset on Hugging Face. It demonstrates how key geospatial and ML descriptors come together in a single, machine-readable document.
```json
{
  "@context": {
    "@language": "en",
    "@vocab": "https://schema.org/",
    "cr": "http://mlcommons.org/croissant/",
    "dct": "http://purl.org/dc/terms/",
    "geocr": "http://mlcommons.org/croissant/geo/",
    "sc": "https://schema.org/",
    "conformsTo": "dct:conformsTo",
    "citeAs": "cr:citeAs",
    "recordSet": "cr:recordSet",
    "field": "cr:field",
    "dataType": { "@id": "cr:dataType", "@type": "@vocab" },
    "source": "cr:source",
    "extract": "cr:extract",
    "fileSet": "cr:fileSet",
    "fileProperty": "cr:fileProperty",
    "includes": "cr:includes",
    "key": "cr:key",
    "arrayShape": "cr:arrayShape"
  },
  "@type": "Dataset",
  "name": "GeoCroissant Example: HLS Burn Scars",
  "description": "Minimal GeoCroissant example illustrating CRS, spatial/temporal coverage, spatial resolution, band configuration, per-band spectral metadata, and RecordSets for images and masks.",
  "datePublished": "2024-01-01",
  "version": "1.0",
  "conformsTo": [
    "http://mlcommons.org/croissant/1.1",
    "http://mlcommons.org/croissant/geo/1.0"
  ],
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "spatialCoverage": {
    "@type": "Place",
    "geo": { "@type": "GeoShape", "box": "24.0 -125.0 49.0 -66.0" }
  },
  "temporalCoverage": "2018-01-01/2021-12-31",
  "geocr:coordinateReferenceSystem": "EPSG:4326",
  "geocr:spatialResolution": { "@type": "QuantitativeValue", "value": 30, "unitText": "m" },
  "geocr:bandConfiguration": {
    "@type": "geocr:BandConfiguration",
    "geocr:totalBands": 6,
    "geocr:bandNameList": ["Blue", "Green", "Red", "NIR", "SW1", "SW2"]
  },
  "geocr:spectralBandMetadata": [
    {
      "@type": "geocr:SpectralBand",
      "name": "Blue",
      "geocr:centerWavelength": { "@type": "QuantitativeValue", "value": 490, "unitText": "nm" },
      "geocr:bandwidth": { "@type": "QuantitativeValue", "value": 98, "unitText": "nm" }
    },
    {
      "@type": "geocr:SpectralBand",
      "name": "NIR",
      "geocr:centerWavelength": { "@type": "QuantitativeValue", "value": 865, "unitText": "nm" }
    }
  ]
}
```
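Because the record is plain JSON-LD, its geospatial descriptors can be read back with any JSON parser. The sketch below (standard library only) operates on an excerpt of the record above, reduced to the `geocr:` fields, and checks that the declared band count matches the band name list:

```python
import json

# Excerpt of the GeoCroissant record above, reduced to its geospatial fields.
record = json.loads("""
{
  "geocr:coordinateReferenceSystem": "EPSG:4326",
  "geocr:spatialResolution": {"@type": "QuantitativeValue", "value": 30, "unitText": "m"},
  "geocr:bandConfiguration": {
    "@type": "geocr:BandConfiguration",
    "geocr:totalBands": 6,
    "geocr:bandNameList": ["Blue", "Green", "Red", "NIR", "SW1", "SW2"]
  }
}
""")

crs = record["geocr:coordinateReferenceSystem"]
resolution = record["geocr:spatialResolution"]
bands = record["geocr:bandConfiguration"]

# Basic consistency check: declared band count matches the band name list.
assert bands["geocr:totalBands"] == len(bands["geocr:bandNameList"])

print(f"CRS: {crs}")
print(f"Resolution: {resolution['value']} {resolution['unitText']}")
print(f"Bands: {', '.join(bands['geocr:bandNameList'])}")
```

A data loader or catalog indexer can use the same handful of lines to decide, before downloading a single pixel, whether a dataset's projection, resolution, and bands match what a model expects.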