Discovering data

This notebook shows how to find out what data is available locally as well as on ESGF. It also shows how to download the data from ESGF.

from esmvalcore.config import CFG
from esmvalcore.dataset import Dataset

Configure ESMValCore so that it always searches ESGF for data:

CFG["search_data"] = "complete"
CFG["projects"].pop("CMIP6", None)  # Clear existing CMIP6 configuration
CFG.nested_update(
    {
        "projects": {
            "CMIP6": {
                "data": {
                    "intake-esgf": {
                        "type": "esmvalcore.io.intake_esgf.IntakeESGFDataSource",
                        "priority": 2,
                        "facets": {
                            "activity": "activity_drs",
                            "dataset": "source_id",
                            "ensemble": "member_id",
                            "exp": "experiment_id",
                            "institute": "institution_id",
                            "grid": "grid_label",
                            "mip": "table_id",
                            "project": "project",
                            "short_name": "variable_id",
                        },
                    },
                },
            },
        },
    },
)
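Why the existing CMIP6 entry is cleared before calling nested_update can be illustrated with a small stand-alone sketch. The nested_update function below is a hypothetical recursive merge written for this illustration, not ESMValCore's implementation, and the pre-existing "local" data source is an assumed example.

```python
from copy import deepcopy


def nested_update(base: dict, new: dict) -> dict:
    """Return a copy of `base` with `new` merged in recursively (illustrative only)."""
    result = deepcopy(base)
    for key, value in new.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            # Both sides are dicts: merge them instead of replacing.
            result[key] = nested_update(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


# Hypothetical pre-existing configuration with a local data source.
config = {"projects": {"CMIP6": {"data": {"local": {"priority": 1}}}}}
update = {"projects": {"CMIP6": {"data": {"intake-esgf": {"priority": 2}}}}}

# Without clearing first, the old "local" entry survives the merge.
merged = nested_update(config, update)
print(sorted(merged["projects"]["CMIP6"]["data"]))

# Clearing the CMIP6 entry first leaves only the new data source.
config["projects"].pop("CMIP6", None)
cleared = nested_update(config, update)
print(sorted(cleared["projects"]["CMIP6"]["data"]))
```

Under these assumptions, a recursive merge keeps any data sources that were already configured, so removing the CMIP6 entry first ensures that only the intake-esgf source is used.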

We define a dataset template to search for all CMIP6 datasets that provide surface air temperature (tas) at monthly resolution for the historical experiment. Note that ESMValCore uses its own facet names so that naming is uniform across CMIP phases and other projects. The mapping to the facet names used on ESGF can be found in esmvalcore.esgf.facets.FACETS.

dataset_template = Dataset(
    short_name="tas",
    mip="Amon",
    project="CMIP6",
    exp="historical",
    dataset="*",
    institute="*",
    ensemble="*",
    grid="*",
)
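As a rough illustration of the facet-name mapping, the sketch below translates the template's facets into ESGF facet names using the mapping from the configuration above. The helper to_esgf_facets is hypothetical, and dropping wildcard facets from the query is an assumption made for this example; it is not taken from ESMValCore.

```python
# Mapping copied from the "facets" section of the configuration above:
# ESMValCore facet name -> ESGF facet name.
FACET_MAP = {
    "activity": "activity_drs",
    "dataset": "source_id",
    "ensemble": "member_id",
    "exp": "experiment_id",
    "institute": "institution_id",
    "grid": "grid_label",
    "mip": "table_id",
    "project": "project",
    "short_name": "variable_id",
}


def to_esgf_facets(facets: dict) -> dict:
    """Translate ESMValCore facets to ESGF names (hypothetical helper).

    Assumption: a "*" wildcard means "unconstrained", so it is left out.
    """
    return {
        FACET_MAP[name]: value
        for name, value in facets.items()
        if name in FACET_MAP and value != "*"
    }


template_facets = {
    "short_name": "tas",
    "mip": "Amon",
    "project": "CMIP6",
    "exp": "historical",
    "dataset": "*",
    "institute": "*",
    "ensemble": "*",
    "grid": "*",
}
print(to_esgf_facets(template_facets))
```

Only the four constrained facets remain in the translated query; the wildcard facets are the ones filled in by the search results.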

Next, we use the Dataset.from_files method to build a list of datasets from the available files. This may take a while, because searching ESGF for many files is slow. Search results are cached for a configurable duration, so subsequent searches are faster.

datasets = list(dataset_template.from_files())
print(f"Found {len(datasets)} datasets, showing the first 10:")
datasets[:10]
Found 906 datasets, showing the first 10:
[Dataset:
 {'dataset': 'TaiESM1',
  'project': 'CMIP6',
  'mip': 'Amon',
  'short_name': 'tas',
  'ensemble': 'r1i1p1f1',
  'exp': 'historical',
  'grid': 'gn',
  'institute': 'AS-RCEC'},
 Dataset:
 {'dataset': 'TaiESM1',
  'project': 'CMIP6',
  'mip': 'Amon',
  'short_name': 'tas',
  'ensemble': 'r2i1p1f1',
  'exp': 'historical',
  'grid': 'gn',
  'institute': 'AS-RCEC'},
 Dataset:
 {'dataset': 'AWI-CM-1-1-MR',
  'project': 'CMIP6',
  'mip': 'Amon',
  'short_name': 'tas',
  'ensemble': 'r1i1p1f1',
  'exp': 'historical',
  'grid': 'gn',
  'institute': 'AWI'},
 Dataset:
 {'dataset': 'AWI-CM-1-1-MR',
  'project': 'CMIP6',
  'mip': 'Amon',
  'short_name': 'tas',
  'ensemble': 'r2i1p1f1',
  'exp': 'historical',
  'grid': 'gn',
  'institute': 'AWI'},
 Dataset:
 {'dataset': 'AWI-CM-1-1-MR',
  'project': 'CMIP6',
  'mip': 'Amon',
  'short_name': 'tas',
  'ensemble': 'r3i1p1f1',
  'exp': 'historical',
  'grid': 'gn',
  'institute': 'AWI'},
 Dataset:
 {'dataset': 'AWI-CM-1-1-MR',
  'project': 'CMIP6',
  'mip': 'Amon',
  'short_name': 'tas',
  'ensemble': 'r4i1p1f1',
  'exp': 'historical',
  'grid': 'gn',
  'institute': 'AWI'},
 Dataset:
 {'dataset': 'AWI-CM-1-1-MR',
  'project': 'CMIP6',
  'mip': 'Amon',
  'short_name': 'tas',
  'ensemble': 'r5i1p1f1',
  'exp': 'historical',
  'grid': 'gn',
  'institute': 'AWI'},
 Dataset:
 {'dataset': 'AWI-ESM-1-1-LR',
  'project': 'CMIP6',
  'mip': 'Amon',
  'short_name': 'tas',
  'ensemble': 'r1i1p1f1',
  'exp': 'historical',
  'grid': 'gn',
  'institute': 'AWI'},
 Dataset:
 {'dataset': 'BCC-CSM2-MR',
  'project': 'CMIP6',
  'mip': 'Amon',
  'short_name': 'tas',
  'ensemble': 'r1i1p1f1',
  'exp': 'historical',
  'grid': 'gn',
  'institute': 'BCC'},
 Dataset:
 {'dataset': 'BCC-CSM2-MR',
  'project': 'CMIP6',
  'mip': 'Amon',
  'short_name': 'tas',
  'ensemble': 'r2i1p1f1',
  'exp': 'historical',
  'grid': 'gn',
  'institute': 'BCC'}]
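A list like this is often easier to read once it is summarised per model. The sketch below works on plain (dataset, ensemble) pairs copied from the ten entries printed above, so it runs without ESMValCore; with real Dataset objects the same values should be available by indexing, for example d["dataset"], in the same way that facets are indexed elsewhere in this notebook.

```python
from collections import Counter
from fnmatch import fnmatchcase

# (dataset, ensemble) pairs copied from the ten entries printed above.
found = [
    ("TaiESM1", "r1i1p1f1"),
    ("TaiESM1", "r2i1p1f1"),
    ("AWI-CM-1-1-MR", "r1i1p1f1"),
    ("AWI-CM-1-1-MR", "r2i1p1f1"),
    ("AWI-CM-1-1-MR", "r3i1p1f1"),
    ("AWI-CM-1-1-MR", "r4i1p1f1"),
    ("AWI-CM-1-1-MR", "r5i1p1f1"),
    ("AWI-ESM-1-1-LR", "r1i1p1f1"),
    ("BCC-CSM2-MR", "r1i1p1f1"),
    ("BCC-CSM2-MR", "r2i1p1f1"),
]

# Count how many ensemble members were found for each model.
members_per_model = Counter(model for model, _ in found)
for model, count in members_per_model.items():
    print(f"{model}: {count} ensemble member(s)")

# Select models by a shell-style pattern, similar in spirit to the
# "*" wildcards used in the dataset template.
awi_models = sorted({model for model, _ in found if fnmatchcase(model, "AWI-*")})
print(awi_models)
```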

Let’s look at the first dataset in more detail. We can print the facets describing the dataset:

dataset = datasets[0]
dataset
Dataset:
{'dataset': 'TaiESM1',
 'project': 'CMIP6',
 'mip': 'Amon',
 'short_name': 'tas',
 'ensemble': 'r1i1p1f1',
 'exp': 'historical',
 'grid': 'gn',
 'institute': 'AS-RCEC'}

and see what files are available:

dataset.files
[IntakeESGFDataset(name='CMIP6.CMIP.AS-RCEC.TaiESM1.historical.r1i1p1f1.Amon.tas.gn')]

Load a single file as an iris.cube.CubeList:

cubes = dataset.files[0].to_iris()
cubes

Air Temperature (K) time latitude longitude
Shape 1980 192 288
Dimension coordinates
time x - -
latitude - x -
longitude - - x
Scalar coordinates
height 2.0 m
Cell methods
0 area: time: mean
Attributes
Conventions 'CF-1.7 CMIP-6.2'
activity_drs 'CMIP'
activity_id 'CMIP'
branch_method 'Hybrid-restart from year 0671-01-01 of piControl'
branch_time np.float64(0.0)
branch_time_in_child np.float64(0.0)
branch_time_in_parent np.float64(171550.0)
cmor_version '3.5.0'
comment 'near-surface (usually, 2 meter) air temperature'
contact 'Dr. Wei-Liang Lee (leelupin@gate.sinica.edu.tw)'
creation_date '2020-06-08T08:53:23Z'
data_specs_version '01.00.31'
experiment 'all-forcing simulation of the recent past'
experiment_id 'historical'
external_variables 'areacella'
forcing_index np.int32(1)
frequency 'mon'
further_info_url 'https://furtherinfo.es-doc.org/CMIP6.AS-RCEC.TaiESM1.historical.none.r ...'
grid 'finite-volume grid with 0.9x1.25 degree lat/lon resolution'
grid_label 'gn'
history "2020-06-08T08:53:23Z altered by CMOR: Treated scalar dimension: 'height'. ..."
initialization_index np.int32(1)
institution 'Research Center for Environmental Changes, Academia Sinica, Nankang, Taipei ...'
institution_id 'AS-RCEC'
license 'CMIP6 model data produced by NCC is licensed under a Creative Commons Attribution ...'
member_id 'r1i1p1f1'
mip_era 'CMIP6'
model_id 'TaiESM1'
nominal_resolution '100 km'
original_name 'TREFHT'
parent_activity_id 'CMIP'
parent_experiment_id 'piControl'
parent_mip_era 'CMIP6'
parent_source_id 'TaiESM1'
parent_sub_experiment_id 'none'
parent_time_units 'days since 1850-01-01'
parent_variant_label 'r1i1p1f1'
physics_index np.int32(1)
product 'model-output'
realization_index np.int32(1)
realm 'atmos'
references '10.5194/gmd-2019-377'
run_variant 'N/A'
source 'TaiESM 1.0 (2018): \naerosol: SNAP (same grid as atmos)\natmos: TaiAM1 ...'
source_file '/mnt/d/data/CMIP6/CMIP/AS-RCEC/TaiESM1/historical/r1i1p1f1/Amon/tas/gn ...'
source_id 'TaiESM1'
source_type 'AOGCM AER BGC'
sub_experiment 'none'
sub_experiment_id 'none'
table_id 'Amon'
table_info 'Creation Date:(24 July 2019) MD5:0bb394a356ef9d214d027f1aca45853e'
title 'TaiESM1 output prepared for CMIP6'
tracking_id 'hdl:21.14100/997cf563-6411-4a78-a9c4-7369ae27d698'
variable_id 'tas'
variant_label 'r1i1p1f1'

Dataset.from_files also handles derived variables:

dataset_template = Dataset(
    short_name="lwcre",
    mip="Amon",
    project="CMIP6",
    exp="historical",
    dataset="*",
    institute="*",
    ensemble="r1i1p1f1",
    grid="gn",
    derive=True,
    force_derivation=True,
)
datasets = list(dataset_template.from_files())
print(f"Found {len(datasets)} datasets, showing the first 10:")
datasets[:10]
Found 37 datasets, showing the first 10:
[Dataset:
 {'dataset': 'GISS-E2-2-G',
  'project': 'CMIP6',
  'mip': 'Amon',
  'short_name': 'lwcre',
  'derive': True,
  'ensemble': 'r1i1p1f1',
  'exp': 'historical',
  'force_derivation': True,
  'grid': 'gn',
  'institute': 'NASA-GISS'},
 Dataset:
 {'dataset': 'FGOALS-g3',
  'project': 'CMIP6',
  'mip': 'Amon',
  'short_name': 'lwcre',
  'derive': True,
  'ensemble': 'r1i1p1f1',
  'exp': 'historical',
  'force_derivation': True,
  'grid': 'gn',
  'institute': 'CAS'},
 Dataset:
 {'dataset': 'CESM2-WACCM-FV2',
  'project': 'CMIP6',
  'mip': 'Amon',
  'short_name': 'lwcre',
  'derive': True,
  'ensemble': 'r1i1p1f1',
  'exp': 'historical',
  'force_derivation': True,
  'grid': 'gn',
  'institute': 'NCAR'},
 Dataset:
 {'dataset': 'GISS-E2-1-H',
  'project': 'CMIP6',
  'mip': 'Amon',
  'short_name': 'lwcre',
  'derive': True,
  'ensemble': 'r1i1p1f1',
  'exp': 'historical',
  'force_derivation': True,
  'grid': 'gn',
  'institute': 'NASA-GISS'},
 Dataset:
 {'dataset': 'BCC-CSM2-MR',
  'project': 'CMIP6',
  'mip': 'Amon',
  'short_name': 'lwcre',
  'derive': True,
  'ensemble': 'r1i1p1f1',
  'exp': 'historical',
  'force_derivation': True,
  'grid': 'gn',
  'institute': 'BCC'},
 Dataset:
 {'dataset': 'CAS-ESM2-0',
  'project': 'CMIP6',
  'mip': 'Amon',
  'short_name': 'lwcre',
  'derive': True,
  'ensemble': 'r1i1p1f1',
  'exp': 'historical',
  'force_derivation': True,
  'grid': 'gn',
  'institute': 'CAS'},
 Dataset:
 {'dataset': 'MPI-ESM-1-2-HAM',
  'project': 'CMIP6',
  'mip': 'Amon',
  'short_name': 'lwcre',
  'derive': True,
  'ensemble': 'r1i1p1f1',
  'exp': 'historical',
  'force_derivation': True,
  'grid': 'gn',
  'institute': 'HAMMOZ-Consortium'},
 Dataset:
 {'dataset': 'CESM2-FV2',
  'project': 'CMIP6',
  'mip': 'Amon',
  'short_name': 'lwcre',
  'derive': True,
  'ensemble': 'r1i1p1f1',
  'exp': 'historical',
  'force_derivation': True,
  'grid': 'gn',
  'institute': 'NCAR'},
 Dataset:
 {'dataset': 'BCC-ESM1',
  'project': 'CMIP6',
  'mip': 'Amon',
  'short_name': 'lwcre',
  'derive': True,
  'ensemble': 'r1i1p1f1',
  'exp': 'historical',
  'force_derivation': True,
  'grid': 'gn',
  'institute': 'BCC'},
 Dataset:
 {'dataset': 'ICON-ESM-LR',
  'project': 'CMIP6',
  'mip': 'Amon',
  'short_name': 'lwcre',
  'derive': True,
  'ensemble': 'r1i1p1f1',
  'exp': 'historical',
  'force_derivation': True,
  'grid': 'gn',
  'institute': 'MPI-M'}]

The facet force_derivation=True forces the variable to be derived, even if files that provide lwcre directly are available. If it is omitted and such files are present, only the datasets with those files are returned.
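The rule described above can be summarised as a small decision function. This is a sketch of the behaviour as described in this notebook, not ESMValCore's code; the function name and its arguments are hypothetical.

```python
def needs_derivation(
    derive: bool,
    force_derivation: bool,
    direct_files_found: bool,
) -> bool:
    """Decide whether a variable has to be derived (illustrative only)."""
    if not derive:
        # Derivation was not requested at all.
        return False
    # Derive when forced, or when no files provide the variable directly.
    return force_derivation or not direct_files_found


# lwcre files exist, derivation not forced: use the files as they are.
print(needs_derivation(derive=True, force_derivation=False, direct_files_found=True))
# lwcre files exist, but derivation is forced.
print(needs_derivation(derive=True, force_derivation=True, direct_files_found=True))
# No lwcre files: derivation is the only option.
print(needs_derivation(derive=True, force_derivation=False, direct_files_found=False))
```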

If variable derivation is necessary (this is always the case when force_derivation=True is used), the files attribute of the datasets may be empty. In this case, the files of the input variables needed for the derivation can be accessed via the Dataset.required_datasets attribute:

dataset = datasets[0]
dataset.files
[]
for d in dataset.required_datasets:
    print(d["short_name"])
    print(d.files)
rlut
[IntakeESGFDataset(name='CMIP6.CMIP.NASA-GISS.GISS-E2-2-G.historical.r1i1p1f1.Amon.rlut.gn')]
rlutcs
[IntakeESGFDataset(name='CMIP6.CMIP.NASA-GISS.GISS-E2-2-G.historical.r1i1p1f1.Amon.rlutcs.gn')]
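To show why these two input variables are needed, the sketch below computes lwcre from them. It assumes the conventional definition of the longwave cloud radiative effect, clear-sky minus all-sky outgoing longwave radiation (rlutcs - rlut); that definition is not stated in this notebook, and the flux values are made up for illustration rather than taken from any dataset.

```python
def longwave_cloud_radiative_effect(rlut, rlutcs):
    """Compute lwcre = rlutcs - rlut element-wise (assumed definition)."""
    if len(rlut) != len(rlutcs):
        raise ValueError("rlut and rlutcs must have the same length")
    return [clear_sky - all_sky for all_sky, clear_sky in zip(rlut, rlutcs)]


# Made-up outgoing longwave fluxes in W m-2, for illustration only.
rlut = [240.0, 235.5, 228.0]    # all-sky
rlutcs = [265.0, 262.5, 259.0]  # clear-sky

lwcre = longwave_cloud_radiative_effect(rlut, rlutcs)
print(lwcre)
```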