{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "bd168fbd-f5e8-4b32-906f-5c658b9758a0",
   "metadata": {},
   "source": [
    "# Discovering data\n",
    "\n",
    "This notebook shows how to find out which data is available locally and on ESGF, and how to download data from ESGF."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f0ccfe7f-c535-4606-99ce-be24960aece1",
   "metadata": {},
   "outputs": [],
   "source": [
    "from esmvalcore.config import CFG\n",
    "from esmvalcore.dataset import Dataset\n",
    "from esmvalcore.esgf import download"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "f4374495-19c4-4c3b-9fac-d929a5e595ad",
   "metadata": {},
   "source": [
    "Configure ESMValCore so that it always searches ESGF for data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "5d2711ea-6738-4a82-97b1-bc7d1212098a",
   "metadata": {},
   "outputs": [],
   "source": [
    "CFG[\"search_esgf\"] = \"always\""
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "aea7a272-7d26-44d9-8766-379379e5d152",
   "metadata": {},
   "source": [
    "We define a dataset template to search for all CMIP6 datasets that provide surface air temperature (`tas`) at monthly resolution for the historical experiment. Note that ESMValCore uses its own facet names so that naming is uniform across CMIP phases and other projects. The mapping to the facet names used on ESGF can be found in [esmvalcore.esgf.facets.FACETS](https://docs.esmvaltool.org/projects/esmvalcore/en/latest/api/esmvalcore.esgf.html#esmvalcore.esgf.facets.FACETS)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "23c26e29-ea87-40d7-a962-85a06fc77221",
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_template = Dataset(\n",
    "    short_name=\"tas\",\n",
    "    mip=\"Amon\",\n",
    "    project=\"CMIP6\",\n",
    "    exp=\"historical\",\n",
    "    dataset=\"*\",\n",
    "    institute=\"*\",\n",
    "    ensemble=\"*\",\n",
    "    grid=\"*\",\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "baf29fbb-eed5-47bd-8805-c27ad34b0539",
   "metadata": {},
   "source": [
    "Next, we use the `Dataset.from_files` method to build a list of datasets from the available files. This may take a while, because searching ESGF for many files is slow. The search results are cached for a [configurable duration](https://docs.esmvaltool.org/projects/esmvalcore/en/latest/quickstart/configure.html#esgf-configuration), so subsequent searches will be faster."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "d657320b-25c7-48f3-bfe1-5f3b94d7b789",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found 727 datasets, showing the first 10:\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[Dataset:\n",
       " {'dataset': 'TaiESM1',\n",
       "  'project': 'CMIP6',\n",
       "  'mip': 'Amon',\n",
       "  'short_name': 'tas',\n",
       "  'ensemble': 'r1i1p1f1',\n",
       "  'exp': 'historical',\n",
       "  'grid': 'gn',\n",
       "  'institute': 'AS-RCEC'},\n",
       " Dataset:\n",
       " {'dataset': 'TaiESM1',\n",
       "  'project': 'CMIP6',\n",
       "  'mip': 'Amon',\n",
       "  'short_name': 'tas',\n",
       "  'ensemble': 'r2i1p1f1',\n",
       "  'exp': 'historical',\n",
       "  'grid': 'gn',\n",
       "  'institute': 'AS-RCEC'},\n",
       " Dataset:\n",
       " {'dataset': 'AWI-CM-1-1-MR',\n",
       "  'project': 'CMIP6',\n",
       "  'mip': 'Amon',\n",
       "  'short_name': 'tas',\n",
       "  'ensemble': 'r1i1p1f1',\n",
       "  'exp': 'historical',\n",
       "  'grid': 'gn',\n",
       "  'institute': 'AWI'},\n",
       " Dataset:\n",
       " {'dataset': 'AWI-CM-1-1-MR',\n",
       "  'project': 'CMIP6',\n",
       "  'mip': 'Amon',\n",
       "  'short_name': 'tas',\n",
       "  'ensemble': 'r2i1p1f1',\n",
       "  'exp': 'historical',\n",
       "  'grid': 'gn',\n",
       "  'institute': 'AWI'},\n",
       " Dataset:\n",
       " {'dataset': 'AWI-CM-1-1-MR',\n",
       "  'project': 'CMIP6',\n",
       "  'mip': 'Amon',\n",
       "  'short_name': 'tas',\n",
       "  'ensemble': 'r3i1p1f1',\n",
       "  'exp': 'historical',\n",
       "  'grid': 'gn',\n",
       "  'institute': 'AWI'},\n",
       " Dataset:\n",
       " {'dataset': 'AWI-CM-1-1-MR',\n",
       "  'project': 'CMIP6',\n",
       "  'mip': 'Amon',\n",
       "  'short_name': 'tas',\n",
       "  'ensemble': 'r4i1p1f1',\n",
       "  'exp': 'historical',\n",
       "  'grid': 'gn',\n",
       "  'institute': 'AWI'},\n",
       " Dataset:\n",
       " {'dataset': 'AWI-CM-1-1-MR',\n",
       "  'project': 'CMIP6',\n",
       "  'mip': 'Amon',\n",
       "  'short_name': 'tas',\n",
       "  'ensemble': 'r5i1p1f1',\n",
       "  'exp': 'historical',\n",
       "  'grid': 'gn',\n",
       "  'institute': 'AWI'},\n",
       " Dataset:\n",
       " {'dataset': 'AWI-ESM-1-1-LR',\n",
       "  'project': 'CMIP6',\n",
       "  'mip': 'Amon',\n",
       "  'short_name': 'tas',\n",
       "  'ensemble': 'r1i1p1f1',\n",
       "  'exp': 'historical',\n",
       "  'grid': 'gn',\n",
       "  'institute': 'AWI'},\n",
       " Dataset:\n",
       " {'dataset': 'AWI-ESM-1-REcoM',\n",
       "  'project': 'CMIP6',\n",
       "  'mip': 'Amon',\n",
       "  'short_name': 'tas',\n",
       "  'ensemble': 'r1i1p1f1',\n",
       "  'exp': 'historical',\n",
       "  'grid': 'gn',\n",
       "  'institute': 'AWI'},\n",
       " Dataset:\n",
       " {'dataset': 'BCC-CSM2-MR',\n",
       "  'project': 'CMIP6',\n",
       "  'mip': 'Amon',\n",
       "  'short_name': 'tas',\n",
       "  'ensemble': 'r1i1p1f1',\n",
       "  'exp': 'historical',\n",
       "  'grid': 'gn',\n",
       "  'institute': 'BCC'}]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "datasets = list(dataset_template.from_files())\n",
    "print(f\"Found {len(datasets)} datasets, showing the first 10:\")\n",
    "datasets[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3f88a30e-9dcd-431d-b469-3efd367795de",
   "metadata": {},
   "source": [
    "Let's look at the first dataset in more detail. We can print the facets describing the dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "8216affd-fa1a-4499-abd5-7ec836d14fd6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dataset:\n",
       "{'dataset': 'TaiESM1',\n",
       " 'project': 'CMIP6',\n",
       " 'mip': 'Amon',\n",
       " 'short_name': 'tas',\n",
       " 'ensemble': 'r1i1p1f1',\n",
       " 'exp': 'historical',\n",
       " 'grid': 'gn',\n",
       " 'institute': 'AS-RCEC'}"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset = datasets[0]\n",
    "dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4792d842-0476-48b4-97ae-eca16da09c42",
   "metadata": {},
   "source": [
    "We can also see which files are available:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "2cb3a047-bbaf-415d-bc0b-44bf473d858d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[ESGFFile:CMIP6/CMIP/AS-RCEC/TaiESM1/historical/r1i1p1f1/Amon/tas/gn/v20200623/tas_Amon_TaiESM1_historical_r1i1p1f1_gn_185001-201412.nc on hosts ['esgf.ceda.ac.uk', 'esgf.rcec.sinica.edu.tw', 'esgf3.dkrz.de', 'esgf3.dkrz.de']]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset.files"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "60d88a34-c886-4b9d-a9e9-a9d18fa97917",
   "metadata": {},
   "source": [
    "A single file can be downloaded using its `download` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "cdb2b043-c2f4-4cd2-b2d3-d75bb28571a2",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LocalFile('/home/manuel/climate_data/CMIP6/CMIP/AS-RCEC/TaiESM1/historical/r1i1p1f1/Amon/tas/gn/v20200623/tas_Amon_TaiESM1_historical_r1i1p1f1_gn_185001-201412.nc')"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset.files[0].download(CFG[\"download_dir\"])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "3821b594-3797-497b-a51d-1798d5b2fc80",
   "metadata": {},
   "source": [
    "To download many files, the [esmvalcore.esgf.download](https://docs.esmvaltool.org/projects/esmvalcore/en/latest/api/esmvalcore.esgf.html#esmvalcore.esgf.download) function is recommended because it downloads the files in parallel. ESMValCore tries to guess the fastest host and downloads from there. If that host is unavailable, it automatically falls back to the next host."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "9676ff81-232e-4ff8-b784-686f0d06c469",
   "metadata": {},
   "outputs": [],
   "source": [
    "download(dataset.files, CFG[\"download_dir\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d3006d90",
   "metadata": {},
   "source": [
    "`Dataset.from_files` can also handle derived variables:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "b75314e3",
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_template = Dataset(\n",
    "    short_name=\"lwcre\",\n",
    "    mip=\"Amon\",\n",
    "    project=\"CMIP6\",\n",
    "    exp=\"historical\",\n",
    "    dataset=\"*\",\n",
    "    institute=\"*\",\n",
    "    ensemble=\"r1i1p1f1\",\n",
    "    grid=\"gn\",\n",
    "    derive=True,\n",
    "    force_derivation=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "b87c247f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found 36 datasets, showing the first 10:\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[Dataset:\n",
       " {'dataset': 'TaiESM1',\n",
       "  'project': 'CMIP6',\n",
       "  'mip': 'Amon',\n",
       "  'short_name': 'lwcre',\n",
       "  'derive': True,\n",
       "  'ensemble': 'r1i1p1f1',\n",
       "  'exp': 'historical',\n",
       "  'force_derivation': True,\n",
       "  'grid': 'gn',\n",
       "  'institute': 'AS-RCEC'},\n",
       " Dataset:\n",
       " {'dataset': 'AWI-CM-1-1-MR',\n",
       "  'project': 'CMIP6',\n",
       "  'mip': 'Amon',\n",
       "  'short_name': 'lwcre',\n",
       "  'derive': True,\n",
       "  'ensemble': 'r1i1p1f1',\n",
       "  'exp': 'historical',\n",
       "  'force_derivation': True,\n",
       "  'grid': 'gn',\n",
       "  'institute': 'AWI'},\n",
       " Dataset:\n",
       " {'dataset': 'AWI-ESM-1-1-LR',\n",
       "  'project': 'CMIP6',\n",
       "  'mip': 'Amon',\n",
       "  'short_name': 'lwcre',\n",
       "  'derive': True,\n",
       "  'ensemble': 'r1i1p1f1',\n",
       "  'exp': 'historical',\n",
       "  'force_derivation': True,\n",
       "  'grid': 'gn',\n",
       "  'institute': 'AWI'},\n",
       " Dataset:\n",
       " {'dataset': 'AWI-ESM-1-REcoM',\n",
       "  'project': 'CMIP6',\n",
       "  'mip': 'Amon',\n",
       "  'short_name': 'lwcre',\n",
       "  'derive': True,\n",
       "  'ensemble': 'r1i1p1f1',\n",
       "  'exp': 'historical',\n",
       "  'force_derivation': True,\n",
       "  'grid': 'gn',\n",
       "  'institute': 'AWI'},\n",
       " Dataset:\n",
       " {'dataset': 'BCC-CSM2-MR',\n",
       "  'project': 'CMIP6',\n",
       "  'mip': 'Amon',\n",
       "  'short_name': 'lwcre',\n",
       "  'derive': True,\n",
       "  'ensemble': 'r1i1p1f1',\n",
       "  'exp': 'historical',\n",
       "  'force_derivation': True,\n",
       "  'grid': 'gn',\n",
       "  'institute': 'BCC'},\n",
       " Dataset:\n",
       " {'dataset': 'BCC-ESM1',\n",
       "  'project': 'CMIP6',\n",
       "  'mip': 'Amon',\n",
       "  'short_name': 'lwcre',\n",
       "  'derive': True,\n",
       "  'ensemble': 'r1i1p1f1',\n",
       "  'exp': 'historical',\n",
       "  'force_derivation': True,\n",
       "  'grid': 'gn',\n",
       "  'institute': 'BCC'},\n",
       " Dataset:\n",
       " {'dataset': 'CAMS-CSM1-0',\n",
       "  'project': 'CMIP6',\n",
       "  'mip': 'Amon',\n",
       "  'short_name': 'lwcre',\n",
       "  'derive': True,\n",
       "  'ensemble': 'r1i1p1f1',\n",
       "  'exp': 'historical',\n",
       "  'force_derivation': True,\n",
       "  'grid': 'gn',\n",
       "  'institute': 'CAMS'},\n",
       " Dataset:\n",
       " {'dataset': 'CAS-ESM2-0',\n",
       "  'project': 'CMIP6',\n",
       "  'mip': 'Amon',\n",
       "  'short_name': 'lwcre',\n",
       "  'derive': True,\n",
       "  'ensemble': 'r1i1p1f1',\n",
       "  'exp': 'historical',\n",
       "  'force_derivation': True,\n",
       "  'grid': 'gn',\n",
       "  'institute': 'CAS'},\n",
       " Dataset:\n",
       " {'dataset': 'FGOALS-g3',\n",
       "  'project': 'CMIP6',\n",
       "  'mip': 'Amon',\n",
       "  'short_name': 'lwcre',\n",
       "  'derive': True,\n",
       "  'ensemble': 'r1i1p1f1',\n",
       "  'exp': 'historical',\n",
       "  'force_derivation': True,\n",
       "  'grid': 'gn',\n",
       "  'institute': 'CAS'},\n",
       " Dataset:\n",
       " {'dataset': 'IITM-ESM',\n",
       "  'project': 'CMIP6',\n",
       "  'mip': 'Amon',\n",
       "  'short_name': 'lwcre',\n",
       "  'derive': True,\n",
       "  'ensemble': 'r1i1p1f1',\n",
       "  'exp': 'historical',\n",
       "  'force_derivation': True,\n",
       "  'grid': 'gn',\n",
       "  'institute': 'CCCR-IITM'}]"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "datasets = list(dataset_template.from_files())\n",
    "print(f\"Found {len(datasets)} datasets, showing the first 10:\")\n",
    "datasets[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "18e3a0b7",
   "metadata": {},
   "source": [
    "Setting the facet `force_derivation=True` ensures that the variable is derived. If it is omitted and files that provide the variable `lwcre` directly, without derivation, are present, only those are returned."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f00a886f",
   "metadata": {},
   "source": [
    "If variable derivation is necessary (which is always the case when `force_derivation=True` is used), the `files` attribute of the datasets may be empty. In this case, the files of the input variables needed for the derivation can be accessed via the `Dataset.input_datasets` attribute:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "c5edfa65",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset = datasets[0]\n",
    "dataset.files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "97cdf12d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "rlut\n",
      "[ESGFFile:CMIP6/CMIP/AS-RCEC/TaiESM1/historical/r1i1p1f1/Amon/rlut/gn/v20200623/rlut_Amon_TaiESM1_historical_r1i1p1f1_gn_185001-201412.nc on hosts ['esgf.ceda.ac.uk', 'esgf.rcec.sinica.edu.tw', 'esgf3.dkrz.de', 'esgf3.dkrz.de']]\n",
      "rlutcs\n",
      "[ESGFFile:CMIP6/CMIP/AS-RCEC/TaiESM1/historical/r1i1p1f1/Amon/rlutcs/gn/v20200623/rlutcs_Amon_TaiESM1_historical_r1i1p1f1_gn_185001-201412.nc on hosts ['esgf.ceda.ac.uk', 'esgf.rcec.sinica.edu.tw', 'esgf3.dkrz.de']]\n"
     ]
    }
   ],
   "source": [
    "for d in dataset.input_datasets:\n",
    "    print(d[\"short_name\"])\n",
    "    print(d.files)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "esm",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.9"
  },
  "vscode": {
   "interpreter": {
    "hash": "17e81e49408864327be43d3caebcb8eca32ff92a01becb15aa27be73c37f0517"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}