
ESGFInterfaceGroups|InformationArchitecture

Stephen Pascoe edited this page Apr 9, 2014 · 7 revisions
Wiki Reorganisation
This page has been classified for reorganisation. It has been given the category REVISE.
This page contains useful content but needs revision. It may contain out of date or inaccurate content.

Information Architecture (Data Model) Group

ESGF Catalog Proposal

The relationship between the ESGF data model and THREDDS would be clarified by defining an independent format for encoding the ESGF information. This format could then evolve independently of the constraints of THREDDS XML to include new features such as:

  • Separating local access information from location-independent dataset information
  • Generating dataset hashes for version tracking
  • ...

Each dataset-version can be represented by an ESGF catalog document. From any ESGF catalog a hash can be calculated that is guaranteed to be unique for that dataset-version, such that any change to the data or immutable metadata will change the hash value.

Therefore we can make the following sort of statement about an ESGF dataset:

  • Dataset cmip5.EXAMPLE.MOHC.HadGEM2-ES.rcp45.mon.ocean.Omon.r1i1p1 at version 20111206 has a hash id of 4a58ddd7a9556c3375a02695953ece961805ade0

These (dataset_id, version, hash) mappings can be used to check the consistency of data holdings, both at ESGF data centres and by users who have downloaded data.
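As an illustration of such a consistency check, the sketch below compares two sets of mappings held as Python dictionaries. The function name `compare_holdings` and the dictionary layout are assumptions made for this example and are not part of the proposal.

```python
def compare_holdings(reference, replica):
    """Compare two {(dataset_id, version): hash} mappings.

    Returns (missing, changed): keys absent from `replica`, and keys
    present in both whose hashes differ.
    """
    missing = sorted(key for key in reference if key not in replica)
    changed = sorted(
        key for key in reference
        if key in replica and replica[key] != reference[key]
    )
    return missing, changed


# The reference entry is the statement quoted above; the replica hash
# is a made-up placeholder standing in for a corrupted copy.
reference = {
    ("cmip5.EXAMPLE.MOHC.HadGEM2-ES.rcp45.mon.ocean.Omon.r1i1p1", "20111206"):
        "4a58ddd7a9556c3375a02695953ece961805ade0",
}
replica = {
    ("cmip5.EXAMPLE.MOHC.HadGEM2-ES.rcp45.mon.ocean.Omon.r1i1p1", "20111206"):
        "0000000000000000000000000000000000000000",
}
missing, changed = compare_holdings(reference, replica)
```

Here `missing` is empty and `changed` lists the one dataset-version whose hash differs, which would flag it for re-download or investigation.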

General encoding and format

ESGF catalogs are encoded as JSON documents with two main parts: header and body. The body contains information that is immutable for a dataset_version, including DRS facets and file checksums. The header contains mutable information plus a hash of a canonical serialisation of the body.

The body is serialised according to these rules, which are closely based on the canonicalisation described at http://wiki.laptop.org/go/Canonical_JSON :

  1. strict adherence to the grammar at http://www.json.org (e.g. object keys must be quoted with '"', no trailing commas in members and elements),

  2. no whitespace between JSON tokens,

  3. all map keys are serialised in lexicographically sorted order,

  4. no floating-point values allowed (because of ambiguities in converting to/from ASCII),

  5. no leading zeros in integers, and no "minus zero" (-0) integer value,

  6. the only escape sequences allowed are those for '\' and '"'; a backslash must be encoded as "\\".
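The rules above can be sketched as a small recursive encoder. This is an illustrative sketch rather than a reference implementation: the function name `canonical_json` is hypothetical, keys are sorted by Unicode code point (one reading of "lexicographically"), and the caller is assumed to encode the result as UTF-8 before hashing.

```python
def canonical_json(value):
    """Serialise `value` following the six rules above (a sketch)."""
    # Booleans are tested first because bool is a subclass of int.
    if value is None:
        return "null"
    if value is True:
        return "true"
    if value is False:
        return "false"
    if isinstance(value, int):
        # str() never emits leading zeros, and an int cannot be "-0".
        return str(value)
    if isinstance(value, float):
        raise TypeError("floating-point values are not allowed")
    if isinstance(value, str):
        # Only backslash and double quote are escaped; backslash first,
        # so the backslashes added for quotes are not doubled again.
        return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'
    if isinstance(value, (list, tuple)):
        return "[" + ",".join(canonical_json(item) for item in value) + "]"
    if isinstance(value, dict):
        members = []
        for key in sorted(value):  # assumption: code-point order
            if not isinstance(key, str):
                raise TypeError("object keys must be strings")
            members.append(canonical_json(key) + ":" + canonical_json(value[key]))
        return "{" + ",".join(members) + "}"
    raise TypeError("cannot serialise %r" % type(value).__name__)
```

Rejecting floats with an exception, rather than rounding them, keeps the encoder from silently producing a hash that another implementation could not reproduce.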

Document template

The proposed document structure is shown below. The comments are annotations only and would not appear in a real catalog, since JSON does not allow comments.

{
    "header": {
        // TODO: I'm not sure we need this id
        "id": "<some-identifier>",
        "catalog_version": "0.0.1",
        // body_hash is generated from a canonical serialisation of the "body" object.
        "body_hash": "<sha1-hash>",
        "body_hash_type": "SHA1",
        "created": "<utc-timestamp>",
        "properties": {
            "title": "<title>"
            /* Other mutable facets could be declared here
               for converting to Solr but should be in body.facets
               if they are also in the NetCDF.
            */
        },
        // Details about services and other metadata can be given here
        "links": {
            "thredds": "<thredds-catalog-url>"
        }
    },
    "body": {
        // The dot-separated DRS dataset_id minus version
        "dataset_id": "<dataset_id>",
        // The version string without "v" prefix (YYYYMMDD for CMIP5)
        "version": "<version>",
        /* All immutable facets.  I.e. facets that are part of the dataset_id,
           are baked into the NetCDF metadata or are otherwise considered immutable.
        */
        "facets": {
            "activity": "<activity/project>",
            "product": "<product>",
            "institute": "<institute>",
            "model": "<model>",
            // ... more DRS facets here
        },
        "files": {
            // filepaths are relative to a dataset directory.  For CMIP5 this would be "<variable>/<filename>.nc"
            // Future checksum_types could contain more structure.  E.g. separate hashes
            // for data and metadata or a merkle tree.
            "<filepath>": {"checksum": "<checksum>", "checksum_type": "<type>", "size": <size-in-bytes>},
            "<filepath>": {"checksum": "<checksum>", "checksum_type": "<type>", "size": <size-in-bytes>},
            "<filepath>": {"checksum": "<checksum>", "checksum_type": "<type>", "size": <size-in-bytes>},
            // ...
        }
    }
}
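A structural check against this template might look like the following sketch. Which keys count as required is an assumption made here: `id` is treated as optional because the template itself questions it, and the name `structure_problems` is hypothetical.

```python
# Assumed required keys, read off the template; "id" is left optional.
REQUIRED_HEADER = ("catalog_version", "body_hash", "body_hash_type",
                   "created", "properties", "links")
REQUIRED_BODY = ("dataset_id", "version", "facets", "files")
REQUIRED_FILE = ("checksum", "checksum_type", "size")


def structure_problems(doc):
    """Return a list of human-readable problems; empty if none found."""
    problems = []
    header = doc.get("header", {})
    body = doc.get("body", {})
    for key in REQUIRED_HEADER:
        if key not in header:
            problems.append("header.%s is missing" % key)
    for key in REQUIRED_BODY:
        if key not in body:
            problems.append("body.%s is missing" % key)
    for path, entry in body.get("files", {}).items():
        for key in REQUIRED_FILE:
            if key not in entry:
                problems.append("files[%r].%s is missing" % (path, key))
        size = entry.get("size")
        # Sizes must be integers: floats are excluded from the body.
        if "size" in entry and (isinstance(size, bool) or not isinstance(size, int)):
            problems.append("files[%r].size is not an integer" % path)
    return problems
```

Returning a list rather than raising on the first problem lets a publisher see every defect in a catalog in one pass.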

Document Example

An example document would be:

{
    "header": {
        "id": "cmip5.EXAMPLE.output1.MOHC.HadCM3.1pctto4x.mon.ocean.Omon.r1i1p1.v20120320",
        "catalog_version": "0.0.1",
        "body_hash": "6127d07cbbb4464ace675b21835da3c5070e592b",
        "body_hash_type": "SHA1",
        "created": "2012-03-20 13:03:11+00:00",
        "properties": {
            "title": "An example ESGF catalog document"
        },
        "links": {
            "thredds": "http://cmip-dn.badc.rl.ac.uk/thredds/EXAMPLE/catalog.xml"
        }
    },
    "body": {
        "dataset_id": "cmip5.EXAMPLE.output1.MOHC.HadCM3.1pctto4x.mon.ocean.Omon.r1i1p1",
        "version": "20120320",
        "facets": {
            "activity": "cmip5",
            "product": "EXAMPLE",
            "institute": "MOHC",
            "model": "HadCM3",
            "experiment": "1pctto4x",
            "frequency": "mon",
            "realm": "ocean",
            "mip_table": "Omon",
            "ensemble": "r1i1p1"
        },
        "files": {
            "thetao/thetao_Omon_HadCM3_1pctto4x_r1i1p1_2001123114-2004010104.nc": {
                "checksum": "5933f07ad44047c3e7b10451af2e5d55", "checksum_type": "MD5", "size": 42
            },
            "thetao/thetao_Omon_HadCM3_1pctto4x_r1i1p1_2005123119-2008010109.nc": {
                "checksum": "351573a8493903964ad65a6d0e35e6a4", "checksum_type": "MD5", "size": 42
            },
            "thetao/thetao_Omon_HadCM3_1pctto4x_r1i1p1_2008010109-2010010100.nc": {
                "checksum": "98488b3a621096ec1f3dcdfdd04c5973", "checksum_type": "MD5", "size": 42
            },
            "thetao/thetao_Omon_HadCM3_1pctto4x_r1i1p1_2004010104-2005123119.nc": {
                "checksum": "71aeefe23ebd08ef76733f8eea280728", "checksum_type": "MD5", "size": 42
            },
            "thetao/thetao_Omon_HadCM3_1pctto4x_r1i1p1_2000010100-2001123114.nc": {
                "checksum": "09dfd9d793f9edcbb8348f029214bba0", "checksum_type": "MD5", "size": 42
            }
        }
    }
}

To generate the body hash, encode the body as the following byte stream (line breaks are included for readability only and are not part of the stream):

{"dataset_id":"cmip5.EXAMPLE.output1.MOHC.HadCM3.1pctto4x.mon.ocean.Omon.r1i1p
1","facets":{"activity":"cmip5","ensemble":"r1i1p1","experiment":"1pctto4x","f
requency":"mon","institute":"MOHC","mip_table":"Omon","model":"HadCM3","produc
t":"EXAMPLE","realm":"ocean"},"files":{"thetao/thetao_Omon_HadCM3_1pctto4x_r1i
1p1_2000010100-2001123114.nc":{"checksum":"09dfd9d793f9edcbb8348f029214bba0","
checksum_type":"MD5","size":42},"thetao/thetao_Omon_HadCM3_1pctto4x_r1i1p1_200
1123114-2004010104.nc":{"checksum":"5933f07ad44047c3e7b10451af2e5d55","checksu
m_type":"MD5","size":42},"thetao/thetao_Omon_HadCM3_1pctto4x_r1i1p1_2004010104
-2005123119.nc":{"checksum":"71aeefe23ebd08ef76733f8eea280728","checksum_type"
:"MD5","size":42},"thetao/thetao_Omon_HadCM3_1pctto4x_r1i1p1_2005123119-200801
0109.nc":{"checksum":"351573a8493903964ad65a6d0e35e6a4","checksum_type":"MD5",
"size":42},"thetao/thetao_Omon_HadCM3_1pctto4x_r1i1p1_2008010109-2010010100.nc
":{"checksum":"98488b3a621096ec1f3dcdfdd04c5973","checksum_type":"MD5","size":
42}},"version":"20120320"}

which has a SHA1 hash of:

$ sha1hash example.json
6127d07cbbb4464ace675b21835da3c5070e592b  example.json
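The same calculation can be sketched in Python with the standard library. This sketch leans on `json.dumps` as a shortcut: with `sort_keys=True` and compact separators its output agrees with the serialisation rules only when the body contains no floats and no strings with control characters (which `json.dumps` would escape). The UTF-8 encoding and the function names `body_hash` and `is_valid` are assumptions.

```python
import hashlib
import json


def body_hash(body):
    """SHA1 hex digest of a compact, key-sorted serialisation of `body`."""
    canonical = json.dumps(
        body,
        sort_keys=True,          # rule 3: sorted map keys
        separators=(",", ":"),   # rule 2: no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters unescaped
    )
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def is_valid(catalog):
    """Check that header.body_hash matches the hash of the body."""
    header = catalog["header"]
    if header["body_hash_type"].upper() != "SHA1":
        raise ValueError("unsupported body_hash_type: %r" % header["body_hash_type"])
    return header["body_hash"] == body_hash(catalog["body"])
```

Because the hash covers only the body, mutable header fields such as `title` or `links` can be edited without invalidating it.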

TODO: why no tracking_id?

Tools Required

We will need a toolkit that can do these things:

  1. validate a catalog (check body_hash is valid)

  2. THREDDS XML --> catalog (from URL or filesystem)

  3. esgpublish mapfile --> catalog

  4. esgpublish database --> catalog

  5. Compare catalog to filesystem (list file checksum misses, missing files)
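As a rough illustration of item 5, the sketch below walks the `files` map of a parsed catalog and reports missing and mismatching files. It assumes the catalog is already loaded as a Python dict and that `checksum_type` names an algorithm known to `hashlib` (e.g. "MD5"); the function names are hypothetical and no such toolkit exists yet.

```python
import hashlib
from pathlib import Path


def file_checksum(path, checksum_type, chunk_size=1 << 20):
    """Hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.new(checksum_type.lower())
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_to_filesystem(catalog, dataset_dir):
    """Return (missing, mismatched) lists of catalog-relative file paths."""
    missing, mismatched = [], []
    root = Path(dataset_dir)
    for relpath, entry in sorted(catalog["body"]["files"].items()):
        path = root / relpath
        if not path.is_file():
            missing.append(relpath)
            continue
        # Compare size first: it is cheap and catches truncated files.
        if path.stat().st_size != entry["size"]:
            mismatched.append(relpath)
            continue
        if file_checksum(path, entry["checksum_type"]) != entry["checksum"].lower():
            mismatched.append(relpath)
    return missing, mismatched
```

A user who has downloaded a dataset could run this against the dataset directory to find files that need fetching again.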
