Partial data #467

Merged
merged 62 commits into from
Jun 16, 2023
70f6d9b
Add header for parial data appendix
rartino Jun 8, 2023
9a5a36f
First paragraph of partial data appendix
rartino Jun 8, 2023
0680b2f
Adding a JSON-API response example and to partial data examples.
sauliusg Jun 8, 2023
4063e13
Updating the partial response examples.
sauliusg Jun 8, 2023
1a7230f
A format of partial data URLs agreed with Giovanni.
sauliusg Jun 8, 2023
14b9e9d
Removing scaffold comments.
sauliusg Jun 8, 2023
104ed78
Fixinhg the formatting: removing trailing blanks, unfolding text lines.
sauliusg Jun 8, 2023
9fa3b29
Updating the partial data examples to be consistent with the new
sauliusg Jun 8, 2023
6ec5a11
Checking spelling, updating the ".words.lst" file.
sauliusg Jun 8, 2023
92e7c12
Full text of partial data format appendix
rartino Jun 8, 2023
cc3da46
Merge branch 'partial_data' of https://github.com/rartino/OPTIMADE in…
sauliusg Jun 8, 2023
7a1f684
Slight changes in the text.
sauliusg Jun 8, 2023
0245999
Apply suggestions from review
rartino Jun 8, 2023
06d6444
Apply suggestions from review
rartino Jun 8, 2023
3e5fa16
Delete trailing whitespace
rartino Jun 8, 2023
7eddd27
Fix descriptio of the data -> meta fields in the JSON response format
rartino Jun 8, 2023
3e1f04c
Fixing the "next" link definition.
sauliusg Jun 9, 2023
05eacc2
Update optimade.rst
rartino Jun 10, 2023
beaaeef
Apply suggestions from review
rartino Jun 10, 2023
50c355e
Update based on review
rartino Jun 10, 2023
7a92260
Revert unneseccary change to .words.lst
rartino Jun 11, 2023
8f4db09
Apply suggestions from review
rartino Jun 12, 2023
16d60f6
Slightly change the format of the markers
rartino Jun 12, 2023
e109706
Improve clarity for when number of lines does not match response_range
rartino Jun 12, 2023
34bdf2a
Remove trailing whitespace
rartino Jun 12, 2023
961f5b7
Apply suggestions from review
rartino Jun 14, 2023
874bd52
Apply suggestions from review
rartino Jun 15, 2023
7b314af
Add a key to the header to identify the format as OPTIMADE partial data
rartino Jun 15, 2023
6faf8db
Remove trailing whitespace
rartino Jun 15, 2023
316df78
Clarify handling of missing items in partial data
rartino Jun 15, 2023
b080cf2
Change markers to be more detectable in stream
rartino Jun 15, 2023
bd93804
Change markers to be more detectable in stream
rartino Jun 15, 2023
10bc845
Change markers to be more detectable in stream
rartino Jun 15, 2023
39d9ae5
Change format to representation to avoid a clash in terms and fieldnames
rartino Jun 15, 2023
2a24c1a
Enable for efficient parsing of responses a server knows has no refer…
rartino Jun 15, 2023
9d9e26e
Change format to representation to avoid a clash in terms and fieldnames
rartino Jun 15, 2023
ff5a27c
Rename partial_data_url and url to link to better conform to JSON API…
rartino Jun 15, 2023
8ae1928
Rename partial_data_url and url to link to better conform to JSON API…
rartino Jun 15, 2023
d8a11cb
Rename partial_data_url and url to link to better conform to JSON API…
rartino Jun 15, 2023
11900c5
Rename partial_data_url and url to link to better conform to JSON API…
rartino Jun 15, 2023
1b4093e
Remove trailing whitespace
rartino Jun 15, 2023
496b6ca
Change representation to layout to not confuse with URL representatio…
rartino Jun 15, 2023
4d906a2
Remove accidental leftover text.
rartino Jun 15, 2023
b6ab3ae
Fix segment incorrectly placed
rartino Jun 15, 2023
ee4c1e3
Fix braces in partial data examples
rartino Jun 15, 2023
1b0d1a6
Make returned_range RECOMMENDED and move a sentence that had ended up…
rartino Jun 15, 2023
1b9c607
Fix whitespace
rartino Jun 15, 2023
562d651
Improve formulation about partial data URLs
rartino Jun 15, 2023
498d169
Slightly adjust wording
rartino Jun 15, 2023
e5e6046
Slightly adjust wording
rartino Jun 15, 2023
4906c4f
Slightly adjust wording
rartino Jun 15, 2023
864450d
Slightly adjust wording
rartino Jun 15, 2023
e574106
Minor reformulations
rartino Jun 15, 2023
336ef21
Minor reformulations
rartino Jun 15, 2023
93ee583
Rearrange some text to be more logical
rartino Jun 15, 2023
edf4f25
Clarify optimade-partial-data/format field futureproofing
rartino Jun 15, 2023
5b13315
Minor reformulations and adjustments
rartino Jun 15, 2023
2cfe8c0
Allow an inline item_schema in addition to the link
rartino Jun 15, 2023
4e9fb4d
Fix missing quotation marks
rartino Jun 15, 2023
b50d93d
Minor language corrections from review
rartino Jun 16, 2023
dfc24d4
Add sentence about implementations decision on what is partial data
rartino Jun 16, 2023
a0aa533
Merge branch 'develop' into partial_data
rartino Jun 16, 2023
6 changes: 5 additions & 1 deletion .words.lst
@@ -86,6 +86,7 @@ bandgap
bd
booktitle
boolean
bzip
calc
cartesian
checksums
@@ -115,18 +116,21 @@ exclusiveMinimum
exmpl
fieldname
firstname
hdf
howpublished
href
html
http
hydrogens
hydroperoxide
implementers
incrementing
internaldb
javascript
json
jsonapi
jsonc
jsonlines
kvak
lastname
libc
@@ -203,4 +207,4 @@ xy
yacc
zeo
zeolites
�ngstr�m
ångström
196 changes: 195 additions & 1 deletion optimade.rst
@@ -442,6 +442,54 @@ For example, the following query can be sent to API implementations `exmpl1` and

:filter:`filter=_exmpl1_band_gap<2.0 OR _exmpl2_band_gap<2.5`

Transmission of large property values
-------------------------------------

A property value may be too large to fit in a single response.
OPTIMADE provides a mechanism for a client to handle such properties by fetching them in a separate series of requests.

In this case, the response to the initial query gives the value :val:`null` for the property.
A list of one or more data URLs, together with their respective partial data formats, is given in the response.
How this list is provided depends on the response format.
For the JSON response format, see the description of the :field:`partial_data_urls` field inside :field:`meta` inside :field:`data` in the section `JSON Response Schema: Common Fields`_.

The default partial data format is named "jsonlines" and is described in the Appendix `OPTIMADE JSON lines partial data format`_.
An implementation SHOULD always include this format as one of the alternative partial data formats provided for a property that has been omitted from the response to the initial query.
Implementations MAY provide links to their own non-standard formats, but non-standard format names MUST be prefixed by a database-provider-specific prefix.

Below follows an example of the data and meta parts of a response in the JSON response format.
It communicates that the property value has been omitted from the response and provides three URLs, each for a different partial data format.

.. code:: jsonc

  {
    // ...
    "data": {
      "type": "structures",
      "id": "2345678",
      "attributes": {
        "a": null
      },
      "meta": {
        "partial_data_urls": {
          "a": [
            {
              "format": "jsonlines",
              "url": "https://example.org/optimade/v1.2/extensions/partial_data/structures/2345678/a/default_format"
            },
            {
              "format": "_exmpl_bzip2_jsonlines",
              "url": "https://db.example.org/assets/partial_values/structures/2345678/a/bzip2_format"
            },
            {
              "format": "_exmpl_hdf5",
              "url": "https://cloud.example.org/ACCHSORJGIHWOSJZG"
            }
          ]
        }
      }
    }
    // ...
  }
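
As a rough illustration of how a client might consume this part of the response, the following Python sketch selects a URL for a preferred partial data format from a resource object that carries :field:`partial_data_urls` inside its :field:`meta` field.
The function name `pick_partial_data_url` and the fallback to the first listed alternative are assumptions made for illustration; they are not defined by this specification.

```python
from typing import Optional


def pick_partial_data_url(resource: dict, field: str,
                          preferred: str = "jsonlines") -> Optional[str]:
    """Return a URL for fetching an omitted property, or None if none is listed."""
    entries = resource.get("meta", {}).get("partial_data_urls", {}).get(field, [])
    for entry in entries:
        if entry["format"] == preferred:
            return entry["url"]
    # Assumed fallback policy (not from the specification): first listed alternative.
    return entries[0]["url"] if entries else None


# Resource object mirroring the field layout described in this section.
resource = {
    "type": "structures",
    "id": "2345678",
    "attributes": {"a": None},
    "meta": {
        "partial_data_urls": {
            "a": [
                {"format": "_exmpl_hdf5",
                 "url": "https://cloud.example.org/ACCHSORJGIHWOSJZG"},
                {"format": "jsonlines",
                 "url": "https://example.org/optimade/v1.2/extensions/partial_data/structures/2345678/a/default_format"},
            ]
        }
    },
}
url = pick_partial_data_url(resource, "a")
```

A client that cannot read any database-specific format would probably prefer to return `None` instead of falling back; the fallback is kept here only to show where such a policy would go.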

Responses
=========

@@ -593,6 +641,22 @@ Every response SHOULD contain the following fields, and MUST contain at least :f
- **data**: The schema of this value varies by endpoint, it can be either a *single* `JSON API resource object <http://jsonapi.org/format/1.0/#document-resource-objects>`__ or a *list* of JSON API resource objects.
Every resource object needs the :field:`type` and :field:`id` fields, and its attributes (described in section `API Endpoints`_) need to be in a dictionary corresponding to the :field:`attributes` field.

Every resource object MAY also contain a :field:`meta` field with the following keys:

- **partial_data_urls**: an object used to list URLs which can be used to fetch data that has been omitted from the :field:`data` part of the response.
The keys are the names of the fields in :field:`attributes` for which partial data URLs are available.
Each value is a list of items that MUST have the following keys:

- **format**: String.
A name of the format provided via this URL.
One of the items SHOULD be "jsonlines", which refers to the format in `OPTIMADE JSON lines partial data format`_.

- **url**: String.
The URL from which the data can be fetched.
There is no requirement on the syntax or format of the URL.

Member

Storing this under meta means that this is not directly queryable, correct? Is there a way to query for fields that are too big? A JSON:API-based alternative (apologies if this was already discussed...) could be to include these links as relationships with links, with otherwise the same descriptions.

Contributor Author

I don't think we discussed solving it with relationships. But, the idea has been that the "too big to fit" handling lives completely on the implementation level as a mechanism mostly orthogonal to the rest of OPTIMADE. It is specifically designed as a way for a client to get data that, if it had fitted, would have been inlined in the response. What is "large" and "not large" isn't "data", doesn't belong in the database and thus cannot be queried. Arguably it could change at a moment's notice if the server is trying to keep response data down.


For more information about the mechanism to transmit large property values, including an example of the format of :field:`partial_data_urls`, see `Transmission of large property values`_.

The response MAY also return resources related to the primary data in the field:

- **links**: `JSON API links <http://jsonapi.org/format/1.0/#document-links>`__ is REQUIRED for implementing pagination.
@@ -915,7 +979,8 @@ OPTIONALLY it can also contain the following fields:

- **self**: the entry's URL

- **meta**: a `JSON API meta object <https://jsonapi.org/format/1.0/#document-meta>`__ that contains non-standard meta-information about the object.
- **meta**: a `JSON API meta object <https://jsonapi.org/format/1.0/#document-meta>`__ that is used to communicate metadata.
See `JSON Response Schema: Common Fields`_ for more information about this field.
Comment on lines +984 to +985
Contributor

Should this not come from PR #463?

Suggested change
- **meta**: a `JSON API meta object <https://jsonapi.org/format/1.0/#document-meta>`__ that is used to communicate metadata.
See `JSON Response Schema: Common Fields`_ for more information about this field.

Contributor Author

I think we need to define the meta field here to hold the "partial_data_links" key? Otherwise this PR would be inconsistent.

Contributor

I think I came across another commit which removed part of the definition of the metadata fields. So it looked like you forgot this piece, which is why I mentioned it.
Either both should be in or both should be out.

Contributor Author

Earlier I indeed removed a segment here that defined the property_metadata subkey of meta, which I agree belongs better in #463. But, the segment you have marked now defines the meta superkey we need for the partial_data_links subkey.

I'm confused over what you are asking for. Are you saying we absolutely should not mention meta with a link to 'JSON Response Schema: Common Fields' that defines meta -> partial_data_links; despite that with this PR that key is an absolutely vital part of the 'Entry Listing JSON Response Schema'?


- **relationships**: a dictionary containing references to other entries according to the description in section `Relationships`_ encoded as `JSON API Relationships <https://jsonapi.org/format/1.0/#document-resource-object-relationships>`__.
The OPTIONAL human-readable description of the relationship MAY be provided in the :field:`description` field inside the :field:`meta` dictionary of the JSON API resource identifier object.
@@ -3421,3 +3486,132 @@ The strings below contain Extended Regular Expressions (EREs) to recognize ident
#BEGIN ERE strings
"([^\"]|\\.)*"
#END ERE strings

OPTIMADE JSON lines partial data format
---------------------------------------
The OPTIMADE JSON lines partial data format is a lightweight format for transmitting property data that are too large to fit in a single OPTIMADE response.
The format is based on `JSON Lines <https://jsonlines.org/>`__, which allows for streaming handling of large datasets.
Note: since the below definition references both JSON fields and OPTIMADE properties, the data type names depend on context: for JSON they are, e.g., "array" and "object" and for OPTIMADE properties they are, e.g., "list" and "dictionary".

.. _slice object:

To aid the definition of the format below, we first define a "slice object" to be a JSON object describing slices of arrays.
The dictionary has the following OPTIONAL fields:

- :field:`"start"`: Integer.
The slice starts at the value with the given index (inclusive).
The default is 0, i.e., the value at the start of the array.
- :field:`"stop"`: Integer.
The slice ends at the value with the given index (inclusive).
If omitted, the end of the slice is the last index of the array.
- :field:`"step"`: Integer.
The absolute difference in index between two subsequent values that are included in the slice.
The default is 1, i.e., every value in the range indicated by :field:`start` and :field:`stop` is included in the slice.
Hence, a value of 2 denotes a slice of every second value in the array.

For example, for the array `["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]` the slice object `{"start":1, "stop":7, "step": 3}` refers to the items `["b", "e", "h"]`.
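
The slice semantics above, with an inclusive :field:`stop` and the stated defaults, can be sketched in Python as follows.
The helper name `slice_indices` is hypothetical, and clamping :field:`stop` to the last index of the array is an assumption of this sketch, not something the definition above requires.

```python
def slice_indices(slice_obj: dict, length: int) -> list:
    """Expand a slice object into the list of array indices it covers."""
    start = slice_obj.get("start", 0)
    stop = slice_obj.get("stop", length - 1)  # inclusive; defaults to the last index
    step = slice_obj.get("step", 1)
    # range() excludes its upper bound, so add 1 to honour the inclusive "stop".
    # Assumption: a "stop" beyond the array is clamped to the last index.
    return list(range(start, min(stop, length - 1) + 1, step))


letters = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
indices = slice_indices({"start": 1, "stop": 7, "step": 3}, len(letters))
selected = [letters[i] for i in indices]
```

Here `selected` holds the items at indices 1, 4, and 7. Note that this differs from Python's own slicing, where the upper bound is exclusive.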

Furthermore, we also define the following special markers:

- The "end-of-data-marker" is this exact JSON: :val:`["end", [""]]`.
- A "reference-marker" is this exact JSON: :val:`["ref", ["URL"]]`, where :val:`"URL"` is to be replaced with a URL being referenced.
- A "next-marker" is this exact JSON: :val:`["next", ["URL"]]`, where :val:`"URL"` is to be replaced with the target URL for the next link.

There is no requirement on the syntax or format of the URLs provided in these markers.
The data provided via the URLs MUST be the JSON lines partial data format, i.e., the markers cannot be used to link to partial data provided in other formats.
The markers have been deliberately designed to be valid JSON but *not* valid OPTIMADE property values.
Since the OPTIMADE list data type is defined as a list of values of the same data type or :val:`null`, the above markers cannot be encountered inside the actual data of an OPTIMADE property.
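
Because the markers cannot be confused with property data, a client can recognize them by shape alone.
The following Python sketch shows one way to do this for a whole line; the function name `classify_line` and its return convention are assumptions made for illustration.

```python
import json


def classify_line(line: str):
    """Return (kind, payload), where kind is "end", "ref", "next" or "data"."""
    value = json.loads(line)
    if (isinstance(value, list) and len(value) == 2
            and value[0] in ("end", "ref", "next")
            and isinstance(value[1], list) and len(value[1]) == 1
            and isinstance(value[1][0], str)):
        # Marker shape: [name, [string]]; the string is a URL, or "" for "end".
        return value[0], value[1][0]
    return "data", value


kind, payload = classify_line('["next", ["https://example.db.org/value4"]]')
```

This check only looks at the top level of a line. Reference-markers can also appear nested inside a data item, which requires walking the item recursively.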

The full response MUST be valid `JSON Lines <https://jsonlines.org/>`__ that adheres to the following format:

- The first line is a header object (defined below).
- The following lines are data lines adhering to the formats described below.
- The final line is either an end-of-data-marker (indicating that there is no more data to be given), or a next-marker indicating that more data is available, which can be obtained by retrieving data from the provided URL.
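
The three-part structure listed above can be checked with a few lines of Python.
This is a minimal sketch that reads the whole response into memory, which gives up the streaming benefit of JSON Lines; the function name `split_response` is hypothetical.

```python
import json


def split_response(text: str):
    """Split a partial data response into (header, data_lines, final_marker)."""
    lines = [json.loads(raw) for raw in text.splitlines() if raw.strip()]
    if len(lines) < 2:
        raise ValueError("expected at least a header and a final marker")
    header, data_lines, final = lines[0], lines[1:-1], lines[-1]
    if not isinstance(header, dict):
        raise ValueError("first line must be a JSON object")
    if not (isinstance(final, list) and len(final) == 2
            and final[0] in ("end", "next")):
        raise ValueError("last line must be an end-of-data-marker or a next-marker")
    return header, data_lines, final


response = "\n".join([
    '{"format": "dense"}',
    "123",
    "345",
    '["next", ["https://example.db.org/value4"]]',
])
header, data_lines, final = split_response(response)
```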

The first line MUST be a JSON object providing header information.
The header object MUST contain the key:

- :field:`"format"`: String.
A string either equal to :val:`"dense"` or :val:`"sparse"` to indicate whether the returned format is dense or sparse.

Member

This is not sufficient to identify the stream as containing OPTIMADE data, can we also use the key from #471? Something like

{"x-optimade": {"format": "dense", "api_version": "1.2.0"}}

Contributor Author

So, what do you think of the "optimade-partial-data" I added? I don't want to mess too much with the rest of the header unless/until others chime in.


The header object MAY also contain the key:

- :field:`"returned_ranges"`: Array of Objects.
For dense data, and sparse data of one-dimensional list properties, the array contains a single element which is a `slice object`_ representing the range of data present in the response.
Once the client has encountered an end-of-data-marker, any data not covered by any of the encountered slices are to be assigned the value :val:`null`.
If the field :field:`"format"` is `"dense"` and :field:`"returned_ranges"` is omitted, then the client MUST assume that the data is a continuous range of data from the start of the array up to the number of elements given until reaching the end-of-data-marker or next-marker.
If :field:`"returned_ranges"` is included and the client encounters a next-marker or end-of-data-marker before receiving all lines indicated by the slice, it should proceed by not assigning any values to those items, i.e., this is not an error.
In the specific case of a hierarchy of list properties represented as a sparse multi-dimensional array, if the field :field:`"returned_ranges"` is given, it MUST contain one slice object per dimension of the multi-dimensional array, representing slices for each dimension that cover the data given in the response.
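
For the dense case, the rules above amount to mapping the n-th received data line to an index in the original list.
The Python sketch below illustrates this under stated assumptions: the helper names are hypothetical, and the total `length` passed to `to_list` is assumed to be known to the client by some other means, since the header does not provide it.

```python
def place_dense_items(items: list, returned_ranges=None) -> dict:
    """Map each received item to its index in the original list."""
    if returned_ranges:
        rng = returned_ranges[0]  # dense data uses a single slice object
        start, step = rng.get("start", 0), rng.get("step", 1)
    else:
        # Without returned_ranges, dense data is a contiguous range from index 0.
        start, step = 0, 1
    # Fewer lines than the slice announces is not an error: the remaining
    # indices are simply left unassigned.
    return {start + n * step: item for n, item in enumerate(items)}


def to_list(placed: dict, length: int) -> list:
    """Build the full list, using None (JSON null) where nothing was received."""
    return [placed.get(i) for i in range(length)]


placed = place_dense_items([123, 345, -12.6],
                           [{"start": 10, "stop": 20, "step": 2}])
```

Filling gaps with `None` via `to_list` is only appropriate once an end-of-data-marker has been encountered, as stated above.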

The format of data lines of the response (i.e., all lines except the first and the last) depends on whether the header object specifies the format as :val:`"dense"` or :val:`"sparse"`.

- **Dense format:** In the dense partial data format, each data line reproduces one list item in the OPTIMADE list property being transmitted in JSON format.
If OPTIMADE list properties are embedded inside the item, they can either be included in full or replaced with a reference-marker.
If a list is replaced by a reference marker, the client MAY use the provided URL to obtain the list items.

- **Sparse format for one-dimensional list:** When the response sparsely communicates items for a one-dimensional OPTIMADE list property, each data line contains a JSON array in the format:

- The first item is the zero-based index of the item provided.
- The second item is a JSON representation of the item, with the same format as the lines in the dense format.
In the same way as for the dense format, reference-markers are allowed for data that does not fit in the response (see example below).

- **Sparse format for multi-dimensional lists:** We provide a sparse format specifically for the case that the OPTIMADE property represents a series of directly hierarchically embedded lists (i.e., a multidimensional sparse array).
Then, the server MAY represent them using the following sparse multi-dimensional format for a number of aggregated dimensions.
In this case, each data line contains a JSON array in the format of:

- All items except the last item are integer zero-based indices of the value being provided in this line; these indices refer to the aggregated dimensions in the order of outermost to innermost.
- The last item is a JSON representation of the item at those coordinates, with the same format as the lines in the dense format.
In the same way as for the dense format, reference-markers are allowed for data that does not fit in the response.
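
In both sparse formats a data line is a JSON array whose last element is the value and whose preceding elements are the indices.
A minimal Python sketch of decoding such lines follows; the function name `decode_sparse_line` and the choice of a dictionary keyed by index tuples are assumptions made for illustration.

```python
import json


def decode_sparse_line(line: str):
    """Split one sparse data line into (index_tuple, value)."""
    row = json.loads(line)
    # All elements except the last are zero-based indices, outermost first.
    *indices, value = row
    return tuple(indices), value


lines = [
    '[3,5,19, [10,20,21,30]]',
    '[30,15,9, ["ref", ["https://example.db.org/value1"]]]',
]
sparse = dict(decode_sparse_line(line) for line in lines)
```

The same function handles the one-dimensional case, where the index tuple has a single element. Values that are reference-markers are stored as-is here and would need to be resolved separately.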

Examples
--------

Below follows an example of a dense response for a partial array of numeric values.
The response returns the first three items of the range given in :field:`returned_ranges` (indices 10, 12, and 14 of the original list) and provides a next-marker link to continue fetching data:

.. code:: json

  {"format": "dense", "returned_ranges": [{"start": 10, "stop": 20, "step": 2}]}
  123
  345
  -12.6
  ["next", ["https://example.db.org/value4"]]
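
A client would typically keep requesting pages until it reaches an end-of-data-marker.
The Python sketch below shows such a loop with the transport abstracted into a `fetch` callable, so that no particular HTTP library is assumed.
The function name `iter_data_lines`, the starting URL, and the content of the second page are hypothetical and exist only to make the sketch runnable.

```python
import json
from typing import Callable, Iterator


def iter_data_lines(url: str, fetch: Callable[[str], str]) -> Iterator:
    """Yield parsed data lines, following next-markers until the end-of-data-marker."""
    while url is not None:
        lines = [json.loads(raw) for raw in fetch(url).splitlines() if raw.strip()]
        body, final = lines[1:-1], lines[-1]  # lines[0] is the header
        yield from body
        # A next-marker carries the URL of the following page; an end marker stops.
        url = final[1][0] if final[0] == "next" else None


# Hypothetical in-memory "server" standing in for real HTTP requests.
pages = {
    "https://example.db.org/value1":
        '{"format": "dense"}\n123\n345\n-12.6\n'
        '["next", ["https://example.db.org/value4"]]',
    "https://example.db.org/value4":
        '{"format": "dense"}\n7\n["end", [""]]',
}
values = list(iter_data_lines("https://example.db.org/value1", pages.__getitem__))
```

For brevity the sketch ignores the header of each page; a real client would also read :field:`returned_ranges` from it to place the items correctly.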

Below follows an example of a dense response for a list property as a partial array of multidimensional array values.
The item with index 10 in the original list is provided explicitly in the response and is the first one provided in the response since start=10.
The item with index 12 in the list, the second data item provided since start=10 and step=2, is not included, only referenced.
The third provided item (index 14 in the original list) is only partially returned: it is a list of three items, the first and last are explicitly provided, the second one is only referenced.

.. code:: json

  {"format": "dense", "returned_ranges": [{"start": 10, "stop": 20, "step": 2}]}
  [[10,20,21], [30,40,50]]
  ["ref", ["https://example.db.org/value2"]]
  [[11, 110], ["ref", ["https://example.db.org/value3"]], [550, 333]]
  ["next", ["https://example.db.org/value4"]]
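
Reference-markers nested inside an item, as in the third data line of this example, can be replaced by the lists they point to with a recursive walk.
The Python sketch below assumes a `fetch_list` callable that returns the already decoded list for a URL; that callable, the function names, and the sample list `[5, 6]` are hypothetical.

```python
from typing import Callable


def is_ref(value) -> bool:
    """True if value has the shape of a reference-marker: ["ref", ["URL"]]."""
    return (isinstance(value, list) and len(value) == 2 and value[0] == "ref"
            and isinstance(value[1], list) and len(value[1]) == 1
            and isinstance(value[1][0], str))


def resolve_refs(value, fetch_list: Callable[[str], list]):
    """Recursively replace reference-markers with the lists they point to."""
    if is_ref(value):
        # The fetched list may itself contain further reference-markers.
        return resolve_refs(fetch_list(value[1][0]), fetch_list)
    if isinstance(value, list):
        return [resolve_refs(item, fetch_list) for item in value]
    return value


# Hypothetical stand-in for fetching and decoding the referenced partial data.
referenced = {"https://example.db.org/value3": [5, 6]}
item = [[11, 110], ["ref", ["https://example.db.org/value3"]], [550, 333]]
resolved = resolve_refs(item, referenced.__getitem__)
```

The sketch only descends into lists. Items that contain dictionaries with embedded list properties would need an additional branch.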

Below follows an example of the sparse format for multi-dimensional lists with three aggregated dimensions.
The underlying property value can be taken to be sparse data in lists in four dimensions of 10000 x 10000 x 10000 x N, where the innermost list is a non-sparse list of numbers of arbitrary length.
The only non-null items in the outer three dimensions are, say, [3,5,19], [30,15,9], and [42,54,17].
The response below communicates the first item explicitly; the second one by deferring the innermost list using a reference-marker; and the third item is not included in this response, but deferred to another page via a next-marker.

.. code:: json

  {"format": "sparse"}
  [3,5,19, [10,20,21,30]]
  [30,15,9, ["ref", ["https://example.db.org/value1"]]]
  ["next", ["https://example.db.org/"]]

An example of the sparse format for multi-dimensional lists with three aggregated dimensions and integer values:

.. code:: json

  {"format": "sparse"}
  [3,5,19, 10]
  [30,15,9, 31]
  ["next", ["https://example.db.org/"]]

An example of the sparse format for multi-dimensional lists with three aggregated dimensions and values that are multidimensional lists of integers of arbitrary lengths:

.. code:: json

  {"format": "sparse"}
  [3,5,19, [ [10,20,21], [30,40,50] ]]
  [3,7,19, ["ref", ["https://example.db.org/value2"]]]
  [4,5,19, [ [11, 110], ["ref", ["https://example.db.org/value3"]], [550, 333] ]]
  ["end", [""]]