registry format / schema #5
Comments
@jhpoelen Can you point me to the vocab you are using for the above? Couldn't find an obvious mapping all within Prov, probably because Prov intends us to draw on other namespaces. I'd propose borrowing terms directly from Dublin Core here, since that seems like the most widely recognized (and cross-mapped)
Thoughts on this? I'm actually not sure how to treat |
I do have thoughts on this, and my current thinking is reflected in the preston implementation. Generally, it follows a process-oriented description rather than a resource-centered description. I'd very much like to hear your thoughts on this and figure out a way to align these implementations. Here goes... Traditionally, with a URL resource-oriented description, you'd say stuff like:
However, the provenance of these entries is not clear. Who determined that the url had an identifier? What does the date mean? This is where prov / pav come in.
where is a download process and hash://sha256/94a4... is the retrieved content. Coming back to your specific examples:
Here's an example of
some content was generated by <124b...>
this content (hash://sha256/7dbc...) has a qualified generation related to this product of the download event. This is a way to assign qualities to a generation "event".
here, the qualified generation <285...> is used to describe at what time the generation occurred.
This line describes that the <285...> is in fact, a generation (event).
Now, the generation event is related to the specific download event.
The generation event used resource https://ropensci.org .
With this, the claim can be supported that the resource accessed via https://ropensci.org has version hash://sha256/7dbc... . The last line is more of a shortcut and can be derived from the qualified generation event. |
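The statements walked through above can be sketched in a few lines of Python. This is only an illustration, not preston's actual implementation: the function names are hypothetical, and the choice of predicates (prov:wasGeneratedBy, prov:qualifiedGeneration, prov:atTime, prov:activity, prov:used, pav:hasVersion) is an assumption about which PROV / PAV terms match each sentence. The content hash is computed from sample bytes rather than taken from the truncated identifiers in this thread.

```python
import hashlib
import uuid
from datetime import datetime, timezone

PROV = "http://www.w3.org/ns/prov#"
PAV = "http://purl.org/pav/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def content_uri(content: bytes) -> str:
    """Return a sha256 hash URI for the given bytes."""
    return "hash://sha256/" + hashlib.sha256(content).hexdigest()


def describe_download(url, content, activity_id, generation_id, when):
    """Return (subject, predicate, object) statements for one download.

    The order mirrors the prose above; the predicates are assumed.
    """
    content_id = content_uri(content)
    return [
        # the content was generated by the download activity
        (content_id, PROV + "wasGeneratedBy", activity_id),
        # the content has a qualified generation
        (content_id, PROV + "qualifiedGeneration", generation_id),
        # the time at which the generation occurred
        (generation_id, PROV + "atTime", when.isoformat()),
        # the qualified generation is a generation (event)
        (generation_id, RDF_TYPE, PROV + "Generation"),
        # the generation is related to the specific download activity
        (generation_id, PROV + "activity", activity_id),
        # the generation used the resource at this url
        (generation_id, PROV + "used", url),
        # shortcut: the resource has this content as a version
        (url, PAV + "hasVersion", content_id),
    ]


if __name__ == "__main__":
    statements = describe_download(
        url="https://ropensci.org",
        content=b"<html>example content</html>",  # placeholder bytes
        activity_id="urn:uuid:" + str(uuid.uuid4()),
        generation_id="urn:uuid:" + str(uuid.uuid4()),
        when=datetime.now(timezone.utc),
    )
    for s, p, o in statements:
        print(s, p, o)
```

The last statement is redundant by design: it can be rebuilt from the statements about the qualified generation, which is what makes it a shortcut.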
Note that with the "preston" approach, you can still create familiar tables like:
without being locked into that schema. The table becomes a user interface on top of the download (or generation) event in the provenance log. |
@jhpoelen Thanks very much for this. Yup, I completely agree that we can create familiar tables and alter the schema later (i.e. have tools that translate those terms into schema terms, just like we map from one schema to another). And obviously that's what hash-archive.org has already done under the hood. For the moment that's what I'm doing; I'm not worried about having full URIs or actual RDF for the 'internal' registry, I just need to call these fields something. I guess I could just stick with the terms hash-archive.org is using for the 'internal' table schema, and kick this discussion down the road until I'm actually implementing this as provenance and storing it as quads (more on that for a separate thread). However, I'd still prefer that terms in a table match a standardized schema (and preferably just one namespace). As you know, from a tooling perspective it's nice if these terms are somewhat consistent (I know one of the main points of RDF is that I can call it schema:name and you can call it dc:title and the computer can read the OWL file and know these are the same). So, in concrete terms:
Related to the last one, I've struggled a bit on what to call these things in the documentation ("hash URIs"? "content hash"? "content hash URI"? ...). I think most researchers actually aren't familiar with the term URI, but I kinda like calling it an identifier, with the pitch being that this can play many of the same roles that the research community has come to associate with a DOI. Maybe that's a discussion for a separate issue. |
Thanks for sharing your desire to use a single, well-defined schema as the basis for a table view of a registry. Here is the most basic table I can come up with using the PROV schema. The idea is that registries record the process of generating content-based identifiers in the form of hash URIs. So, each creation of a content URI is unique and gets a uuid (we may want to hide this uuid for a simplified view). Then, the url used in the process generated the content URI, and the process ended at the provided time. You can rename columns to use friendlier labels, but they are well defined.
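As a rough sketch of such a table in Python: each registration gets its own uuid, records the url that was used, the content URI that was generated, and the time the process ended. The column names below (generation, used, generated, endedAtTime) are hypothetical stand-ins loosely modelled on PROV terms, not the columns of the original table, and the two function names are made up for illustration.

```python
import csv
import hashlib
import io
import uuid
from datetime import datetime, timezone

# Hypothetical column names; friendlier labels could be substituted.
COLUMNS = ["generation", "used", "generated", "endedAtTime"]


def register(url, content, when=None):
    """Build one registry row describing a single generation event."""
    when = when or datetime.now(timezone.utc)
    return {
        "generation": "urn:uuid:" + str(uuid.uuid4()),  # unique per event
        "used": url,
        "generated": "hash://sha256/" + hashlib.sha256(content).hexdigest(),
        "endedAtTime": when.isoformat(),
    }


def to_tsv(rows, columns=COLUMNS):
    """Render rows as TSV; pass fewer columns to hide some of them."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=columns,
        delimiter="\t",
        extrasaction="ignore",  # silently drop columns not requested
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    rows = [register("https://example.org", b"example content")]
    # Full view, including the uuid of the generation event.
    print(to_tsv(rows))
    # Simplified view with the uuid hidden.
    print(to_tsv(rows, ["used", "generated", "endedAtTime"]))
```

Keeping the uuid in the stored row and dropping it only at display time means the simplified view loses nothing: two downloads of the same url that yield the same content remain distinct events.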
|
Based on feedback from #9 now implemented in #11, I agree that the core registry concept doesn't need a You're probably right that each row in the registry.tsv.gz should get a |
I'm slowly coming around to the realization that you are again correct! we probably do want these uuids, even if we hide them from casual display.... |
@jhpoelen I think we still want to think more about registry semantics. I like #5 (comment), but I'm not sure it's quite accurate. @mbjones suggests we consider a In the proposed triple,
uses On the flip side, I'm still not sure how to write this out with |
The qualifiedGeneration can already express the used relation: from
But perhaps there are better ways to express the relation between the generation of the hash and the resource used to generate it. |
so, the generation event uses a resource (e.g., https://example.org) and generates a content hash uri. |
@jhpoelen ah nice, that looks pretty reasonable to me at least |
What schema should we use in a local registry?
We could mimic the schema of hash-archive.org, with 6 fields: url, timestamp, status, type, length, and hashes, like so:

Obviously we can represent this in tabular form as well or what not. (We could also define this as a simple S3 class and associate a print method with it which would just show, say, the non-base64 version of the sha-256 string in hash://sha-256/ format...)

I'm aware that we will want to think more about a richer provenance record too, but for the moment I think of that as a higher layer, and want to get this layer right first.
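To make the six-field idea concrete, here is a minimal Python sketch of one registry entry and of the base64-to-hex conversion mentioned for the print method. Several things are assumptions: the field types (a numeric timestamp, an integer HTTP status), the "algorithm-base64digest" shape of each string in hashes, and the hash://sha256/ spelling of the output, which follows the comments in this thread. The class and function names are hypothetical.

```python
import base64
import hashlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RegistryEntry:
    """One registry record with the six fields named above."""
    url: str
    timestamp: int      # assumed: seconds since the Unix epoch
    status: int         # assumed: HTTP status code of the download
    type: str           # assumed: media type of the content
    length: int         # content length in bytes
    hashes: List[str]   # assumed form: "<algorithm>-<base64 digest>"


def to_hash_uri(encoded: str) -> str:
    """Convert '<algorithm>-<base64 digest>' into 'hash://<algorithm>/<hex>'."""
    algorithm, _, b64 = encoded.partition("-")
    digest = base64.b64decode(b64)
    return f"hash://{algorithm}/{digest.hex()}"


def sha256_uri(entry: RegistryEntry) -> Optional[str]:
    """Return the sha256 hash URI of an entry, or None if it has none."""
    for encoded in entry.hashes:
        if encoded.startswith("sha256-"):
            return to_hash_uri(encoded)
    return None


if __name__ == "__main__":
    content = b"example content"
    digest = hashlib.sha256(content).digest()
    entry = RegistryEntry(
        url="https://example.org/data.csv",
        timestamp=0,
        status=200,
        type="text/csv",
        length=len(content),
        hashes=["sha256-" + base64.b64encode(digest).decode("ascii")],
    )
    print(sha256_uri(entry))
```

Storing the hashes exactly as received and converting only for display keeps the stored record faithful to its source while still showing users the hex form.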
For instance, if we define this schema, then I can better define user-facing functions (like pin()) that use this registry as a backend. The functions I want right now are those that would take a content URI as input and return a location (most recent location, most recent on-disk location, online location, etc). Of course that assumes that for any registry I know where the 'location' (i.e. url in the above schema) and timestamp fields can be found.

It would be nice if we used some valid RDF namespace for this too, instead of the arbitrary names used by hash-archive.org -- maybe PROV? (Then I can easily map the hash-archive.org terms into our preferred namespace before displaying them).