
store and retrieve as simple inverses #23

Merged 2 commits into master on Mar 4, 2020
Conversation

@cboettig (Owner) commented Mar 4, 2020

This PR is a rethink of store and retrieve that I think is closer to the original proposal.

store is a stupidly simple function that takes bits & bytes (specifically, a generalized connection object, url, or path; see #22) and copies those bytes into a local on-disk cache using the preston content-based naming scheme. This means URL content gets downloaded (to tempdir), hashed, and copied into the store; local files just get copied; and nothing is copied if the content already exists (as determined by hashing). store returns the content identifier URI.

retrieve is the trivial inverse of this: it takes a content identifier (URI) and returns the content by parsing the identifier into its location in the store.
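A minimal sketch of the pair in action (the URL is just for illustration, and I'm assuming retrieve hands back a local path to the content):

library(contenturi)

# download (to tempdir), hash, and copy into the local store;
# returns the content identifier URI, e.g. hash://sha256/...
id <- store("https://example.com")

# the inverse: parse the identifier into its location in the store
path <- retrieve(id)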

Note that there's no registry being created in either case, and there's no metadata here. Obviously those things are important, but they would need to be handled explicitly or wrapped into a higher level function -- this pair of functions is all about storing and retrieving content and nothing more.

store() no longer explicitly tries to set the store up as a bagit bag, but the preston-style file structure can easily be wrapped up this way. I think this could be quite useful if one wanted to publish or move the store around, so I've provided the (not yet exported) function bagit_manifest(), which generates the manifest-sha256.txt and bagit.txt for store's directory. Not sure how useful that is, but it's pretty trivial to do. I do want to think a bit more about workflows for easily publishing a 'content store' like this. (Again, this is probably only useful once it is coupled with a metadata registry, but I think there's value in keeping those as separate modules.)
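For reference, the two generated files are tiny. Roughly (hash and layout illustrative, assuming preston-style aa/bb sharding under data/):

bagit.txt

BagIt-Version: 0.97
Tag-File-Character-Encoding: UTF-8

manifest-sha256.txt, one checksum-path pair per blob (note the path is itself derived from the checksum):

ea8fac7c65fb589b0d53560f5251f74f9e9b243478dcb6b3ea79b5e36449c8d9  data/ea/8f/ea8fac7c65fb589b0d53560f5251f74f9e9b243478dcb6b3ea79b5e36449c8d9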

In addition to store and retrieve, even a simple store like this needs a few more capabilities, e.g. inventory() & discard(), or something like that, to list what's in the store and remove one or more blobs from it. Suggestions for these verbs welcome (see #13).

I wonder if it would be better to name these functions dsk_store and dsk_retrieve or something similar, to signal that they are but one possible implementation of a store (i.e. one built on local disk storage). We could then obviously implement other kinds of stores, and the generic store() and retrieve() could operate over multiple stores at once, like register() and query() already do. Only I'm not really sure that's a good idea. I do think I might want to retrieve content by looking across multiple stores, though I think that action needs a higher-level verb. I'm less sure I would ever want to simultaneously store() content in multiple stores; that seems misguided.

So I'm sticking with store() and retrieve() as just the local disk-based storage functions, but also thinking about defining a new verb, resolve(), which is charged with taking a content identifier and "resolving" (retrieving?) those bits and bytes from whatever stores it knows about (actually by querying registries of "stores"). Any thoughts on resolve()? (Yes, this is a rebranding of my retrieve() from #15.)
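To make the idea concrete, a rough sketch of the shape resolve() might take (not an implementation; `stores` here is an assumed list of retrieve-like functions, one per known store):

# try each known store in turn until one yields the content
resolve <- function(id, stores) {
  for (retrieve_fn in stores) {
    hit <- tryCatch(retrieve_fn(id), error = function(e) NULL)
    if (!is.null(hit)) return(hit)
  }
  stop("content not found in any known store: ", id)
}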

@jhpoelen (Collaborator) commented Mar 4, 2020

I like the symmetry and simplicity of the store/retrieve pair. As far as different implementations go: as long as the client can configure the kind of store without having to change the interface (e.g., the store/retrieve function names + arguments), you can go to town with implementations.

In elton/preston implementations, I've built stores that use tarballs, url patterns, pull-through (~ a local store backed by remotes, where only retrieved bits 'n bytes are kept locally), etc.

Perhaps it would be nice to separate the read-only interface (retrieve only) from the read/write one (retrieve/store) to simplify the most common use case: retrieving data.

As far as discard goes . . . you should never toss data, right?

And as far as the inventory goes ~ I'd imagine that the provenance store (aka registry) would take care of that.

Have you given thought to how and when to store provenance versions into the content store?

As far as the bag-it thing goes . . . the manifest would pretty much map hashes to hashes, but I can see how the redundancy would make the file naming convention more explicit by using an existing bag format.

My fingers are itching to implement a registry and store implementation to allow for querying and retrieving all of GBIF, DataONE, BHL etc. via contenturi.

@cboettig (Owner, Author) commented Mar 4, 2020

> I like the symmetry and simplicity of the store/retrieve pair

👍 me too!

> Perhaps nice to separate read-only interface (retrieve only) and read/write (retrieve/store)

Agree 💯. Currently, "retrieve-only" is handled (at least in part) by the query() end (i.e. I think of that as an operation involving a registry). As discussed, the location given by query is likely to be protocol-specific. In time, we may have a function that is smart enough to figure that out (and select which location -- maybe resolve?), but for the time being I think we are okay.

> inventory == query(registry = app_dir())

of course! good call.

> you should never toss data

I think it's a mistake to see data stored on the local hard disk of some machine as in any way "permanent". Disks have finite capacity, disks fail, and the machine could be a cloud machine that exists only while computation is running (or while bills are being paid). As written, store interacts with that local disk, and users need to (a) not think of it as 'permanent', and (b) be able to manage space on that disk. That's partly why I think we also need the ability to "publish" a store (#24). Of course that's not bullet-proof either, but it symbolizes the transition to 'this data is available/stored beyond the mere confines of my fragile laptop'.

That said, I agree we can probably do without discard for now. One could imagine different workflows (e.g. 'archive' or 'publish' my current store, then wipe it from my machine), or using query to remove specific locations 'manually' (e.g. thinking of the git model, where it is 'hard' but not impossible to actually discard data).

> how and when to store provenance versions into the content store...

Yeah, I'm itching to get to provenance as well. I'm leaning towards starting a separate package, say, prov, that manages a provenance-based registry and depends on contenturi. contenturi would handle everything to do with the bits & bytes (hashing bytes to generate identifiers, storing/retrieving by identifier, querying by identifier); prov would merely refer to these identifiers. For many users/use-cases, prov would thus be more of the user-facing interface. Obviously this needs a lot more thinking through (at least on my end!), so this division would leave the narrow set of content-based operations register/query/store/retrieve in a relatively stable package (contenturi) while we mess around with the much harder stuff on semantics. I suspect we might actually end up with more than one such metadata/provenance package, each using contenturi on the back-end but offering a different metadata model (or perhaps those packages "stack", adding progressively more metadata).

> My fingers are itching to implement a registry and store implementation to allow for querying and retrieving all of GBIF, DataONE, BHL etc. via contenturi.

Me too! Actually, I think this is a great next step, as it is probably the best way for me to start wrapping my head around the preston provenance model as well. I'm thinking let's start with GBIF. I think we are very close to the point where I should be able to "pull in" or interact with a preston-generated store and a preston-generated registry of GBIF directly from contenturi (modulo some parsing of nquads...). Pointers on where we start would be very welcome!

@cboettig merged commit b908c7c into master on Mar 4, 2020
@jhpoelen (Collaborator) commented Mar 4, 2020

@cboettig

> you should never toss data

Was roughly my way of saying what you said: we can postpone implementing discard functionality until we need it. As far as publishing data to a store, I've been thinking of this in terms of a git repository: you can sync, push, or pull stores to/from some other store/registry location. And . . . this goes along with the git model in terms of provenance: if you remove data that is referenced in a provenance log . . . you might lose the ability to resolve the content hash URI.

re: postponing the provenance discussion
In my mind, the registry already captures basic provenance information (where did it come from? when was it registered?). However, as discussed, there are two pieces we need to keep the door open to a rich provenance universe: the event uuid and the provenance hash. The former can be seen as a unique identifier that allows for pointing at the registration event. The latter is a reference to a provenance log (= content) that may provide more context for the registration event than is already captured in the registry table. Without this uuid and provenance log pair, I would find it hard to keep this concept of registry-as-provenance-store alive. That said, I can imagine that a specific implementation of a registry might be implemented in a separate package.

> Pointers on where we start would be very welcome!

https://github.com/bio-guoda/preston-scripts/tree/master/query#select-only-idigbio--gbif--biocase contains some canned SPARQL queries + results that produce tables with urls, hashes, and timestamps. Note that activity uuids have since been introduced to have a unique uuid for each "registration" or "generation" event. If this pointer is too vague, please let me know what information you'd like in order to get access to the versioned GBIF registry as obtained by Preston, and I can throw something together.

@jhpoelen (Collaborator) commented Mar 4, 2020

Note that the stores are defined implicitly using an endpoint (e.g., https://deeplinker.bio) and a path convention (e.g., https://deeplinker.bio/aa/bb/aabb1234...) for retrieving hash://sha256/aabb1234... via HTTP GET.
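To make that convention concrete, a hypothetical helper (the name and the aa/bb sharding are assumptions taken from the example above):

hash_to_url <- function(id, endpoint = "https://deeplinker.bio") {
  hash <- sub("^hash://sha256/", "", id)
  # the first two byte-pairs of the hash become subdirectories
  paste0(endpoint, "/", substr(hash, 1, 2), "/", substr(hash, 3, 4), "/", hash)
}

hash_to_url("hash://sha256/aabb1234")
# [1] "https://deeplinker.bio/aa/bb/aabb1234"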

@cboettig (Owner, Author) commented Mar 5, 2020

Great, thanks for this, sounds like we are mostly on the same page on all this.

Re the preston definition of a store being endpoint + file convention -- yup, love this. I'm using the exact same convention in contenturi's store(). The data-tracker example in #16 takes advantage of this to push to github and use a github url as the 'base'. Clearly this works nicely on a typical webserver where the store is a public dir. contenturi::resolve(id, dir = basedir) should be able to retrieve files by id from a local preston store.

Clearly I'm still catching up on the prov side. To add a prov hash, we'd need to decide on what provenance we're always recording -- which is what, exactly? We've started discussing this in #5, but I don't think we've converged entirely there yet. In a similar vein, if we are keeping some core set of prov fields (say, for the sake of argument: uuid, content hash, location, generatedAtTime, mimeType, byteSize), it's not clear to me that we should be sticking uuid, hash, location, and prov-hash in a registry.tsv file and the rest in a prov.tsv file (which is hashed to create the prov-hash column). If that's all the provenance we're tracking, why not keep it in a single table (a la hash-archive.org)? I'm still hoping to dig more into the metadata/prov model SoftwareHeritage is using to describe registry events, and maybe other examples as well.
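For concreteness, a single row of such a single-table registry might look like this (uuid, hash, location, and timestamp lifted from the preston example below; mimeType and byteSize are made-up values):

uuid: 23ee3afa-8b78-4782-87c8-4f21272a36e4
hash: hash://sha256/ea8fac7c65fb589b0d53560f5251f74f9e9b243478dcb6b3ea79b5e36449c8d9
location: https://example.com
generatedAtTime: 2020-02-21T22:35:11.970Z
mimeType: text/html (illustrative)
byteSize: 1256 (illustrative)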

I am looking at the example you gave from preston:

<hash://sha256/ea8fac7c65fb589b0d53560f5251f74f9e9b243478dcb6b3ea79b5e36449c8d9> <http://www.w3.org/ns/prov#qualifiedGeneration> <23ee3afa-8b78-4782-87c8-4f21272a36e4> .
<23ee3afa-8b78-4782-87c8-4f21272a36e4> <http://www.w3.org/ns/prov#generatedAtTime> "2020-02-21T22:35:11.970Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<23ee3afa-8b78-4782-87c8-4f21272a36e4> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Generation> .
<23ee3afa-8b78-4782-87c8-4f21272a36e4> <http://www.w3.org/ns/prov#activity> <2d3aa8f5-28ac-48c1-9e53-7cf46b9bd757> .
<23ee3afa-8b78-4782-87c8-4f21272a36e4> <http://www.w3.org/ns/prov#used> <https://example.com> .

and while I see the value of capturing this in formal prov semantics, I think these triples are also implicit (i.e. derivable) from, say, the return object of https://hash-archive.org/api/history/https://example.com (which also makes additional statements that could obviously also be written out in prov RDF); I'm just trying to distinguish the discussion of form from the discussion of function.

uuid is something I'm afraid I'm similarly on the fence about. To put it another way, https://hash-archive.org does not give me a UUID for registration events. In what way does that hobble it, or make it fall short of your definition of a registry? It seems to me that the lack of UUIDs for hash-archive's registration events is something I can work around, but there are probably certain uses/tasks I haven't thought of where it becomes an issue.

Anyway, I think as far as store/retrieve goes, we are very closely aligned now. Clearly there's more to discuss on prov, but I'm not sure where the best thread for that is. Should we move this chat back over to #5, or start a new thread?

@cboettig deleted the store-retrieve branch on March 5, 2020 at 06:54