store and retrieve as simple inverses #23
I like the symmetry and simplicity of the store/retrieve pair. As far as different implementations go: as long as the client can configure the kind of store without having to change the interface (e.g., store/retrieve function names + arguments), you can go to town with implementations. In elton/preston implementations, I've built stores that use tarballs, url patterns, pull-through (~ a local store backed by remotes, where only retrieved bits 'n bytes are kept locally), etc. Perhaps it would be nice to separate the read-only interface (retrieve only) from the read/write one (retrieve/store), to simplify the most common use case: retrieving data. Have you given thought to how and when to store provenance versions into the content store? As far as the bag-it thing goes... the manifest would pretty much map hashes to hashes, but I can see how the redundancy would make the file naming convention more explicit using an existing bag format. My fingers are itching to implement a registry and store implementation to allow for querying and retrieving all of GBIF, DataONE, BHL etc. via contenturi.
👍 me too!
Agree 💯. Currently, "retrieve-only" is handled (at least in part) by
of course! good call.
I think it's a mistake to see data stored on the local hard disk of some machine as in any way "permanent". Disks have finite capacity, disks fail, and the machine could be a cloud machine that exists only while computation is running (or while bills are being paid). That said, I agree we can probably do without
Yeah, I'm itching to get to provenance as well. I'm leaning towards starting a separate package, say,
Me too! Actually I think this is a great next step, as it is probably the best way for me to start wrapping my head around the
Re: postponing the provenance discussion, that was roughly my way of saying what you said: we can postpone implementing it.
https://github.com/bio-guoda/preston-scripts/tree/master/query#select-only-idigbio--gbif--biocase contains some canned sparql queries + results that produce tables with urls, hashes, and timestamps. Note that activity uuids have since been introduced so that each "registration" or "generation" event has a unique uuid. If this pointer is too vague, please let me know what information you'd like in order to get access to the versioned GBIF registry as obtained by Preston, and I can throw something together.
Note that the stores are defined implicitly using an endpoint (e.g., https://deeplinker.bio) and a path convention (e.g., https://deeplinker.bio/aa/bb/aabb1234....) for retrieving hash://sha256/aabb1234.... via HTTP GET.
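For illustration, that convention amounts to a pure function from a hash URI to an HTTP GET location. A minimal R sketch (the function name `hash_to_url` is made up here, but the path layout follows the example above):

```r
# map a hash URI onto an endpoint using the aa/bb/<full-hash> convention
hash_to_url <- function(id, endpoint = "https://deeplinker.bio") {
  hash <- sub("^hash://sha256/", "", id)
  paste(endpoint, substr(hash, 1, 2), substr(hash, 3, 4), hash, sep = "/")
}

# hash_to_url("hash://sha256/aabb1234")
# -> "https://deeplinker.bio/aa/bb/aabb1234"
```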
Great, thanks for this, sounds like we are mostly on the same page on all this. Clearly I'm still catching up on the prov side. To add a prov hash, we'd need to decide on what provenance we're always recording, and I'm not sure yet what that is. We've started discussing this in #5, but I don't think we've converged entirely there yet. In a similar vein, if we are keeping some core set of prov fields (say, for the sake of argument: uuid, content hash, location, generatedAtTime, mimeType, byteSize), it's not clear to me that we should be sticking, say, uuid, hash, location, and prov-hash in a
I am looking at the example you gave from preston, and while I see the value of capturing this in formal prov semantics, I think these triples are also implicit (i.e. derivable) from, say, the return object of https://hash-archive.org/api/history/https://example.com (which also makes additional statements that obviously could also be written out in prov RDF). I'm just trying to distinguish the discussion of form from the discussion of function. uuid is something I'm afraid I'm similarly on the fence about. To put it another way, https://hash-archive.org does not give me a UUID for registration events. In what way does this hobble it, or make it fall short of your definition of a registry? It seems to me that the lack of UUIDs for hash-archive's registration events is something I can work around, but there are probably certain uses/tasks that I haven't thought through where it becomes an issue. Anyway, I think as far as
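For reference, a minimal sketch of pulling that history object into R, assuming the `jsonlite` package; the shape of the returned object is whatever the service emits and isn't modelled here:

```r
library(jsonlite)

# fetch the registration history for a URL from hash-archive.org
events <- fromJSON("https://hash-archive.org/api/history/https://example.com")
str(events)  # inspect the fields the service actually returns
```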
This PR is a rethink on `store` and `retrieve` that I think is closer to the original proposal.

`store` is a stupidly simple function that takes bits & bytes (specifically, a generalized connection object, url, or path, see #22) and copies those bytes into a local on-disk cache using the preston content-based naming scheme. This means URL content gets downloaded (to tempdir), hashed, and copied into the store; local files just get copied; nothing is copied if content already exists (as determined by hashing). `store` returns the content identifier URI.

`retrieve` is the trivial inverse of this: it takes a content identifier (URI) and returns the content by parsing the identifier into the location.

Note that there's no registry being created in either case, and there's no metadata here. Obviously those things are important, but they would need to be handled explicitly or wrapped into a higher-level function -- this pair of functions is all about storing and retrieving content and nothing more.
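To make the intended semantics concrete, here's a rough sketch of the pair as described above, assuming a preston-style `<store>/<aa>/<bb>/<hash>` layout and the `openssl` package for hashing; all names and defaults are illustrative, not the actual package code:

```r
library(openssl)

# hex-encode a raw hash as a single lowercase string
as_hex <- function(h) paste(as.character(h), collapse = "")

store <- function(x, dir = "~/.content-store") {
  # URL content gets downloaded to a tempfile first; local paths are used as-is
  if (grepl("^https?://", x)) {
    tmp <- tempfile()
    download.file(x, tmp, quiet = TRUE)
    x <- tmp
  }
  hash <- as_hex(sha256(file(x, raw = TRUE)))
  dest <- file.path(dir, substr(hash, 1, 2), substr(hash, 3, 4), hash)
  # nothing is copied if the content already exists in the store
  if (!file.exists(dest)) {
    dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
    file.copy(x, dest)
  }
  paste0("hash://sha256/", hash)  # return the content identifier URI
}

retrieve <- function(id, dir = "~/.content-store") {
  # the trivial inverse: parse the identifier back into its on-disk location
  hash <- sub("^hash://sha256/", "", id)
  file.path(dir, substr(hash, 1, 2), substr(hash, 3, 4), hash)
}
```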
`store()` no longer explicitly tries to set the store up as a `bagit` bag, but the preston-style file structure can easily be wrapped up this way. I think this could be quite useful if one wanted to publish or move around the store, so I've provided the (not yet exported) function `bagit_manifest()`, which generates the `manifest-sha256.txt` and `bagit.txt` for `store`'s directory. Not sure how useful that is, but it's pretty trivial to do. I do want to think a bit more about workflows for easily publishing a 'content store' like this. (Again, this is probably only useful to anyone once it is coupled with a metadata registry, but I think there's value in keeping those as separate modules.)
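For reference, a sketch of roughly what such a manifest generator could do, continuing the sketch above (reusing the hypothetical `as_hex` helper). Per the BagIt spec, `bagit.txt` is a fixed two-line declaration and `manifest-sha256.txt` holds one `<checksum> <path>` line per payload file; since file names in this store are themselves hashes, the manifest does indeed mostly map hashes to hashes:

```r
bagit_manifest <- function(dir = "~/.content-store") {
  # payload files, relative to the bag; skip any previously written tag files
  rel <- setdiff(list.files(dir, recursive = TRUE),
                 c("manifest-sha256.txt", "bagit.txt"))
  sums <- vapply(file.path(dir, rel),
                 function(f) as_hex(sha256(file(f, raw = TRUE))),
                 character(1))
  writeLines(paste(sums, rel), file.path(dir, "manifest-sha256.txt"))
  writeLines(c("BagIt-Version: 0.97",
               "Tag-File-Character-Encoding: UTF-8"),
             file.path(dir, "bagit.txt"))
}
```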
In addition to store and retrieve, even a simple store like this needs a few more capabilities, e.g. `inventory()` & `discard()` or something like that to list what's in the store and remove one or more blobs from the store. Suggestions for these verbs welcome. (See #13.)
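Concretely, one possible (purely illustrative) shape for those two verbs, continuing the sketch above:

```r
inventory <- function(dir = "~/.content-store") {
  # every blob's file name is its hash, so listing the store is enough
  blobs <- setdiff(list.files(dir, recursive = TRUE),
                   c("manifest-sha256.txt", "bagit.txt"))
  paste0("hash://sha256/", basename(blobs))
}

discard <- function(id, dir = "~/.content-store") {
  # remove a single blob by its content identifier
  hash <- sub("^hash://sha256/", "", id)
  file.remove(file.path(dir, substr(hash, 1, 2), substr(hash, 3, 4), hash))
}
```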
I wonder if it would be better to name these functions `dsk_store` and `dsk_retrieve` or something similar, to signal that they are but one possible implementation of a store (i.e. one built on local disk storage). We could then obviously implement other kinds of stores. Then the generic `store()` and `retrieve()` could operate over multiple stores at once, like `register()` and `query()` already do. Only I'm not really sure that is a good idea. I do think I might want to `retrieve` content by looking across multiple stores, though I think that action needs a higher-level verb. I'm less sure I would ever want to simultaneously try to `store()` content in multiple stores; that seems misguided.
So I'm sticking with `store()` and `retrieve()` as just being the local disk-based storage functions, but also thinking about defining a new verb, `resolve()`, which is charged with the task of taking a content identifier and "resolving" (retrieving?) those bits and bytes from whatever stores it knows about (actually by `query`ing registries of "stores"). Any thoughts on `resolve()`? (Yes, this is rebranding my `retrieve()` from #15.)
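To make the `resolve()` idea concrete, here's a hedged sketch that just tries each known store in turn; the hard-coded store list is a stand-in for what would in practice come from `query`ing registries of stores:

```r
resolve <- function(id,
                    stores = c("~/.content-store", "https://deeplinker.bio")) {
  hash  <- sub("^hash://sha256/", "", id)
  shard <- file.path(substr(hash, 1, 2), substr(hash, 3, 4), hash)
  for (s in stores) {
    if (grepl("^https?://", s)) {
      # remote store: endpoint + path convention; keep the first success
      tmp <- tempfile()
      ok <- tryCatch({
        download.file(paste(s, substr(hash, 1, 2), substr(hash, 3, 4), hash,
                            sep = "/"), tmp, quiet = TRUE)
        TRUE
      }, error = function(e) FALSE, warning = function(w) FALSE)
      if (ok) return(tmp)
    } else if (file.exists(file.path(s, shard))) {
      return(file.path(s, shard))  # local store hit
    }
  }
  stop("no known store could resolve ", id)
}
```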