Consider support for aliases #79
👍 Nice! I was wondering whether the term "nick" or "nickname" is a little friendlier than the more technical "alias" (see foaf:nick at http://xmlns.com/foaf/spec/#term_nick).
fyi @mielliott
I'm pretty concerned about (1). To store the alias-hash mapping outside of the script or project breaks reproducibility. Saving in the local working directory is OK. I would try to think of a file organization convention that is transparent and could work with hand-editing and a manually coded workflow, so it's obvious to most users what is going on, and then just think of a convenience function on top of that. I'm actually curious about the premise. Have users expressed this to you? Yes, you want to have readable names for the files, but how is that different than just having a few
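A minimal sketch of one such transparent, hand-editable convention (the file name `aliases.tsv`, its columns, and the example entry are hypothetical, not part of contentid):

```r
# Hypothetical convention: a plain, hand-editable TSV kept in the project root.
#
# aliases.tsv
#   name    id
#   fish    hash://sha256/<full-sha256-hex-digest>
#
# A manually coded workflow then needs nothing beyond base R plus contentid:
aliases <- read.delim("aliases.tsv", stringsAsFactors = FALSE)
path <- contentid::resolve(aliases$id[aliases$name == "fish"])
```

A convenience function would only have to wrap those two lines.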
@noamross yeah, me too. Some thoughts:

First, note we could completely alleviate that risk if the alias file itself is referenced by its content identifier and not a local path. The obvious downside there is that the user is now responsible for making their alias file discoverable somehow. Tossing it in your GitHub repo and triggering a SoftwareHeritage snapshot is probably the easiest option, but probably still unattractive to users. Which is to say: yeah, for lightweight use at least, I think I agree that some calls like

Just for motivating the discussion though, the alias sheet is basically the same concept as the 'data manifest' we discussed earlier. E.g. consider the case of an R package like

The literal hashes are not embedded into the code file that actually calls

@jhpoelen I'm ok with
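The first idea above, referencing the alias file by its own content identifier, could look roughly like this (the hash value and the column names of the alias sheet are placeholders):

```r
library(contentid)

# The alias sheet is itself content-addressed, so the script embeds exactly one hash.
alias_sheet_id <- "hash://sha256/<sha256-of-the-alias-sheet>"   # placeholder

aliases <- read.delim(resolve(alias_sheet_id), stringsAsFactors = FALSE)

# Every other object is then referenced by a human-readable name:
fish_path <- resolve(aliases$id[aliases$name == "fish"])
```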
I am continuing to have fun experimenting with aliases; see the example on creating a (versioned) alias pointing to a part of a file, in this case a bee name: bio-guoda/preston#135 (comment). Also,
Yes! In Preston, the alias is automatically added to a new version of the publication.

Because I like your suggestion and agree that

This makes me wonder: aren't URLs and filenames just aliases for specific content? Why treat a new name (e.g.,

In the end, the difference is the process which helps describe the relation between some name (or alias) and some specific content. For downloads this is a whole chain of actors (e.g., web server, DNS, firewall, proxy), whereas the explicit alias might only depend on some offline process or individual actor. For both cases, the end result is a statement saying: this name is associated with that content (modeled as a process activity in the provenance log).
@jhpoelen but the problem is that both names and urls are sometimes aliases for different "versions" of the "same" content (aka different content!), but other times they are aliases for static, unchanging content. Our use of filename and URL semantics is invariably imprecise on this point.
I think this makes sense, but sometimes I'm a bit unclear on how that universe is defined, e.g. if I run the command on a different computer, or put
Yes, aliases are names, and the meaning of a name lies in its relation to the context in which it exists. In the Preston implementation, this context is the provenance log, allowing non-unique aliases to exist explicitly in well-defined statements in the content universe (e.g., at position (a time proxy) X in prov log Y, alias Z points to hash A, where A and Z are content ids).
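Very roughly, such a statement can be pictured as a single quad inside provenance log Y; a sketch, with placeholder IRIs rather than Preston's exact vocabulary:

```r
# "at position X in prov log Y, alias Z points to hash A", as one N-Quads-style line;
# the graph label (fourth term) supplies the context, i.e. the provenance log Y.
statement <- sprintf(
  "<%s> <%s> <%s> <%s> .",
  "urn:example:alias:Z",               # the alias, itself a content id in Preston
  "http://purl.org/pav/hasVersion",    # a "points to" relation (illustrative predicate)
  "hash://sha256/<content-id-A>",      # the content the alias names
  "hash://sha256/<id-of-prov-log-Y>"   # the provenance log that gives the statement context
)
cat(statement, "\n")
```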
You can publish / distribute a Preston dataset by copying the

A (remote) Preston "push" is not yet implemented, because this can be done with existing copy tools (e.g.,

A Preston "pull" has been implemented in the

For examples see https://github.com/bio-guoda/preston/#archiving and more.
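Because the store is just a directory of content-addressed files, a "push" amounts to a copy; a rough sketch (paths are placeholders, and any external copy tool does the same job):

```r
src  <- "data"                         # local Preston store
dest <- "/mnt/shared/preston/data"     # placeholder destination

rel <- list.files(src, recursive = TRUE)   # relative paths like "1a/a3/1aa34112..."
for (f in rel) {
  dir.create(dirname(file.path(dest, f)), recursive = TRUE, showWarnings = FALSE)
  file.copy(file.path(src, f), file.path(dest, f), overwrite = FALSE)
}
```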
Yes, totally, I get this. But as you note, my preston
Yep, we are a little off-course 😕, but I think this conversation is crucial to making aliases well-defined, useful, and shareable. Right now, the
These can all be stored separately. So, for instance, you can have a super-lightweight Preston repo with a 78-byte pointer in it, and then a cascading sequence of remotes that store the provenance logs and the content separately:

`preston clone --remote https://mysuperlightweight.info/,https://provenance.store.bio/,https://archive.org`

where each remote can serve a different level: the tiny pointer (history) files, the provenance logs, and the content itself.
When implementing my first pass at remote archives in Zenodo, I used these three levels to speed up performance: the link files were stored as-is, just like the provenance logs, and the content was packaged up in tarballs segmented by hash prefix. This way, the small link and provenance files stay directly addressable while the bulk content is batched. An example is:
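Separately, a sketch of what "segmented by hash prefix" could look like in practice; the two-character prefix and the tarball naming are assumptions, not the actual Zenodo layout:

```r
# Guess which prefix-segmented tarball would hold a given content hash,
# assuming one tarball per two-character hash prefix.
tarball_for <- function(hash_uri) {
  hex <- sub("^hash://sha256/", "", hash_uri)
  paste0("preston-", substr(hex, 1, 2), ".tar.gz")
}

tarball_for("hash://sha256/2a5de79372318317a382ea9a2cef069780b852b01210ef59e06b640a3539cb5a")
#> [1] "preston-2a.tar.gz"
```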
Perhaps a good way to try this is to set up a mirror in your lab... I'd welcome the duplication and donated (shared) server space ; )
The great thing about it is that you don't have to give me access to anything: you can simply run
Both the history files (or provenance links) and the provenance logs are UTF-8 encoded text files. The third layer, the "content", is just a bunch of bits and bytes (content agnostic) as far as Preston, or anyone else, is concerned. While the provenance files are RDF (N-Quads), I usually just use grep and friends to discover the provenance logs. For more rigorous analysis, I load the logs into a triple store. So, I think the setup is pretty platform agnostic.

@cboettig curious to hear whether I have addressed your concerns... or perhaps raised new ones ; )
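For instance, a minimal R version of the "grep and friends" approach, assuming the provenance files sit under a local `data/` directory (the predicate searched for is only an example):

```r
# Read every stored object as text and grep for statements of interest;
# non-text content simply produces lines that never match.
files <- list.files("data", recursive = TRUE, full.names = TRUE)

lines <- unlist(lapply(files, function(f) {
  suppressWarnings(tryCatch(readLines(f, warn = FALSE), error = function(e) character(0)))
}))

grep("hasVersion", lines, value = TRUE)   # e.g. pull out version statements
```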
@cboettig The easiest way to grab just the provenance files out of a remote is to run something like:

$ preston ls --remote https://raw.githubusercontent.com/bio-guoda/preston-amazon/master/data/ > /dev/null
$ ls data/*/*/*
data/1a/a3/1aa34112ade084ccc8707388fbc329dcb8fae5f895cb266e3ad943f7495740b3
data/2a/5d/2a5de79372318317a382ea9a2cef069780b852b01210ef59e06b640a3539cb5a
data/59/15/5915dffe1569ccd29cc7f6b8aea1341754318d206fef8daf225d8c40154ef8be
data/62/95/6295d7136ff2652a7849262c84af85244688fc13689791c058ae41c44dd0af4a
data/d7/b7/d7b73e3472d5a1989598f2a46116a4fc11dfb9ceacdf0a2b2f7f69737883c951
data/d8/f7/d8f72bd865686e99eac413b36d198fd15f305966d2864091320f1868279451ff

where all of those files are just the provenance/hexastore files; no content has been pulled yet.

To tie back to the original topic 😉: after getting just the provenance/hexastore files, the alias statements they record are already available locally. I hope this helps!
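For reference, the mapping from a hash URI to one of those paths is just a two-level prefix split; a small sketch (the helper name is mine), using the first hash listed above:

```r
# data/<first two hex chars>/<next two hex chars>/<full hash>
local_path <- function(hash_uri, base = "data") {
  hex <- sub("^hash://sha256/", "", hash_uri)
  file.path(base, substr(hex, 1, 2), substr(hex, 3, 4), hex)
}

local_path("hash://sha256/1aa34112ade084ccc8707388fbc329dcb8fae5f895cb266e3ad943f7495740b3")
#> [1] "data/1a/a3/1aa34112ade084ccc8707388fbc329dcb8fae5f895cb266e3ad943f7495740b3"
```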
I still really like this thread, but I am digesting it slowly. Is it accurate to say that an

So

Is there an inverse operation, i.e. how do I ask Preston for known aliases of some content hash? (Such a reverse operation could also be used/abused to associate additional metadata with a given hash.) Or is that what you would do with

To tie into #69, one could imagine similar operations in which
I'm still on the fence with aliases as part of the contentid API. It is simple enough for a user to maintain their own look-up table mapping aliases to identifiers suited to their needs. Alternatively, a metadata record will often contain both object names (i.e. user-friendly aliases) and ids: for instance, a http://schema.org/DataDownload has an http://schema.org/id, which can be a hash id, and a http://schema.org/name, which can be an alias. It seems to me the natural way to use aliases would be to refer to this.
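A sketch of that alternative, pulling the alias straight out of a metadata record (the JSON below is a minimal, made-up schema.org DataDownload, and jsonlite is assumed to be available):

```r
library(jsonlite)

record <- fromJSON('{
  "@type": "DataDownload",
  "name": "fish",
  "id":   "hash://sha256/<sha256-of-the-content>"
}')

record$name   # the user-friendly alias
record$id     # the hash id that actually gets resolved
# contentid::resolve(record$id)   # would fetch and verify once a real id is in place
```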
A major limitation in the current model is that many users are reluctant to deploy long hashes in code: a call spelling out a full `hash://sha256/...` identifier looks rather cumbersome. Assigning aliases could work around this. This is not dissimilar to the use of aliases in `pins` or `storrr`, but in our case, the alias does not become the primary key for the data. The alias is merely a shorthand for the hash.

`alias(id, name)` would create an entry in a local file (`tsv` maybe?) associating the alias with the id. `resolve` would detect if a string was an alias (do we namespace aliases, or merely attempt to resolve anything that doesn't start with `hash://` as a potential alias reference?), and if so, attempt to translate it into the corresponding hash and resolve that as usual. `resolve` would gain an optional argument, `aliases`, to locate the alias file, with a simple default location. (A rough sketch of both functions follows at the end of this post.)

Issues:
1. Where to keep the alias file: a project-local file is the obvious default (rather than a central location like `~/.share/local/R/contentid`), though users may want to utilize aliases across projects.
2. Instead of a bespoke `tsv`, the alias could be recorded as a `schema:name` in a metadata record. This may have much added utility in figuring out what's what, and open the door to using other file metadata (filename, format, description, author, etc.) as a mechanism to resolve hashes. The downside is that parsing overhead may degrade performance, and greater complexity of implementation = more room for errors etc.

Inspired by `preston` bio-guoda/preston#135. cc @jhpoelen
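For concreteness, a minimal sketch of the two proposed functions; the TSV format, default file name, and `hash://` prefix check mirror the proposal above, and none of this is implemented in contentid:

```r
# Proposed alias(id, name): append one name -> id row to a local TSV.
alias <- function(id, name, aliases = "aliases.tsv") {
  has_file <- file.exists(aliases)
  write.table(data.frame(name = name, id = id), aliases, sep = "\t",
              row.names = FALSE, col.names = !has_file, append = has_file)
}

# Proposed resolve() behaviour: anything not starting with hash:// is first
# looked up in the alias file, then resolved as usual.
resolve_alias <- function(x, aliases = "aliases.tsv") {
  if (!grepl("^hash://", x) && file.exists(aliases)) {
    tab <- read.delim(aliases, stringsAsFactors = FALSE)
    x <- tab$id[match(x, tab$name)]
  }
  contentid::resolve(x)
}
```

Usage would then look like `alias("hash://sha256/<full-hash>", "fish")` followed by `resolve_alias("fish")`.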