Software Heritage Interface #32
Conversation
sources_swh <- function(id, host = "https://archive.softwareheritage.org", ...){
Nice to see that the Software Heritage web API maps into the concept of source. Some questions:
- how do you imagine keeping track of who made the assertion that relates some web location to some content URI?
- how would calling source() work for reproducing lookups in offline workflows? I am trying to imagine how to account for the fact that web locations are likely to change and do not guarantee to serve the same content. In this case that applies to the web APIs (and, implicitly, the related "software agents") that produce the claims relating locations to some hash URI.
- what is the use case for using software hashes in R? Are you imagining loading packages from content URIs?
> id <- paste0("hash://sha256/9412325831dab22aeebdd",
+ "674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37")
> df <- sources_swh(id)
> df
identifier
1 hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37
source
1 https://archive.softwareheritage.org/api/1/content/sha256:9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37/raw/
date
1 2020-03-23 17:27:24
As you can see, sources_swh() returns a table with one entry. It asserts that the requested hash can be found in the Software Heritage Archive. The assertion is being made by Software Heritage, since it is coming from their API and pointing to their content (technically it's also being made by contenturi's parsing of an assertion made by SWH). But the whole point to me is that you don't have to trust any of that -- you can rely on the hash to verify the content is what you want.
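For illustration, here is a minimal sketch of that verification step (this is not the package's internal code; it simply re-downloads the bytes from the source above and re-computes their sha-256 with the digest package):

# download the bytes served at the reported source and check that their
# sha-256 matches the hash embedded in the content URI
id  <- "hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37"
url <- paste0("https://archive.softwareheritage.org/api/1/content/sha256:",
              sub("hash://sha256/", "", id), "/raw/")
tmp <- tempfile()
download.file(url, tmp, mode = "wb")
observed <- paste0("hash://sha256/", digest::digest(tmp, algo = "sha256", file = TRUE))
identical(observed, id)  # TRUE only if the served bytes match the requested hash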
- Recall that source("hash://sha256/xxx") returns a table of all known sources, including local sources, for a given hash. However, I imagine most users would use this only via resolve("hash://sha256/xxx"), which simply automates the process of (a) looking for a local copy that matches the hash first, and (b) then looking for online copies listed by source() (and verifying the hash of the content). Like you say, some sites will no longer have the content or it will have changed, and so resolve() will then try the next one until it gets cryptographic proof that it has the requested content. Of course this is not at all specific to Software Heritage. Note that Software Heritage is providing a content-based store: it is not returning random web addresses to GitHub or other locations, it is only returning the address of Software Heritage's own archival-quality copy. Time will tell of course, but Software Heritage is trying to be a permanent archive, like Zenodo, so I think it is perhaps one of the more robust stores we have.
- Sorry, I'm not sure what you mean by software hashes. Yeah, Software Heritage understands git-sha1 as well, but we aren't using those here. The functions in this PR work only with sha256 hashes of the content itself. They don't know anything about whether that content is 'software' or comes from git or anything else; it's just the sha-256 hash of the data file. I know they have "Software" in the name, but as you know, it's super common to put scientific data on GitHub (e.g. here's the Johns Hopkins team's COVID-19 data, as daily csv files on GitHub: https://github.com/CSSEGISandData/COVID-19). Of course I could submit the GitHub download URLs for that page to hash-archive.org, but that would be slow and wouldn't create an archival copy. Instead, I can submit that repo URL to store_swh() and SWH will create a permanent snapshot archive from which I can request any of those files by their (stand-alone) sha256 hash (a rough sketch follows below).
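A rough sketch of that workflow (hypothetical usage only; the store_swh() and sources_swh() signatures are the ones described in this PR, and the sha-256 placeholder stands in for the hash of one of the daily csv files):

# ask Software Heritage to snapshot-archive the whole repository
store_swh("https://github.com/CSSEGISandData/COVID-19")
# once the snapshot exists, any individual file can be looked up (and later
# resolved) by its stand-alone sha-256 hash
sources_swh("hash://sha256/<sha256-of-one-of-the-daily-csv-files>")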
Thanks for the reply. I think we are in agreement, and might be talking about different things. Perhaps this comes back to the discussion around provenance stores (aka registries) and content stores.
It asserts the requested hash can be found in the Software Heritage Archive. The assertion is being made by Software Heritage, since it is coming from their API and pointing to their content (technically it's also being made by contenturi's parsing of an assertion being made by SWH).
I think the key here is that the local registry does not explicitly keep track of who made the assertion. The connection between the SWH API reply and the statement extracted from that reply appears to be lost. So, yes, I can independently verify that:
$ curl --silent "https://archive.softwareheritage.org/api/1/content/sha256:9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37/raw/" | sha256sum
9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37 -
However, I can't see where that claim came from originally (was it derived from a local process that retrieved the content and calculated the hash? or was the location-hash claim derived from some response JSON retrieved from a web API?). I think the provenance of the claim is important because I imagine that data publications would include a content store and a provenance store (aka registry) to help trace the origin of content. In offline scenarios, you can't independently verify that hashes were sourced from specific locations, so you'd have to rely on the provenance source to refer to the context that the content exists in. I think this context is important because it helps to relate (and cite!) content to its natural habitat and the humans that take care of it.
And related, 2. (using sources in an offline scenario): I agree that the current lookup table provides a way to look up locations that served some hash at some point in time. Perhaps I should rephrase the question: how do you attribute the sources of specific content, and the process that related the sources to the content, in an offline scenario? The provenance (aka registry) store and content store that keep the claims and content, respectively, made some effort to keep the content around and . . . should be attributed for that somehow. I think this would also help establish a trust relationship and make informed decisions about which sources to cite. For instance, if GBIF is known to be a registry of dataset locations that produce changing content (by design), I'd rather choose locations that are associated with data networks that have (measurably) been shown to produce stable content over time, and I'd have to have some sort of way to group sources by the network (or project) they were registered in.
About the "software" hashes: I agree that SWH archives anything in version-controlled repos, and these repositories could include both software and data, as your example shows. Perhaps I should rephrase my comment into a question: can we use R packages archived by SWH as a way to establish some sort of decentralized, hash-versioned CRAN?
Thanks for being patient with me as I am trying to figure out the minimal information needed to keep track of the origin of content.
Thanks for the replies!
Re "software" hashes. Yes, I think we could -- a 'decentralized, hash-versioned CRAN' is a very appealing concept, but it would take some thinking through, at least for me. I.e. I think the natural thing to hash would be the tar.gz package that gets installed, and I don't think that's a hash that is available via SWH. SWH gives us all the individual file hashes, but the tarball doesn't exist as a blob in its content store...
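For example, a quick way to see which hash a hash-versioned CRAN would actually need (a sketch only: it downloads a source tarball from CRAN with base R's download.packages() and hashes it with the digest package; the point is that this tarball hash is not among the per-file hashes SWH records):

# the hash of interest is that of the source tarball itself
info    <- download.packages("tibble", destdir = tempdir(),
                             repos = "https://cloud.r-project.org")
tarball <- info[1, 2]   # path of the downloaded .tar.gz
paste0("hash://sha256/", digest::digest(tarball, algo = "sha256", file = TRUE))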
Re citation and provenance, yes, this is a very interesting discussion; it relates to our ongoing discussion about provenance in #5 and elsewhere. I think that discussion is probably more general than the SWH issue anyway. It might be helpful for me to better understand the use cases enabled by, say, knowing that the assertion that https://archive.softwareheritage.org/api/1/content/sha256:9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37/raw/ serves content with sha256sum 9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37 was made by Software Heritage software rather than by contenturi (running a specific version of libcurl on a particular architecture, etc).
I'm not trying to be dismissive of concerns of attribution, provenance, and citation. These are very important issues; I am just afraid of failing to do them proper justice. I am concerned that none of what I could programmatically capture from this request for said sha hash would answer that question. The data itself has a DOI: https://data.ess-dive.lbl.gov/view/doi:10.3334/CDIAC/ATG.009, though that sits behind a website that requires logins for downloads, so most sources, including NASA's and NOAA's own public webpages, usually point to CDIAC's FTP server instead, https://cdiac.ess-dive.lbl.gov/ftp/trends/co2/vostok.icecore.co2, though the content can be found in other places (now including Software Heritage), some of which are in the hash-archive.org record. Some citation metadata is embedded in the document itself, though the official records (i.e. the LBL DOI location and the previous landing page: https://cdiac.ess-dive.lbl.gov/trends/co2/ice_core_co2.html) request a citation to:
Petit J.R., Jouzel J., Raynaud D., Barkov N.I., Barnola J.M., et al. 1999, Climate and Atmospheric History of the Past 420,000 years from the Vostok Ice Core, Antarctica, Nature 399: 429-436.
This is where things get really fun, because (a) we can see that citation is from 1999, and from the FTP site we can find vostok.icecore_old.co2, which is also dated 1999, which is different from the content above, which refers to an updated version from 2003. That Nature paper from 1999 of course has a DOI, https://doi.org/10.1038/20859, which, wait for it, is throwing a 404 error at the moment (nor does a Google search currently turn up this paper on Nature's own site, though it will find many other copies of the Nature paper).
So apologies for the long tangent, but Petit et al., and the earlier Nature paper that provided the original first half of the data, https://doi.org/10.1038/329408a0, are perhaps the canonical citations that have most of the critical metadata for interpreting where this data really came from, how it was measured, how much you should trust it, etc. A researcher could probably glean this information from the embedded header in the file itself and a little googling, but I am not sure how much tracking down the provenance trail of the SWH copy would help with that.
inst/examples/dataone.R
Outdated
library(tibble)
library(contenturi)

resp <- httr::GET("https://cn.dataone.org/cn/v2/query/solr/?q=datasource:*KNB&fl=identifier,checksum,checksumAlgorithm,replicaMN&wt=json")
Cool to see DataONE make an entry into the package. Did you intend to include this in the pull request?
Right, this belongs on the dataone branch; that was just me being sloppy with branches. It's just an example for proof-of-principle / exploration. Here's what I've learned so far; would love to compare notes!
- DataONE will already return hashes for all of its objects, but the different repos use different algorithms. MD5 is the most common.
- the member node download URLs aren't always stable; some are more stable than others. The most recent ones are usually (but not always -- the central node may have to notify the member nodes if they move content and don't update the list) returned by that Solr query.
- Some of the DataONE objects are quite large. There are about 2.5 million objects listed from the above Solr query; ~4 are over a TB each, and the list there totals about 61 TB. That's because it doesn't include the largest objects on DataONE, which have metadata files on record but no download URLs; I believe the total size of DataONE is over 1 PB. Even hashing the 61 TB independently will probably take some time.
Anyway, would love to brainstorm more about this, including whether there's a way to make use of the pre-computed hashes that aren't sha256 -- we can move that into a separate issue thread.
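For reference, a sketch of how that Solr response can be flattened into a table (this assumes the standard Solr JSON layout, response$response$docs, and uses a trimmed-down version of the query from the example file; it is not the code in the PR):

library(httr)
library(jsonlite)
library(tibble)
resp <- httr::GET(paste0(
  "https://cn.dataone.org/cn/v2/query/solr/",
  "?q=datasource:*KNB&fl=identifier,checksum,checksumAlgorithm&wt=json&rows=10"))
# the Solr 'docs' array simplifies to a data.frame with one row per object
docs <- jsonlite::fromJSON(httr::content(resp, "text"))$response$docs
tibble::as_tibble(docs)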
Thanks for sharing your insights on DataONE - especially the wide range of object sizes. Having a size estimate like this helps to have more detailed conversations about the data network, especially around the resources (e.g., money, time, bandwidth, storage) needed to find, access, move, and archive the associated content.
For the md5 (or other) hashes, I can imagine using hash://md5/1234... notation. Happy to contribute to the brainstorm, especially if there's a clear use case around some content registered and accessed through the DataONE network.
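To make that concrete, a small hypothetical helper (the function name and example checksum are made up for illustration) that turns DataONE's pre-computed checksums into hash URIs:

# map a (checksum, algorithm) pair from the Solr results to a hash URI
as_hash_uri <- function(checksum, algorithm) {
  prefix <- switch(tolower(gsub("-", "", algorithm)),
                   md5    = "hash://md5/",
                   sha1   = "hash://sha1/",
                   sha256 = "hash://sha256/",
                   stop("unsupported checksum algorithm: ", algorithm))
  paste0(prefix, tolower(checksum))
}
as_hash_uri("9e107d9d372bb6826bd81d3542a419d6", "MD5")
# "hash://md5/9e107d9d372bb6826bd81d3542a419d6"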
inst/examples/dataone.R
Outdated
otherwise = as.character(NA))

register_remote <- purrr::possibly(function(x) contenturi::register(x, registries = "https://hash-archive.org"),
Neat to see the re- and cross-registration of existing DataONE-registered content in local and remote registries.
Merging just so I can reduce my branches here, since I think there are issues still to discuss about provenance generally, but they aren't specific to this PR.
This is an initial pass at adding support for Software Heritage API, #19
This provides sources_swh(), retrieve_swh(), store_swh(), and history_swh().

sources_swh() works like any sources() generic in that it takes a content id and returns the Software Heritage location and date of the content (note that because the SWH store is archival and content-based, we can be quite confident we get the content requested, as opposed to, say, getting back whatever URL location hash-archive.org happens to list). retrieve_swh() provides the connection URL to the bits and bytes requested.

store_swh() and history_swh() work slightly differently than the standard store() and history(), in that they don't take arbitrary URLs. Instead, the URL must point to the origin of a git, hg (Mercurial), or svn repo. (GitHub & GitLab repos are automatically archived when requested; others go into a manual queue.) store_swh() will trigger SWH to snapshot-archive that whole repo, computing the sha-256 hashes we can later use to query that content. history_swh() gives the history of requested archive visits. There's no register_swh(), as you cannot register a hash with SWH without also storing the content in their permanent archive.

The SWH API is much richer than this set of functions, but these are the ones that fit most closely to the existing verbs. While this could be added as an optional extension, I think the SWH functions are particularly compelling as they give contenturi users a possible archival storage option out-of-the-box.
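To summarize, a hypothetical end-to-end session with these verbs (the origin URL is a placeholder, the hash is the example used earlier in this thread, and the signatures are assumed from the description above):

origin <- "https://github.com/<some-user>/<some-repo>"   # placeholder git origin
store_swh(origin)      # ask SWH to snapshot-archive the whole repository
history_swh(origin)    # history of SWH archive visits for this origin
id <- "hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37"
sources_swh(id)        # table of SWH location(s) and date(s) for this content
retrieve_swh(id)       # URL serving the raw bytes of this content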