Software Heritage Interface #32
Conversation
sources_swh <- function(id, host = "https://archive.softwareheritage.org", ...){
Nice to see that the Software Heritage web API maps into the concept of source. Some questions:
- how do you imagine keeping track of who made the assertion that relates some web location to some content URI?
- how would calling source() work for reproducing lookups in offline workflows? I am trying to imagine how to account for the fact that web locations are likely to change and do not guarantee to serve the same content. In this case that applies to the web APIs (and, implicitly, the related "software agents") that produce the claims relating locations to some hash URI.
- what is the use case for using software hashes in R? Are you imagining loading packages from content URIs?
> id <- paste0("hash://sha256/9412325831dab22aeebdd",
+ "674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37")
> df <- sources_swh(id)
> df
identifier
1 hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37
source
1 https://archive.softwareheritage.org/api/1/content/sha256:9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37/raw/
date
1 2020-03-23 17:27:24
As you can see, sources_swh() returns a table with one entry. It asserts that the requested hash can be found in the Software Heritage Archive. The assertion is being made by Software Heritage, since it is coming from their API and pointing to their content (technically it's also being made by contenturi's parsing of an assertion made by SWH). But the whole point to me is that you don't have to trust any of that -- you can rely on the hash to verify the content is what you want.
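For illustration, here is a minimal sketch of that verification step (this is not the package's internal code; it simply re-downloads the bytes from the source above and re-computes their sha-256 with the digest package):

# download the bytes served at the reported source and check that their
# sha-256 matches the hash embedded in the content URI
id  <- "hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37"
url <- paste0("https://archive.softwareheritage.org/api/1/content/sha256:",
              sub("hash://sha256/", "", id), "/raw/")
tmp <- tempfile()
download.file(url, tmp, mode = "wb")
observed <- paste0("hash://sha256/", digest::digest(tmp, algo = "sha256", file = TRUE))
identical(observed, id)  # TRUE only if the served bytes match the requested hash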
- Recall that source("hash://sha256/xxx") returns a table of all known sources, including local sources, for a given hash. However, I imagine most users would use this only via resolve("hash://sha256/xxx"), which simply automates the process of (a) looking for a local copy that matches the hash first, and (b) then looking for online copies listed by source() (and verifying the hash of the content). Like you say, some sites will no longer have the content or it will have changed, and so resolve() will then try the next one until it gets cryptographic proof that it has the requested content. Of course this is not at all specific to Software Heritage. Note that Software Heritage is providing a content-based store: it is not returning random web addresses to GitHub or other locations, it is only returning the address of Software Heritage's own archival-quality copy. Time will tell of course, but Software Heritage is trying to be a permanent archive, like Zenodo, so I think it is perhaps one of the more robust stores we have.
- Sorry, I'm not sure what you mean by software hashes. Yeah, Software Heritage understands git-sha1 as well, but we aren't using those here. The functions in this PR work only with sha256 hashes of the content itself. They don't know anything about whether that content is 'software' or comes from git or anything else; it's just the sha-256 hash of the data file. I know they have "Software" in the name, but as you know, it's super common to put scientific data on GitHub (e.g. here's the Johns Hopkins team's COVID-19 data, as daily csv files on GitHub: https://github.com/CSSEGISandData/COVID-19). Of course I could submit the GitHub download URLs for that page to hash-archive.org, but that would be slow and wouldn't create an archival copy. Instead, I can submit that repo URL to store_swh() and SWH will create a permanent snapshot archive from which I can request any of those files by their (stand-alone) sha256 hash (a rough sketch follows below).
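A rough sketch of that workflow (hypothetical usage only; the store_swh() and sources_swh() signatures are the ones described in this PR, and the sha-256 placeholder stands in for the hash of one of the daily csv files):

# ask Software Heritage to snapshot-archive the whole repository
store_swh("https://github.com/CSSEGISandData/COVID-19")
# once the snapshot exists, any individual file can be looked up (and later
# resolved) by its stand-alone sha-256 hash
sources_swh("hash://sha256/<sha256-of-one-of-the-daily-csv-files>")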
Thanks for the reply. I think we are in agreement, and might be talking about different things. Perhaps this comes back to the discussion around provenance stores (aka registries) and content stores.
It asserts the requested hash can be found in the Software Heritage Archive. The assertion is being made by Software Heritage, since it is coming from their API and pointing to their content (technically it's also being made by contenturi's parsing of an assertion being made by SWH).
I think the key here is that the local registry does not explicitly keep track of who made the assertion. The connection between the SWH API reply and the statement extracted from that reply appears to be lost. So, yes, I can independently verify that:
$ curl --silent "https://archive.softwareheritage.org/api/1/content/sha256:9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37/raw/" | sha256sum
9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37 -
However, I can't see where that claim came from originally (was it derived from a local process that retrieved the content and calculated the hash? or was the location-hash claim derived from some response JSON retrieved from a web API?). I think the provenance of the claim is important because I imagine that data publications would include a content store and a provenance store (aka registry) to help trace the origin of content. In offline scenarios, you can't independently verify that hashes were sourced from specific locations, so you'd have to rely on the provenance source to refer to the context that the content exists in. I think this context is important because it helps to relate (and cite!) content to its natural habitat and the humans that take care of it.
And related, 2. (using sources in an offline scenario): I agree that the current lookup table provides a way to look up locations that served some hash at some point in time. Perhaps I should rephrase the question: how do you attribute the sources of specific content, and the process that related the sources to the content, in an offline scenario? The provenance (aka registry) store and content store that keep the claims and content, respectively, made some effort to keep the content around and . . . should be attributed for that somehow. I think this would also help establish a trust relationship and make informed decisions about which sources to cite. For instance, if GBIF is known to be a registry of dataset locations that produce changing content (by design), I'd rather choose locations that are associated with data networks that have (measurably) been shown to produce stable content over time, and I'd have to have some sort of way to group sources by the network (or project) they were registered in.
About the "software" hashes: I agree that SWH archives anything in version-controlled repos, and these repositories could include both software and data, as your example shows. Perhaps I should rephrase my comment into a question: can we use R packages archived by SWH as a way to establish some sort of decentralized, hash-versioned CRAN?
Thanks for being patient with me as I am trying to figure out the minimal information needed to keep track of the origin of content.
Thanks for the replies!
Re "software" hashes. Yes, I think we could -- a 'decentralized, hash-versioned CRAN' is a very appealing concept, but it would take some thinking through, at least for me. I.e. I think the natural thing to hash would be the tar.gz package that gets installed, and I don't think that's a hash that is available via SWH. SWH gives us all the individual file hashes, but the tarball doesn't exist as a blob in its content store...
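For example, a quick way to see which hash a hash-versioned CRAN would actually need (a sketch only: it downloads a source tarball from CRAN with base R's download.packages() and hashes it with the digest package; the point is that this tarball hash is not among the per-file hashes SWH records):

# the hash of interest is that of the source tarball itself
info    <- download.packages("tibble", destdir = tempdir(),
                             repos = "https://cloud.r-project.org")
tarball <- info[1, 2]   # path of the downloaded .tar.gz
paste0("hash://sha256/", digest::digest(tarball, algo = "sha256", file = TRUE))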
Re citation and provenance, yes, this is a very interesting discussion; it relates to our ongoing discussion about provenance in #5 and elsewhere. I think that discussion is probably more general than the SWH issue anyway. It might be helpful for me to better understand the use cases enabled by, say, knowing that the assertion that https://archive.softwareheritage.org/api/1/content/sha256:9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37/raw/ serves content with sha256sum 9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37 was made by Software Heritage software rather than by contenturi (running a specific version of libcurl on a particular architecture, etc).
I'm not trying to be dismissive of concerns of attribution, provenance, and citation. These are very important issues; I am just afraid of failing to do them proper justice. I am concerned that none of what I could programmatically capture from this request for said sha hash would answer that question. The data itself has a DOI: https://data.ess-dive.lbl.gov/view/doi:10.3334/CDIAC/ATG.009, though that sits behind a website that requires logins for downloads, so most sources, including NASA's and NOAA's own public webpages, usually point to CDIAC's FTP server instead, https://cdiac.ess-dive.lbl.gov/ftp/trends/co2/vostok.icecore.co2, though the content can be found in other places (now including Software Heritage), some of which are in the hash-archive.org record. Some citation metadata is embedded in the document itself, though the official records (i.e. the LBL DOI location and the previous landing page: https://cdiac.ess-dive.lbl.gov/trends/co2/ice_core_co2.html) request a citation to:
Petit J.R., Jouzel J., Raynaud D., Barkov N.I., Barnola J.M., et al. 1999, Climate and Atmospheric History of the Past 420,000 years from the Vostok Ice Core, Antarctica, Nature 399: 429-436.
This is where things get really fun, because (a) we can see that citation is from 1999, and from the FTP site we can find vostok.icecore_old.co2, which is also dated 1999, which is different from the content above, which refers to an updated version from 2003. That Nature paper from 1999 of course has a DOI, https://doi.org/10.1038/20859, which, wait for it, is throwing a 404 error at the moment (nor does a Google search currently turn up this paper on Nature's own site, though it will find many other copies of the Nature paper).
So apologies for the long tangent, but Petit et al., and the earlier Nature paper that provided the original first half of the data, https://doi.org/10.1038/329408a0, are perhaps the canonical citations that have most of the critical metadata for interpreting where this data really came from, how it was measured, how much you should trust it, etc. A researcher could probably glean this information from the embedded header in the file itself and a little googling, but I am not sure how much tracking down the provenance trail of the SWH copy would help with that.
inst/examples/dataone.R
Outdated
library(tibble)
library(contenturi)

resp <- httr::GET("https://cn.dataone.org/cn/v2/query/solr/?q=datasource:*KNB&fl=identifier,checksum,checksumAlgorithm,replicaMN&wt=json")
Cool to see DataONE make an entry into the package. Did you intend to include this in the pull request?
Right, this belongs on the dataone branch; that was just me being sloppy with branches. It's just an example for proof-of-principle / exploration. Here's what I've learned so far; would love to compare notes!
- DataONE will already return hashes for all of its objects, but the different repos use different algorithms. MD5 is the most common.
- the member node download URLs aren't always stable; some are more stable than others. The most recent ones are usually (but not always -- the central node may have to notify the member nodes if they move content and don't update the list) returned by that Solr query.
- Some of the DataONE objects are quite large. There are about 2.5 million objects listed from the above Solr query; ~4 are over a TB each, and the list there totals about 61 TB. That's because it doesn't include the largest objects on DataONE, which have metadata files on record but no download URLs; I believe the total size of DataONE is over 1 PB. Even hashing the 61 TB independently will probably take some time.
Anyway, would love to brainstorm more about this, including whether there's a way to make use of the pre-computed hashes that aren't sha256 -- we can move that into a separate issue thread.
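For reference, a sketch of how that Solr response can be flattened into a table (this assumes the standard Solr JSON layout, response$response$docs, and uses a trimmed-down version of the query from the example file; it is not the code in the PR):

library(httr)
library(jsonlite)
library(tibble)
resp <- httr::GET(paste0(
  "https://cn.dataone.org/cn/v2/query/solr/",
  "?q=datasource:*KNB&fl=identifier,checksum,checksumAlgorithm&wt=json&rows=10"))
# the Solr 'docs' array simplifies to a data.frame with one row per object
docs <- jsonlite::fromJSON(httr::content(resp, "text"))$response$docs
tibble::as_tibble(docs)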
Thanks for sharing your insights on DataONE - especially the wide range of object sizes. Having a size estimate like this helps to have more detailed conversations about the data network, especially around the resources (e.g., money, time, bandwidth, storage) needed to find, access, move, and archive the associated content.
For the md5 (or other) hashes, I can imagine using hash://md5/1234... notation. Happy to contribute to the brainstorm, especially if there's a clear use case around some content registered and accessed through the DataONE network.
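To make that concrete, a small hypothetical helper (the function name and example checksum are made up for illustration) that turns DataONE's pre-computed checksums into hash URIs:

# map a (checksum, algorithm) pair from the Solr results to a hash URI
as_hash_uri <- function(checksum, algorithm) {
  prefix <- switch(tolower(gsub("-", "", algorithm)),
                   md5    = "hash://md5/",
                   sha1   = "hash://sha1/",
                   sha256 = "hash://sha256/",
                   stop("unsupported checksum algorithm: ", algorithm))
  paste0(prefix, tolower(checksum))
}
as_hash_uri("9e107d9d372bb6826bd81d3542a419d6", "MD5")
# "hash://md5/9e107d9d372bb6826bd81d3542a419d6"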
inst/examples/dataone.R
Outdated
otherwise = as.character(NA))

register_remote <- purrr::possibly(function(x) contenturi::register(x, registries = "https://hash-archive.org"),
Neat to see the re- and cross-registration of existing DataONE-registered content in local and remote registries.
Merging just so I can reduce my branches here, since I think there are issues still to discuss about provenance generally, but they aren't specific to this PR.
This is an initial pass at adding support for Software Heritage API, #19
This provides sources_swh(), retrieve_swh(), store_swh(), and history_swh().

sources_swh() works like any sources() generic in that it takes a content id and returns the Software Heritage location and date of the content (note that because the SWH store is archival and content-based, we can be quite confident we get the content requested, as opposed to, say, getting back whatever URL location hash-archive.org happens to list). retrieve_swh() provides the connection URL to the bits and bytes requested.

store_swh() and history_swh() work slightly differently than the standard store() and history(), in that they don't take arbitrary URLs. Instead, the URL must point to the origin of a git, hg (Mercurial), or svn repo. (GitHub & GitLab repos are automatically archived when requested; others go into a manual queue.) store_swh() will trigger SWH to snapshot-archive that whole repo, computing the sha-256 hashes we can later use to query that content. history_swh() gives the history of requested archive visits. There's no register_swh(), as you cannot register a hash with SWH without also storing the content in their permanent archive.

The SWH API is much richer than this set of functions, but these are the ones that fit most closely to the existing verbs. While this could be added as an optional extension, I think the SWH functions are particularly compelling as they give contenturi users a possible archival storage option out-of-the-box.
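To summarize, a hypothetical end-to-end session with these verbs (the origin URL is a placeholder, the hash is the example used earlier in this thread, and the signatures are assumed from the description above):

origin <- "https://github.com/<some-user>/<some-repo>"   # placeholder git origin
store_swh(origin)      # ask SWH to snapshot-archive the whole repository
history_swh(origin)    # history of SWH archive visits for this origin
id <- "hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37"
sources_swh(id)        # table of SWH location(s) and date(s) for this content
retrieve_swh(id)       # URL serving the raw bytes of this content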