Preserve original file name? #93
Comments
I'd say that keeping track of the provenance of data (including its original filename and the method used to retrieve the content) is key to preserving the content, especially when the provenance logs are treated as content themselves. To me, the question "Give me content with content id hash://sha256/abc...?" goes hand in hand with the question "What is the origin of content with id hash://sha256/abc...?"
Thanks @joelnitta, very sympathetic to this. One wrinkle to keep in mind is that the same content can often have multiple filenames -- e.g. I often see this in cases where flatfile snapshots are produced periodically with timestamps in the filename. For any given snapshot, sometimes the data is unchanged and thus the hash is unchanged. I second @jhpoelen that this is really an issue of broader metadata management -- e.g. file size, created/modified timestamps, file content type, etc., are also often essential -- so it is unsurprising that POSIX filesystems, web headers, object stores, and common metadata formats all frequently try to capture this information too. My current preferred solution is to write JSON metadata files, like https://github.com/ropensci/taxadb/blob/master/inst/extdata/schema.json or https://github.com/ropensci/rfishbase/blob/master/inst/prov/fb.prov, that record the association of a filename and a hash (along with any other metadata you might want to add) using the schema.org spec. Given that there are so many standardized metadata formats for this with software ecosystems built around them, I'm reluctant to invent a new one in
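A minimal sketch of the kind of JSON metadata record described in this comment, written in Python rather than R purely for illustration. It is not the contentid API and does not reproduce the linked schema.json or fb.prov files; the function names `content_id` and `describe`, and the choice of schema.org properties (`name`, `identifier`, `contentSize` on a `DataDownload`), are assumptions.

```python
import hashlib
import json
import os


def content_id(path, chunk_size=1 << 20):
    """Return a hash://sha256/... identifier for the bytes stored at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large snapshots do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return "hash://sha256/" + digest.hexdigest()


def describe(path):
    """Build a schema.org-style record linking a file name to its content id.

    The property selection here is illustrative, not a fixed convention.
    """
    return {
        "@context": "https://schema.org/",
        "@type": "DataDownload",
        "name": os.path.basename(path),
        "identifier": content_id(path),
        "contentSize": str(os.path.getsize(path)),
    }


def write_metadata(path, metadata_path):
    """Write the record for `path` to `metadata_path` as pretty-printed JSON."""
    with open(metadata_path, "w", encoding="utf-8") as out:
        json.dump(describe(path), out, indent=2)
```

Because the record is keyed by the hash rather than the name, two snapshots with different timestamped filenames but identical bytes produce two records that share one `identifier`.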
I fully agree that 1) we need this metadata to interpret things, and 2) we should reuse and leverage existing metadata schemes for it. When objects are registered with DataONE, we support many metadata dialects. We've discussed this before in other contentid issues, and I wrote up a summary of areas for improvement in contentid in the context of an intro tutorial to the concepts for researchers. One I proposed is supporting less opaque metadata about objects to make them easier to use, such as names and fileNames. I also show an example of how to generate a citation for a contentid object that is stored in DataONE. I'd love to work out standard approaches to that metadata access across systems.
Note that Preston uses the Provenance Ontology (PROV) and the Provenance, Authoring and Versioning (PAV) ontology to (automatically) keep track of the content origin. And the methods used to link provenance to its content using . . . drumroll . . . content ids . . . can use any kind of metadata format as long as it is digital. The general concept is described in the pre-print Elliott, M. J., Poelen, J. H., & Fortes, J. (2022, August 29). Signed Citations: Making Persistent and Verifiable Citations of Digital Scientific Content. https://doi.org/10.31222/osf.io/wycjn and in the attached uncorrected Scientific Data paper proof (proof_41597_2023_2230_OnlinePDF.pdf). I see data and metadata as one thing, and realize that they are linked, and should be linked in a verifiable way. Curious to see how this scheme would integrate with the scheme you are proposing.
Coming back to this... sounds like a great paper @jhpoelen, I'll take a look. I think the major challenge with file naming is that often there is more than one that is legit. Within DataONE, each contentid can be associated with more than one
classDiagram
Dataset "*" --o "0..*" DataObject
DataObject "*" --> "1" Sha256ContentId : has
class DataObject {
+String PID
+String fileName
+String filePath
}
DataObject "*" --> "1" Sha512ContentId : has
DataObject "*" --> "1" MD5ContentId : has
Sha256ContentId --|> ContentId : is a
Sha512ContentId --|> ContentId : is a
MD5ContentId --|> ContentId : is a
where the
I wrote up a reproducible data access tutorial on this stuff for a course we taught in 2021, and included in it an approach that I could see being fruitful: providing metadata descriptions based on contentid values. The example I give in the tutorial is being able to generate the citation for a dataset (e.g., for credit) for a specific contentid that was referenced in a script.
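The one-to-many relationship in the class diagram of this comment can be sketched as a small in-memory lookup. This is an illustration only, not DataONE's data model or API: the `Registry` class, its methods, and the snake_case field names are hypothetical, and the content ids are treated as opaque strings.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DataObject:
    """Mirror of the DataObject box in the diagram (PID, fileName, filePath)."""
    pid: str
    file_name: str
    file_path: str
    content_ids: tuple  # one id per algorithm, e.g. sha256, sha512, md5


class Registry:
    """Hypothetical index from a content id to every object that carries it."""

    def __init__(self):
        self._by_content_id = defaultdict(list)

    def register(self, obj):
        # An object is reachable through each of its content ids.
        for cid in obj.content_ids:
            self._by_content_id[cid].append(obj)

    def file_names(self, content_id):
        """Return all distinct file names recorded for `content_id`, sorted."""
        objects = self._by_content_id.get(content_id, [])
        return sorted({obj.file_name for obj in objects})
```

A lookup by content id therefore returns a list of names rather than a single one, which is the reason a "preserve the original filename" option has to decide which of several legitimate names to use.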
Would it be possible to add an option to resolve() to preserve the original filename? Sometimes (for better or worse) there is useful metadata in the filename that one may want. My specific use-case is verifying the date of files from https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/, which have the date as part of the name.
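A rough sketch of what this request amounts to, in Python and independent of contentid: once a content id has been resolved to a local, hash-named file, copy it under a separately recorded original name and read the date back out of that name. The helpers `restore_name` and `snapshot_date` are hypothetical, and the `YYYY-MM-DD` pattern (as in an archive name like `taxdmp_2022-08-01.zip`) is an assumption about how the date is embedded.

```python
import re
import shutil
from datetime import date
from pathlib import Path

# Assumed date layout inside the file name: four-digit year, month, day.
_DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def restore_name(resolved_path, original_name, dest_dir):
    """Copy a resolved, hash-named file into `dest_dir` under its recorded name."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Keep only the final path component so a recorded name cannot escape dest_dir.
    target = dest_dir / Path(original_name).name
    shutil.copyfile(resolved_path, target)
    return target


def snapshot_date(file_name):
    """Return the date embedded in `file_name`, or None if no date is present.

    Raises ValueError if the matched digits do not form a valid calendar date.
    """
    match = _DATE.search(file_name)
    if match is None:
        return None
    year, month, day = map(int, match.groups())
    return date(year, month, day)
```

The original name has to come from somewhere outside the content itself, for example a provenance log or metadata record, since the hash alone carries no filename.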