Preserve original file name? #93

Open
joelnitta opened this issue Jun 16, 2023 · 5 comments

@joelnitta

Would it be possible to add an option to resolve() to preserve the original filename? Sometimes (for better or worse) there is useful metadata in the filename that one may want. My specific use-case is verifying the date of files from https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/, which have the date as part of the name.
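A minimal sketch of this use case, for illustration only: it keeps the original filename next to the content hash and reads a date out of the name. This is not contentid's `resolve()` API. The helper names and the `taxdmp_YYYY-MM-DD.zip` filename pattern used in the example are assumptions.

```python
import hashlib
import re
from datetime import date
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse


def content_id(data: bytes) -> str:
    """Return a hash URI of the form hash://sha256/<hex digest>."""
    return "hash://sha256/" + hashlib.sha256(data).hexdigest()


def original_filename(url: str) -> str:
    """Take the last path segment of the URL as the original filename."""
    return PurePosixPath(urlparse(url).path).name


def date_from_filename(filename: str) -> Optional[date]:
    """Extract a YYYY-MM-DD date from the filename, if one is present."""
    match = re.search(r"(\d{4})-(\d{2})-(\d{2})", filename)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


# Hypothetical URL and placeholder bytes; a real run would download the file.
url = "https://example.org/taxdump_archive/taxdmp_2023-06-01.zip"
name = original_filename(url)
record = {
    "id": content_id(b"placeholder bytes"),
    "name": name,
    "date": date_from_filename(name),
}
print(record)
```

The point of the sketch is that the date lives only in the name, so it is lost unless the name is stored alongside the hash.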

@jhpoelen
Collaborator

I'd say that keeping track of the provenance of data (including their original filename and method used to retrieve the content) is key to helping to preserve the content, especially when the provenance logs are treated as content themselves. To me, the question "Give me content with content id hash://sha256/abc...?" goes hand in hand with the question "What is the origin of content with id hash://sha256/abc... ?"

@cboettig
Owner

Thanks @joelnitta, very sympathetic to this.

One wrinkle to keep in mind is that the same content can often have multiple filenames -- e.g. I often see this in cases where flatfile snapshots are produced periodically with timestamps in the filename. For any given snapshot, sometimes the data is unchanged and thus the hash is unchanged.

I second @jhpoelen that this is really an issue of broader metadata management -- e.g. file size, created/modified timestamps, file content type, etc, are also often essential -- so it is unsurprising that POSIX filesystems, web headers, object stores, and common metadata formats all frequently try to capture this information too.

My current preferred solution is to write json metadata files, like https://github.com/ropensci/taxadb/blob/master/inst/extdata/schema.json or https://github.com/ropensci/rfishbase/blob/master/inst/prov/fb.prov, that record the association of a filename and a hash (along with any other metadata you might want to add) using the schema.org spec. Given that there are so many standardized metadata formats for this with software ecosystems built around them, I'm reluctant to invent a new one in contentid that only we use. But also, I know that not all users will like my current preference of schema.org (probably including me n-years in the future or n-years in the past!).
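A rough sketch of what writing such a record could look like, assuming Python rather than R. The choice of `DataDownload` and of the `identifier`, `name`, `contentUrl`, and `contentSize` properties is an assumption made for illustration; it is not necessarily the layout used in the linked files, and `describe_file` is a hypothetical helper.

```python
import hashlib
import json


def describe_file(name: str, data: bytes, source_url: str) -> dict:
    """Build a schema.org-style record linking a filename to a content hash."""
    digest = hashlib.sha256(data).hexdigest()
    return {
        "@context": "https://schema.org/",
        "@type": "DataDownload",
        "identifier": "hash://sha256/" + digest,
        "name": name,                    # original filename, kept as metadata
        "contentUrl": source_url,        # where the bytes were retrieved from
        "contentSize": str(len(data)),   # size in bytes, stored as text
    }


# Hypothetical filename, bytes, and URL, for illustration only.
record = describe_file(
    "snapshot_2023-06-01.csv",
    b"id,name\n1,example\n",
    "https://example.org/snapshot_2023-06-01.csv",
)
print(json.dumps(record, indent=2))
```

Because the record is plain JSON, the same association could be re-expressed in another metadata standard later without touching the content or its hash.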

@mbjones

mbjones commented Jun 16, 2023

I fully agree that 1) we need this metadata to interpret things, and 2) we should reuse and leverage existing metadata schemes for it. When objects are registered with DataONE, we support many metadata dialects. We've discussed this before in other contentid issues, and I wrote up a summary of areas for improvement in contentid in the context of an intro tutorial to the concepts for researchers. One I proposed is supporting less opaque metadata about objects to make them easier to use, such as names and fileNames. I also show an example of how to generate a citation for a contentid object that is stored in DataONE. I'd love to work out standard approaches to that metadata access across systems.

@jhpoelen
Collaborator

jhpoelen commented Jun 16, 2023

Note that Preston uses the Provenance Ontology (PROV) and the Provenance, Authoring and Versioning (PAV) ontology to (automatically) keep track of the content origin. And the methods used to link provenance to its content using . . . drumroll . . . content ids . . . can use any kind of metadata format, as long as it is digital.

General concept described in pre-print

Elliott, M. J., Poelen, J. H., & Fortes, J. (2022, August 29). Signed Citations: Making Persistent and Verifiable Citations of Digital Scientific Content. https://doi.org/10.31222/osf.io/wycjn

and in the attached uncorrected Scientific Data paper proof.

I see data and metadata as one thing, and realize that they are linked, and should be linked in a verifiable way.

proof_41597_2023_2230_OnlinePDF.pdf

Curious to see how this scheme would integrate with the scheme you are proposing.

@mbjones

mbjones commented Jun 20, 2023

Coming back to this... sounds like a great paper @jhpoelen, I'll take a look.

I think the major challenge with file naming is that often there is more than one that is legit. Within DataONE, each contentid can be associated with more than one schema:Dataset, sometimes from different people, and have different metadata in each. We frequently find the same csv file included with different file names in different datasets -- we can tell they are the same due to the hash match, but it might be named and arranged very differently by different people. A rough model of the relationships we frequently see is:

```mermaid
classDiagram
    Dataset "*" --o "0..*" DataObject
    DataObject "*" --> "1" Sha256ContentId : has
    class DataObject {
        +String PID
        +String fileName
        +String filePath
    }
    DataObject "*" --> "1" Sha512ContentId : has
    DataObject "*" --> "1" MD5ContentId : has
    Sha256ContentId --|> ContentId : is a
    Sha512ContentId --|> ContentId : is a
    MD5ContentId --|> ContentId : is a
```

where the PID is an authority-based identifier (such as a UUID or DOI) and the fileName and path are frequently specific to a particular dataset arrangement.
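To make the model above concrete, here is a small sketch in Python. It is not DataONE code: the class and function names are hypothetical, and only the SHA-256 content id is modelled for brevity. It shows two datasets holding the same bytes under different file names and groups those names by hash.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass(frozen=True)
class DataObject:
    pid: str        # authority-based identifier, e.g. a UUID or DOI
    file_name: str  # specific to one dataset arrangement
    file_path: str  # specific to one dataset arrangement
    sha256: str     # content id, derived from the bytes alone


@dataclass
class Dataset:
    title: str
    objects: List[DataObject] = field(default_factory=list)


def make_object(pid: str, file_name: str, file_path: str, data: bytes) -> DataObject:
    """Create a DataObject whose content id is computed from its bytes."""
    digest = hashlib.sha256(data).hexdigest()
    return DataObject(pid, file_name, file_path, "hash://sha256/" + digest)


def names_by_content(datasets: List[Dataset]) -> Dict[str, Set[str]]:
    """Group every file name seen across datasets under its content id."""
    names: Dict[str, Set[str]] = defaultdict(set)
    for dataset in datasets:
        for obj in dataset.objects:
            names[obj.sha256].add(obj.file_name)
    return dict(names)


# Same bytes, different names and paths in two hypothetical datasets.
csv_bytes = b"site,count\nA,3\n"
first = Dataset("Survey 2020", [make_object("pid-1", "counts.csv", "data/", csv_bytes)])
second = Dataset("Reanalysis", [make_object("pid-2", "survey_counts_v2.csv", "inputs/raw/", csv_bytes)])
print(names_by_content([first, second]))
```

A single content id maps to a set of names here, which is why returning one "original" file name for a hash is not always well defined.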

I wrote up a reproducible data access tutorial on this for a course we taught in 2021. It includes an approach that I think could be fruitful: providing metadata descriptions based on contentid values. The example in the tutorial generates the citation for a dataset (e.g., for credit) from a specific contentid that was referenced in a script.
