2024.09.17 F2F Editors Meeting

Rosalyn Metz edited this page Sep 20, 2024 · 92 revisions

Meeting Planning

Attendees

  • Julian Morley (Stanford)
  • Rosy Metz (Emory)
  • Simeon Warner (Cornell)
  • Andrew Woods (Harvard)
  • Neil Jefferies (Oxford)

Agreed Upon Travel Schedule:

  • Arrival in Ghent no later than Sunday Sept. 15th
  • Do the workshop on Monday Sept. 16th
  • Details TBD during the week, but we will have meetings Tuesday - Thursday
  • Full day meeting on Friday Sept. 20th that will end no earlier than Friday at 5pm local time (i.e., your flights need to take this into consideration)

Meeting Goals:

  • Review Use Cases
    • Triage/create tickets/topics into spec or implementation notes
    • Determine what tickets we can group together
    • Address any low hanging fruit tickets during the meeting
  • Identify a timeline for v2.0, considering:
    • what timeframe do we want for a release candidate to gather feedback?
    • how will we support updating tooling?
    • how many validators do we need?

Meeting Materials:

Agenda and Notes

Monday

Must be done by 4pm for the 4:30pm DPC Awards Neil is nominated for.

Time Topic
12pm - 2pm OCFL Workshop at iPRES
2:30pm - 3:30pm Create Schedule for the Week

Tuesday

Must be done by 3:30pm for a 4pm panel Neil is on.

Time Topic
9am - 10:30am Application Profiles: Use Case #50
11:00am - 12:30pm Coffee with Jürgen
12:30pm - 1:30pm Lunch at the Conference
1:30pm - 2:30pm Application Profiles: Use Case #50
2:30pm - 3:30pm File Deletion, Corruption, Loss: Use Cases #42 and #14.
6pm - onwards Conference Reception

Application profiles notes

Use case: https://github.com/OCFL/Use-Cases/issues/50

Editors agree that application profiling will be handled through an extension and not through changes to the specification.

Extension 0008: Schema Registry may provide inspiration for addressing this use case. The extension could be as simple as pointing to documentation, or can be more complex like Jürgen's self-documenting and machine-actionable objects. The minimal storage root extension might be a profile directory in the extension directory that includes human readable description and/or links to external documentation. Refinement might allow use at the object level and possible reference from an object to profiles at the storage root level. Will create a draft extension for community discussion.

File deletion and corruption notes

Handling file loss, file corruption, or version collapse changes the assumption of version immutability. This is necessarily a version 2 concept, so it can only apply in a version 2+ storage root. Even with a notionally immutable system, one can have corruption. Possible solutions without mutability would be to delete corrupted objects or just store a record of corrupted files outside of the system.

We agree on the tombstones idea described in the use case #42

Question from @Brian Wheeler https://github.com/OCFL/Use-Cases/issues/42#issuecomment-1949288221 : "If the file is gone then it would not appear in the manifest?". We agree that when a file is gone then the file would be shown in the tombstones block and not in the manifest block.

We will create a new version to record that a file has been deleted, vanished or corrupted. We will recommend that no other changes be made at the same time as the recording of deletion/corruption. The creation of a new version gives the chance to write a new version message with user/time/etc. and any other human readable information about why the change is occurring.

Why might a file be tombstoned?

  • missing/vanished, removed/deleted -- spec will not distinguish these cases: file does not appear in manifest, appears with original digest in tombstones
  • corrupted -- file appears with "new" digest in manifest, with original digest in tombstones
  • name in file system but unreadable or not reliably readable -- file appears with an empty digest string in manifest, with original digest in tombstones. The empty string is not a valid digest output in any digest format we know, and it is valid as a key in a JSON object whereas null etc. are not.

Use cases for corrupted and/or unreadable:

  • write once storage where we can't delete
  • corruption where we want to keep corrupted file for possible later analysis

We will add an extra parameter in ocfl_layout.json to flag use of mutability features such as tombstones with the implication that tooling MUST check latest inventory before trying to read any version.

Implementation notes must:

  • account for deduped files
  • talk about read errors and inconsistency
  • talk about corruption characteristics of different storage types
  • talk about need for documentation in new version
  • address the impact on other V2 features (packages and content-linking)
  • describe validation strategies

Example of file deletion (unchanged from 2023-09-23 comment)

{
  "digestAlgorithm": "sha512",
  "head": "v2",
  "id": "http://example.org/minimal_deletion",
  "manifest": { },
  "tombstones": {
    "7545b8...f67": [ "v1/content/file.txt" ]
  },
  "type": "https://ocfl.io/1.1/spec/#inventory",
  "versions": {
    "v1": {
      "created": "2020-10-12T01:00:00Z",
      "message": "One file",
      "state": {
        "7545b8...f67": [ "file.txt" ]
      },
      "user": {
        "address": "mailto:[email protected]",
        "name": "Alice"
      }
    },
    "v2": {
      "created": "2021-01-01T02:00:00Z",
      "message": "The one file had to be deleted entirely for legal reasons",
      "state": { },
      "user": {
        "address": "mailto:[email protected]",
        "name": "Alice"
      }
    }
  }
}
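
The step of recording a deletion as a new version can be sketched in Python. This is only an illustration under stated assumptions, not agreed tooling: the function name `record_deletion` and its parameters are hypothetical, version names are assumed to be unpadded (`v1`, `v2`, ...), and the `tombstones` block follows the draft shape shown in the example above. A digest is only dropped from the new state when no other stored copy remains, which is one possible way to account for deduped files.

```python
import copy


def record_deletion(inventory, content_path, message, user, created):
    """Return a copy of the inventory with a new head version that
    tombstones content_path (hypothetical helper, draft v2 layout)."""
    inv = copy.deepcopy(inventory)

    # Find the digest under which the content path is recorded.
    digest = next(
        (d for d, paths in inv["manifest"].items() if content_path in paths),
        None,
    )
    if digest is None:
        raise ValueError(f"{content_path} is not in the manifest")

    # Move the path from the manifest block to the tombstones block.
    inv["manifest"][digest].remove(content_path)
    still_stored = bool(inv["manifest"][digest])
    if not still_stored:
        del inv["manifest"][digest]
    inv.setdefault("tombstones", {}).setdefault(digest, []).append(content_path)

    # The new version copies the head state and makes no other change.
    head = inv["head"]
    new_state = copy.deepcopy(inv["versions"][head]["state"])
    if not still_stored:
        new_state.pop(digest, None)  # no copy of these bytes is left

    new_head = f"v{int(head[1:]) + 1}"  # assumes unpadded version names
    inv["versions"][new_head] = {
        "created": created,
        "message": message,
        "state": new_state,
        "user": user,
    }
    inv["head"] = new_head
    return inv
```

Applied to the v1 state of the example above, this yields an empty manifest, one tombstone entry, and a v2 with an empty state.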

Example of marking file corruption (cannot be read, and readable but bad digest)

{
  "digestAlgorithm": "sha512",
  "head": "v2",
  "id": "http://example.org/minimal_deletion",
  "manifest": {
    "": [ "v1/content/file1.txt" ],
    "aaa143...79a": [ "v1/content/file2.txt" ]
  },
  "tombstones": {
    "7545b8...f67": [ "v1/content/file1.txt" ],
    "fe4512...e47": [ "v1/content/file2.txt" ]
  },
  "type": "https://ocfl.io/1.1/spec/#inventory",
  "versions": {
    "v1": {
      "created": "2023-10-02T12:00:00Z",
      "message": "Two files",
      "state": {
        "7545b8...f67": [ "file1.txt" ],
        "fe4512...e47": [ "file2.txt" ]
      },
      "user": {
        "address": "mailto:[email protected]",
        "name": "Alice"
      }
    },
    "v2": {
      "created": "2024-09-20T10:09:00Z",
      "message": "File 1 vanished or cannot be read reliably, exclude. File 2 is corrupted with a different checksum, change name in state",
      "state": {
        "aaa143...79a": [ "file2_corrupted.txt" ]
      },
      "user": {
        "address": "mailto:[email protected]",
        "name": "Alice"
      }
    }
  }
}
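
The three tombstone cases listed earlier (missing/deleted, corrupted, unreadable) can be told apart from the inventory alone. The following Python sketch shows one way a reader of the draft format might do that. The function name `classify_tombstones` and the returned labels are made up for illustration and are not part of any agreed specification text.

```python
def classify_tombstones(inventory):
    """Map each tombstoned content path to one of three illustrative
    labels, based on what the manifest now records for that path."""
    # Invert the manifest: content path -> digest currently recorded.
    current = {
        path: digest
        for digest, paths in inventory.get("manifest", {}).items()
        for path in paths
    }

    result = {}
    for original_digest, paths in inventory.get("tombstones", {}).items():
        for path in paths:
            if path not in current:
                # Spec will not distinguish vanished from deleted.
                result[path] = "missing-or-deleted"
            elif current[path] == "":
                # Empty digest string: name exists but cannot be read.
                result[path] = "unreadable"
            elif current[path] == original_digest:
                raise ValueError(f"{path} is tombstoned but unchanged")
            else:
                # Readable, but the bytes now hash to a "new" digest.
                result[path] = "corrupted"
    return result
```

For the corruption example above, `v1/content/file1.txt` would be labelled unreadable and `v1/content/file2.txt` corrupted.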

Wednesday

Time Topic
9am - 10:30am Keynote for iPRES
11am - 12:30pm Conference Presentations
12:30pm - 1:30pm Lunch at the Conference
1:30pm - 5pm Package Per Version: Use Case #33
5:30pm - 6:30pm Boat Trip to Dinner
6:30pm - onward Conference Dinner

Package Per Version Notes

See Thursday's notes.

Thursday

Must be done by 2:30pm for Neil's BOF from 3pm - 4pm.

Time Topic
9am - 10:30am Package Per Version: Use Case #33
11am - 12:30pm Conference Presentations
12:30pm - 1:30pm Conference Lunch
1:30pm - 3:30pm Conference Presentations

Package Per Version Notes

These notes reference the Package Per Version Use Case, which is Use Case #33. It potentially addresses the issue of lots of small files as well as splitting large files.

Package characteristics

  • If a version is packaged, then the ENTIRE contents (inventory, sidecar and content) of the version directory are stored in the package.
    • Depending on the packaging format the package may comprise one or more files
    • Packages are stored in the version directory. If the packages were stored in the root directory, then the user might end up with many package files in the root directory.
      • example: Stanford has files with 100+ packages per version, and cases where there is 50TB data in 10,000 5GB chunks
    • The version directory for a packaged version MUST contain no other files than the package file(s)
  • Users MUST choose one package type per version.
  • The package file(s) are enumerated in packages.json stored in the object root. Therefore there is no particular naming convention required for packages; implementation notes will recommend a simple and systematic approach.

Questions

  • Do we include the version directory within the packages?
    • Users cannot expand the package in place as it writes another version directory.
      • i.e., you end up with v1 > v1 > all the content files, the inventory.json file, and its sidecar file.
    • If you're recovering to a different location, and all the packages are in the same folder then you can expand all the package files and end up with the complete object.
  • Do we include a flag at the storage root level to indicate the use of packages throughout the storage root?
    • This might allow for better tooling.

packages.json file

  • Manage version packages in a packages.json file maintained at the object root.
    • Should not have any significant scaling issues
    • inventory.json does not change format
  • Separation of concerns between packaging and object versioning
    • Repackaging does not constitute a versioning event (implementation notes will discuss in further detail)
  • A given package contains all the files of a specific version so the package digest can be validated in lieu of the files in that version.
  • All paths in the packages.json file are relative to the object root
  • There is a metadata block per version that provides information about the package with an array of key/values.
    • The key values in the "metadata" block must include "format" and "formatVersion" (avoiding "type" and "version" because of namespace collisions)
    • An optional key is "extension" that points to an extension in the object’s extension folder that allows an organization to store other information about the package.

Implementation Notes

  • Outline the best method for rewriting packages...
    • rewrite the packages,
    • rewrite the packages.json
    • and update the sidecar file.
    • We aren’t going to version the packages.json file.

packages.json example

This strategy replicates the manifest block of the inventory.json file, i.e. "digest": [ "filename" ], and then just specifies the order in the packages list for each version:

{
  "digestAlgorithm": "sha512",
  "type": "https://ocfl.io/1.1/spec/#packages",
  "manifest": {
    "abc..123": [ "v1/v1.zip" ],
    "cde..123": [ "v3/v3.z01" ],
    "ade..789": [ "v3/v3.z02" ],
    "ces..229": [ "v3/v3.zip" ]
  },
  "versions": {
    "v1": {
      "metadata": {
        "format": "zip",
        "formatVersion": "6.3.10",
        "extension": "[extension-name-ref]"
      },
      "packages": [ "v1/v1.zip" ]
    },
    "v3": {
      "metadata": {
        "format": "zip",
        "formatVersion": "6.3.10",
        "extension": "[extension-name-ref]"
      },
      "packages": [ "v3/v3.z01", "v3/v3.z02", "v3/v3.zip" ]
    }
  }
}

  • "manifest" - lists all package files for all versions, this is done in an array to match the inventory.json file.
  • "format" - lists a handful of defined package types, and also links to a controlled vocabulary extension similar to the digest algorithm extension which is optional.
  • "formatVersion" - the precise meaning of version may be dependent on the format used to package up the content.
  • "extension" - an optional extension used to include more information about the package files, the extension must be a local extension in the object, the additional information goes in extension directory
  • "packages" - the list of packages in the version in the order in which they should be unpacked.
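
A tool could check the internal consistency of a packages.json file before touching any package bytes. The Python sketch below covers only the rules stated in these notes (required metadata keys, every package listed in the manifest, packages stored in their own version directory). The function name `check_packages` and the wording of the problem messages are hypothetical, and digest verification of the package files themselves is left out.

```python
def check_packages(packages):
    """Return a list of problems found in a parsed packages.json
    (illustrative checks only, based on the draft described here)."""
    problems = []

    # Every path that has a digest in the manifest block.
    listed = {
        path
        for paths in packages.get("manifest", {}).values()
        for path in paths
    }

    used = set()
    for version, entry in packages.get("versions", {}).items():
        metadata = entry.get("metadata", {})
        for key in ("format", "formatVersion"):
            if key not in metadata:
                problems.append(f"{version}: metadata is missing '{key}'")

        for path in entry.get("packages", []):
            used.add(path)
            if path not in listed:
                problems.append(f"{version}: {path} has no manifest digest")
            # Packages live in the version directory they belong to.
            if not path.startswith(version + "/"):
                problems.append(f"{version}: {path} is outside {version}/")

    for path in sorted(listed - used):
        problems.append(f"{path} is in the manifest but in no package list")
    return problems
```

An empty list means the file is self-consistent. It says nothing about whether the packages on disk match their digests.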

The inventory.json file remains unchanged. The files corresponding to the above packages.json example would appear on disk:

[object root]
├── 0=ocfl_object_1.1
├── inventory.json
├── inventory.json.sha512
├── packages.json
├── packages.json.sha512
├── v1
│    └── v1.zip
├── v2
│    ├── inventory.json
│    ├── inventory.json.sha512
│    └── new-file.txt
└── v3
     ├── v3.zip 
     ├── v3.z02
     └── v3.z01

Friday

Must be done by 5pm for Neil's meeting at 6pm.

Time Topic
9am - 11am Object Forking: Use Case #44
12pm - 2pm Updating Notes for Use Cases
3pm - 3:45pm Updating Notes for Use Cases
3:45pm - 4pm Version Collapsing: Use Case #46

Object Forking Notes (File Inheritance)

These notes reference the Object Forking Use Case, which is Use Case #44. The use case is supported via content addressable storage. This introduces the concept of parent (the original object) and child (the object that is forked from the original object).

  • We support this by inserting one or more pointers to one or more files in one or more parent objects. This is placed in the manifest block of the inheriting child object.
  • The version state block lists the logical path as normal, allowing users to change the file name when inherited from a parent.
  • A child object can inherit arbitrary files from multiple parent objects; it's not limited to the set of files of a single version from a single parent object. This is an implementation detail.
  • However, an implementer may choose to limit this feature to all files in a specific version of a single parent object, if desired. This is also an implementation detail.
  • A child object can only inherit files from parent objects in the same storage root. OCFL has no mechanism for referencing files outside of the current storage root.
  • Inherited files cannot be included in the child's fixity block, and the verifier must look up the parent object.
  • The child object must use the same "digestAlgorithm" as all parent objects.
  • File inheritance MUST NOT inherit a file from a grandparent; i.e., the act of creating a file link involves verifying with the parent object that the file exists in that object, and is not itself a pointer to another object's file.
    • There is no benefit to inheriting a file from a grandparent; it only creates complexity and the specification aims for simplicity.
    • To prevent recursion loops, validators must only check to one level of recursion when validating any object.

When a parent object is deleted:

  • In a storage root that supports file inheritance a flag MUST be placed in the ocfl_layout.json file.
  • If you delete an object, you MUST check whether another object inherits files from that object. Implementation notes will address how to do this.
  • We will create an extension as part of version 2 allowing you to document the child objects that depend on files in the parent object. Verification of child objects whose parent has been deleted will fail with a descriptive error (parent object no longer exists).

When a referenced file is deleted in a parent object:

  • Tombstoning will be propagated via the verification process of the child object (i.e., the file has been deleted in parent object).
  • A soft delete or rename in the parent object does not impact the child object in any way, as the original bitstream remains on disk in the parent's content directory and referenced in the parent's inventory.
  • A child object is invalid if the current state block of a child object references a deleted file in a parent object.

Open questions:

  • Should the tombstones get placed in the inventory.json of the child object?
  • Or do the implementation notes address the use of tombstoning in a parent, as it may make a child object invalid?

When a file is corrupted in a parent object:

  • The verification process should flag it the same way as in the parent object (i.e., file is corrupted in parent object).

A full inventory.json example of file inheritance

{
  "digestAlgorithm": "sha512",
  "head": "v3",
  "id": "ark:/12345/bcd987",
  "manifest": {
    "4d27c8...b53": [ "v2/content/foo/bar.xml" ],
    "7dcc35...c31": [ { "objectid": "ark:/67890/fgh123" } ],
    "df83e1...a3e": [ { "objectid": "ark:/67890/fgh123" } ],
    "ffccf6...62e": [ { "objectid": "ark:/67890/fgh123" } ]
  },
  "type": "https://ocfl.io/1.1/spec/#inventory",
  "versions": {
    "v1": {
      "created": "2018-01-01T01:01:01Z",
      "message": "Initial import. bar.xml, bigdata.dat and image.tiff are inherited from a parent object.",
      "state": {
        "7dcc35...c31": [ "foo/bar.xml" ],
        "df83e1...a3e": [ "bigdata.dat" ],
        "ffccf6...62e": [ "image.tiff" ]
      },
      "user": {
        "address": "mailto:[email protected]",
        "name": "Alice"
      }
    },
    "v2": {
      "created": "2018-02-02T02:02:02Z",
      "message": "Fix bar.xml replacing import with a local edit, remove image.tiff",
      "state": {
        "4d27c8...b53": [ "foo/bar.xml" ],
        "df83e1...a3e": [ "bigdata.dat" ]
      },
      "user": {
        "address": "mailto:[email protected]",
        "name": "Bob"
      }
    },
    "v3": {
      "created": "2018-03-03T03:03:03Z",
      "message": "Reinstate image.tiff",
      "state": {
        "4d27c8...b53": [ "foo/bar.xml" ],
        "df83e1...a3e": [ "bigdata.dat" ],
        "ffccf6...62e": [ "image.tiff" ]
      },
      "user": {
        "address": "mailto:[email protected]",
        "name": "Cecilia"
      }
    }
  }
}
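
The one-level lookup that a verifier performs for inherited files might look like the following Python sketch. It assumes the draft manifest shape shown above, where an inherited entry is an object with an `objectid` key, and it takes the parent inventories as an already-loaded dictionary. The function name `check_inherited_files`, the `parents` argument and the error strings are all hypothetical. For simplicity it checks every pointer in the manifest rather than only those referenced by the current state.

```python
def check_inherited_files(child, parents):
    """Check each inherited manifest entry of a child inventory against
    its parent. parents maps object id -> parsed parent inventory."""
    errors = []
    for digest, entries in child["manifest"].items():
        for entry in entries:
            if not isinstance(entry, dict):
                continue  # ordinary content path stored in the child

            parent_id = entry["objectid"]
            parent = parents.get(parent_id)
            if parent is None:
                errors.append(
                    f"{digest}: parent object {parent_id} no longer exists"
                )
            elif parent["digestAlgorithm"] != child["digestAlgorithm"]:
                errors.append(
                    f"{digest}: digestAlgorithm differs from parent {parent_id}"
                )
            elif any(isinstance(e, str)
                     for e in parent["manifest"].get(digest, [])):
                continue  # the parent holds the bytes itself: valid link
            elif digest in parent.get("tombstones", {}):
                errors.append(
                    f"{digest}: file is tombstoned in parent {parent_id}"
                )
            elif parent["manifest"].get(digest):
                # Parent entry is itself a pointer; stop at one level.
                errors.append(
                    f"{digest}: parent {parent_id} only points to a grandparent"
                )
            else:
                errors.append(f"{digest}: not found in parent {parent_id}")
    return errors
```

Because the check never follows a pointer found in the parent, it cannot loop, which matches the one-level recursion limit described for validators.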

Collapsing OCFL Object Versions

https://github.com/OCFL/Use-Cases/issues/46

Two distinct scenarios have different solutions:

  1. The desire to collapse versions and delete intermediate file revisions where there is no need or desire to keep the details of the historical changes because they are curatorially insignificant. If versions have been created rather than using an approach such as mutable head (see https://ocfl.github.io/extensions/0005-mutable-head.html) then such changes unavoidably mutate the object. We think this is best handled by rewriting the object with selected versions removed, taking care to keep all necessary/interesting changes. This approach needs no new specification support but additional implementation notes.

  2. The desire to delete intermediate files (perhaps to save storage) but to retain the history of versions. This is handled by the Support Physical File-Deletion use case.
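
For the first scenario, the planning part of rewriting an object with selected versions removed could be sketched as below. This is a rough illustration only. The function name `plan_collapse` is hypothetical, version names are assumed to be unpadded, and the head version is assumed to always be kept. The sketch renumbers the kept versions and reports which digests are no longer referenced. It does not relocate content files or rewrite manifest paths, which a real rewrite would also need to do.

```python
def plan_collapse(inventory, keep):
    """Return (renumbered_versions, unreferenced_digests) for a rewrite
    that keeps only the version names in keep (illustrative only)."""
    # Order existing versions numerically; assumes unpadded names.
    ordered = sorted(inventory["versions"], key=lambda v: int(v[1:]))
    kept = [v for v in ordered if v in keep]
    if not kept or kept[-1] != inventory["head"]:
        raise ValueError("the head version must be kept")

    # Renumber the surviving versions consecutively from v1.
    renumbered = {
        f"v{i}": inventory["versions"][old]
        for i, old in enumerate(kept, start=1)
    }

    # Digests no surviving state refers to are candidates for removal.
    needed = {d for version in renumbered.values() for d in version["state"]}
    unreferenced = sorted(d for d in inventory["manifest"] if d not in needed)
    return renumbered, unreferenced
```

The list of unreferenced digests is the place to apply curatorial judgement before anything is actually removed.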
