2024.09.17 F2F Editors Meeting

Rosalyn Metz edited this page Sep 20, 2024 · 92 revisions

  • Julian Morley (Stanford)
  • Rosy Metz (Emory)
  • Simeon Warner (Cornell)
  • Andrew Woods (Harvard)
  • Neil Jefferies (Oxford)

Agreed Upon Travel Schedule:

  • Arrival in Ghent no later than Sunday Sept. 15th
  • Do the workshop on Monday Sept. 16th
  • Details TBD during the week, but we will have meetings Tuesday - Thursday
  • Full day meeting on Friday Sept. 20th that will end no earlier than Friday at 5pm local time (i.e., your flights need to take this into consideration)

Meeting Goals:

  • Triage/create tickets/topics into spec or implementation notes
  • Determine what tickets we can group together
  • Address any low hanging fruit tickets during the meeting
  • Identify a timeline for v2.0; consider:
    • what we want for release candidate timeframe to gather feedback?
    • how will we support updating tooling?
    • how many validators do we need?

Meeting Materials:

Agenda

Monday

Must be done by 4pm for the 4:30pm DPC Awards Neil is nominated for.

Time Topic
12pm - 2pm OCFL Workshop at iPRES
2:30pm - 3:30pm Create Schedule for the Week

Tuesday

Must be done by 3:30pm for a 4pm panel Neil is on.

Time Topic
9am - 10:30am Application Profiles: Use Case #50
11:00am - 12:30pm Coffee with Jürgen
12:30pm - 1:30pm Lunch at the Conference
1:30pm - 2:30pm Application Profiles: Use Case #50
2:30pm - 3:30pm File Deletion, Corruption, Loss: Use Cases #42 and #14.
6pm - onwards Conference Reception

Application profiles notes

Use case: https://github.com/OCFL/Use-Cases/issues/50

Editors agree that application profiling will be handled through an extension and not through changes to the specification.

Extension 0008: Schema Registry may provide inspiration for addressing this use case. The extension could be as simple as pointing to documentation, or can be more complex like Jürgen's self-documenting and machine-actionable objects. The minimal storage root extension might be a profile directory in the extension directory that includes human readable description and/or links to external documentation. Refinement might allow use at the object level and possible reference from an object to profiles at the storage root level. Will create a draft extension for community discussion.

File deletion and corruption notes

Handling file loss, file corruption, or version collapse changes the assumption of version immutability. This is necessarily a version 2 concept, so it can only apply in a version 2+ storage root. Even in a notionally immutable system, corruption can occur. Possible solutions without mutability would be to delete corrupted objects or to store a record of corrupted files outside of the system.

We agree on the tombstones idea described in Use Case #42.

Question from @Brian Wheeler https://github.com/OCFL/Use-Cases/issues/42#issuecomment-1949288221 : "If the file is gone then it would not appear in the manifest?". We agree that when a file is gone then the file would be shown in the tombstones block and not in the manifest block.

We will create a new version to record that a file has been deleted, has vanished, or has been corrupted. We will recommend that no other changes be made at the same time as the recording of deletion/corruption. Creating a new version gives the chance to write a new version message with user/time/etc. and any other human-readable information about why the change is occurring.
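As a rough illustration of recording a deletion in a new version, the following Python sketch moves a content path from the manifest block to the tombstones block and appends a new head version. It covers only the missing/deleted case. The function name `record_tombstone` and its parameters are hypothetical, it assumes unpadded version names such as `v1`, and it is not part of any agreed specification text.

```python
import copy


def record_tombstone(inventory, content_path, message, user, created):
    """Return a copy of `inventory` with a new head version that tombstones `content_path`."""
    inv = copy.deepcopy(inventory)
    manifest = inv["manifest"]
    # Digest under which the lost content path is currently listed.
    digest = next(d for d, paths in manifest.items() if content_path in paths)

    # Move the content path from the manifest block to the tombstones block.
    manifest[digest].remove(content_path)
    if not manifest[digest]:
        del manifest[digest]
    inv.setdefault("tombstones", {}).setdefault(digest, []).append(content_path)

    # The new state is the old head state, minus any digest that no longer has
    # a surviving copy in the manifest (a deduplicated copy keeps it alive).
    old_head = inv["head"]
    new_head = f"v{int(old_head[1:]) + 1}"  # assumption: unpadded version names
    old_state = inv["versions"][old_head]["state"]
    inv["versions"][new_head] = {
        "created": created,
        "message": message,
        "state": {d: list(p) for d, p in old_state.items() if d in manifest},
        "user": user,
    }
    inv["head"] = new_head
    return inv
```

The new version carries its own message, user, and timestamp, and nothing else about the object changes in the same step.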

Why might a file be tombstoned?

  • missing/vanished, removed/deleted -- spec will not distinguish these cases: file does not appear in manifest, appears with original digest in tombstones
  • corrupted -- file appears with "new" digest in manifest, with original digest in tombstones
  • name in file system but unreadable or not reliably readable -- file appears with an empty digest string in manifest, with original digest in tombstones (an empty string is not a valid output of any digest algorithm we know of, and it is valid as a key in a JSON object, whereas null etc. are not)
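The three cases above can be told apart mechanically by comparing the manifest and tombstones blocks. The following Python sketch shows one way to do that. The function name `classify_tombstones` and the labels it returns are hypothetical and chosen only for illustration.

```python
def classify_tombstones(inventory):
    """Map each tombstoned content path to the reason implied by the manifest."""
    # Index the manifest by content path so each path's current digest can be looked up.
    current_digest = {}
    for digest, paths in inventory.get("manifest", {}).items():
        for path in paths:
            current_digest[path] = digest

    reasons = {}
    for original_digest, paths in inventory.get("tombstones", {}).items():
        for path in paths:
            if path not in current_digest:
                # Not in the manifest at all: missing/vanished or removed/deleted.
                reasons[path] = "missing-or-deleted"
            elif current_digest[path] == "":
                # Empty digest string: the name exists but cannot be read reliably.
                reasons[path] = "unreadable"
            elif current_digest[path] != original_digest:
                # Readable, but the digest differs from the original.
                reasons[path] = "corrupted"
            else:
                # Same digest in both blocks: not a case described in the notes.
                reasons[path] = "inconsistent"
    return reasons
```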

Use cases for keeping corrupted and/or unreadable files:

  • write once storage where we can't delete
  • corruption where we want to keep corrupted file for possible later analysis

We will add an extra parameter in ocfl_layout.json to flag use of mutability features such as tombstones with the implication that tooling MUST check latest inventory before trying to read any version.
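To illustrate why tooling must consult the latest inventory, the following Python sketch resolves a logical path in an older version against the head inventory and refuses to return a content path that has been tombstoned. The notes do not name the new parameter, so the `mutable` key used here is a placeholder assumption, as are the function and exception names.

```python
class TombstonedError(LookupError):
    """Raised when the requested file is recorded in the tombstones block."""


def resolve_content_path(layout, head_inventory, version, logical_path):
    """Return a content path for `logical_path` in `version`, using the head inventory."""
    state = head_inventory["versions"][version]["state"]
    digest = next((d for d, paths in state.items() if logical_path in paths), None)
    if digest is None:
        raise KeyError(f"{logical_path} is not in the state of {version}")

    # A surviving copy (for example a deduplicated one) is still in the manifest.
    surviving = head_inventory["manifest"].get(digest)
    if surviving:
        return surviving[0]

    # "mutable" is a hypothetical name for the flag in ocfl_layout.json.
    if layout.get("mutable") and digest in head_inventory.get("tombstones", {}):
        raise TombstonedError(f"{logical_path} in {version} has been tombstoned")
    raise ValueError(f"digest for {logical_path} is missing from the manifest")
```

A reader that relied only on the inventory stored inside an older version directory would not see the tombstone, which is why the check is made against the head inventory.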

Implementation notes must:

  • account for deduped files
  • talk about read errors and inconsistency
  • talk about corruption characteristics of different storage types
  • talk about need for documentation in new version
  • discuss the impact on other V2 features (packages and content-linking)
  • describe validation strategies

Example of file deletion (unchanged from 2023-09-23 comment)

{
  "digestAlgorithm": "sha512",
  "head": "v2",
  "id": "http://example.org/minimal_deletion",
  "manifest": { },
  "tombstones": {
    "7545b8...f67": [ "v1/content/file.txt" ]
  },
  "type": "https://ocfl.io/1.1/spec/#inventory",
  "versions": {
    "v1": {
      "created": "2020-10-12T01:00:00Z",
      "message": "One file",
      "state": {
        "7545b8...f67": [ "file.txt" ]
      },
      "user": {
        "address": "mailto:[email protected]",
        "name": "Alice"
      }
    },
    "v2": {
      "created": "2021-01-00T02:00:00Z",
      "message": "The one file had to be deleted entirely for legal reasons",
      "state": { },
      "user": {
        "address": "mailto:[email protected]",
        "name": "Alice"
      }
    }
  }
}

Example of marking file corruption (cannot be read, and readable but bad digest)

{
  "digestAlgorithm": "sha512",
  "head": "v2",
  "id": "http://example.org/minimal_deletion",
  "manifest": {
    "": [ "v1/content/file1.txt" ],
    "aaa143...79a": [ "v1/content/file2.txt" ]
  },
  "tombstones": {
    "7545b8...f67": [ "v1/content/file1.txt" ],
    "fe4512...e47": [ "v1/content/file2.txt" ]
  },
  "type": "https://ocfl.io/1.1/spec/#inventory",
  "versions": {
    "v1": {
      "created": "2023-10-02T12:00:00Z",
      "message": "Two files",
      "state": {
        "7545b8...f67": [ "file1.txt" ],
        "fe4512...e47": [ "file2.txt" ]
      },
      "user": {
        "address": "mailto:[email protected]",
        "name": "Alice"
      }
    },
    "v2": {
      "created": "2024-09-20T10:09:00Z",
      "message": "File 1 vanished or cannot be read reliably, exclude. File 2 is corrupted with a different checksum, change name in state",
      "state": {
        "aaa143...79a": [ "file2_corrupted.txt" ]
      },
      "user": {
        "address": "mailto:[email protected]",
        "name": "Alice"
      }
    }
  }
}

Wednesday

Time Topic
9am - 10:30am Keynote for iPRES
11am - 12:30pm Conference Presentations
12:30pm - 1:30pm Lunch at the Conference
1:30pm - 5pm Package Per Version: Use Case #33
5:30pm - 6:30pm Boat Trip to Dinner
6:30pm - onward Conference Dinner

Package Per Version Notes

See Thursday's notes.

Thursday

Must be done by 2:30pm for Neil's BOF from 3pm - 4pm.

Time Topic
9am - 10:30am Package Per Version: Use Case #33
11am - 12:30pm Conference Presentations
12:30pm - 1:30pm Conference Lunch
1:30pm - 3:30pm Conference Presentations

Package Per Version Notes

These notes reference the Package Per Version use case, which is Use Case #33.

Package characteristics

  • If a version is packaged, then the ENTIRE contents (inventory, sidecar and content) of the version directory are stored in the package.
    • Depending on the packaging format the package may comprise one or more files
    • Packages are stored in the version directory. If the packages were stored in the root directory, then the user might end up with many package files in the root directory.
      • example: Stanford has files with 100+ packages per version, and cases where there is 50TB data in 10,000 5GB chunks
    • The version directory for a packaged version MUST contain no other files than the package file(s)
  • Users MUST choose one package type per version.
  • The package file(s) are enumerated in packages.json, stored in the object root. Therefore no particular naming convention is required for packages; implementation notes will recommend a simple and systematic approach.

Questions

  • Do we include the version directory within the packages?
    • Users cannot expand the package in place as it writes another version directory.
      • i.e., you end up with v1 > v1 > all the content files, the inventory.json file, and its sidecar file.
    • If you're recovering to a different location, and all the packages are in the same folder then you can expand all the package files and end up with the complete object.
  • Do we include a flag at the storage root level to indicate the use of packages throughout the storage root?
    • This might allow for better tooling.

packages.json file

  • Manage version packages in a packages.json file maintained at the object root.
    • Should not have any significant scaling issues
    • inventory.json does not change format
  • Separation of concerns between packaging and object versioning
    • Repackaging does not constitute a versioning event (implementation notes will discuss in further detail)
  • A given package contains all the files of a specific version so the package digest can be validated in lieu of the files in that version.
  • All paths in the packages.json file are relative to the object root
  • There is a metadata block per version that provides information about the package as a set of key/value pairs.
    • The keys in the "metadata" block must include "format" and "version" (avoiding "type" because of namespace collision)
    • An optional key is "extension" that points to an extension in the object’s extension folder that allows an organization to store other information about the package.
    • See the information in The archiveInformation Block that the Library of Norway created in their Workgroup Notes: Proposal Medium Impact
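Because a package holds everything for its version, fixity for a packaged version can be checked against the package files alone. The following Python sketch shows one way to do that, assuming a parsed packages.json in the shape discussed in these notes. The function names `file_digest` and `failed_packages` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm):
    """Stream a file through hashlib so large packages are not read into memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def failed_packages(object_root, packages):
    """Return the package paths that are missing or whose digest does not match."""
    root = Path(object_root)
    algorithm = packages["digestAlgorithm"]
    failed = []
    for expected, paths in packages["manifest"].items():
        for rel in paths:
            # Paths in packages.json are relative to the object root.
            target = root / rel
            if not target.is_file() or file_digest(target, algorithm) != expected.lower():
                failed.append(rel)
    return sorted(failed)
```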

Implementation Notes

  • Outline the best method for rewriting packages...
    • rewrite the packages,
    • rewrite the packages.json
    • and update the sidecar file.
    • We aren’t going to version the packages.json file.
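The last two steps could look roughly like the following Python sketch, which rewrites packages.json and then its sidecar. The sidecar name `packages.json.<algorithm>` and its contents are assumptions modelled on the inventory sidecar, since the notes do not define them. The helper names are hypothetical.

```python
import hashlib
import json
import os
from pathlib import Path


def _replace_file(path, data):
    """Write to a temporary sibling first, then swap it into place."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_bytes(data)
    os.replace(tmp, path)


def write_packages_json(object_root, packages):
    """Rewrite packages.json and its sidecar, returning the new digest."""
    root = Path(object_root)
    algorithm = packages["digestAlgorithm"]
    data = json.dumps(packages, indent=2).encode("utf-8")
    digest = hashlib.new(algorithm, data).hexdigest()
    _replace_file(root / "packages.json", data)
    # Assumed sidecar format: "<digest> packages.json", as for inventory.json.
    sidecar = f"{digest} packages.json\n".encode("utf-8")
    _replace_file(root / f"packages.json.{algorithm}", sidecar)
    return digest
```

Because packages.json is not versioned, the previous file is simply replaced and no new object version is created.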

packages.json example

This strategy replicates the manifest block of the inventory.json file, i.e. "digest": [ "filename" ], and then just specifies the order in the packages list for each version:

{
  "digestAlgorithm": "sha512",
  "type": "https://ocfl.io/1.1/spec/#packages",
  "manifest": {
    "abc..123": [ "v1/v1.zip" ],
    "cde..123": [ "v3/v3.z01" ],
    "ade..789": [ "v3/v3.z02" ],
    "ces..229": [ "v3/v3.zip" ]
  },
  "versions": {
    "v1": {
      "metadata": {
        "format": "zip",
        "version": "6.3.10",
        "extension": "[extension-name-ref]"
      },
      "packages": [ "v1/v1.zip" ]
    },
    "v3": {
      "metadata": {
        "format": "zip",
        "version": "6.3.10",
        "extension": "[extension-name-ref]"
      },
      "packages": [ "v3/v3.z01", "v3/v3.z02", "v3/v3.zip" ]
    }
  }
}

  • "manifest" - lists out all package files for all versions, this is done in an array to match the inventory.json file.
  • "format" - this links to a controlled vocabulary extension similar to the digest algorithm extension
  • "version" - the version of the format used to package up the content.
  • "extension" - the extension used to include more information about the package files, the extension must be a local extension in the object, the additional information goes in extension directory
  • "packages" - the list of package files for that version, in the order in which they should be unpackaged.
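The relationships between these keys can be checked with a few lines of code. The following Python sketch reports structural problems in a parsed packages.json. The function name `check_packages` and the wording of the messages are hypothetical, and the rules it checks are only those stated in these notes.

```python
def check_packages(packages):
    """Return a list of human-readable problems found in a parsed packages.json."""
    problems = []
    # Every package path listed in the manifest, across all versions.
    listed = {p for paths in packages.get("manifest", {}).values() for p in paths}
    referenced = set()
    for version, entry in packages.get("versions", {}).items():
        metadata = entry.get("metadata", {})
        # "format" and "version" are required; "extension" is optional.
        for key in ("format", "version"):
            if key not in metadata:
                problems.append(f"{version}: metadata is missing '{key}'")
        for path in entry.get("packages", []):
            referenced.add(path)
            if path not in listed:
                problems.append(f"{version}: {path} is not in the manifest")
            # Packages are stored in the version directory they belong to.
            if not path.startswith(version + "/"):
                problems.append(f"{version}: {path} is outside the version directory")
    for path in sorted(listed - referenced):
        problems.append(f"{path} is in the manifest but not used by any version")
    return problems
```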

The inventory.json file remains unchanged. Below is how the above packages.json example would appear on disk:

[object root]
├── 0=ocfl_object_1.1
├── inventory.json
├── inventory.json.sha512
├── packages.json
├── v1
│    └── v1.zip
├── v2
│    ├── inventory.json
│    ├── inventory.json.sha512
│    └── new-file.txt
└── v3
     ├── v3.zip 
     ├── v3.z02
     └── v3.z01
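The rule that a packaged version directory holds nothing but its package file(s) can be checked against a layout like the one above. The following Python sketch does so for a parsed packages.json. The function name `unexpected_files` is hypothetical.

```python
from pathlib import Path


def unexpected_files(object_root, packages):
    """List files in packaged version directories that are not declared packages."""
    root = Path(object_root)
    extras = []
    for version, entry in packages["versions"].items():
        allowed = set(entry["packages"])
        # Walk the whole version directory, including any subdirectories.
        for path in (root / version).rglob("*"):
            rel = path.relative_to(root).as_posix()
            if path.is_file() and rel not in allowed:
                extras.append(rel)
    return sorted(extras)
```

Unpackaged versions such as v2 in the layout above are not listed in packages.json, so they are not examined.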

Friday

Must be done by 5pm for Neil's meeting at 6pm.

Time Topic